Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Design Risks and Open Questions Register

Consolidated index of known design risks and open architectural questions for capOS. Every entry routes to the file that owns the long-form design or the remediation backlog for that risk; this register itself is a pointer document, not a place to put new design.

Use this document to answer “is this risk already tracked, and where?” without re-deriving the state from the proposal tree on each review.

Last refresh: 2026-04-29 11:52 UTC.

How To Use

  • Each design-risk row records the current observable state (what the code and docs say today), the owning tracker (the proposal/backlog/design file to update when the state changes), and the remaining gap (what is still open).
  • Each open-question row records a current answer if one exists in the tree, plus a pointer to the canonical tracker. Questions that are genuinely unanswered are marked Open; those should not be closed by guessing here – update the relevant proposal, then update this register.
  • When a risk is closed by code or by an explicit design decision, move the short closure summary into docs/changelog.md and remove the row.
  • New review findings still go into REVIEW_FINDINGS.md; this register is about long-horizon design risks, not concrete unresolved review issues.

Design Risks

R1 – Process-wide ring vs multi-threaded userspace and full SMP

  • State. The capability ring is one per process. capos-rt enforces a single-owner RuntimeRingClient. After in-process threading, at most one process-ring waiter is allowed. The first SMP Phase C AP scheduler-owner proof deliberately keeps process-wide ring execution on a single CPU at a time behind a scheduler-owner latch.
  • Owner. docs/proposals/ring-v2-smp-proposal.md, docs/research/completion-ring-threading.md, docs/backlog/smp-phase-c.md, docs/architecture/threading.md.
  • Gap. Per-thread capability rings, per-thread completion routing, and the Multi-Process / In-Process Threading Scalability milestones in docs/roadmap.md remain future work. Userspace threading scales only as far as the single ring waiter allows.

R2 – “Interface IS the permission” pushes safety into wrapper TCB

  • State. capOS deliberately has no parallel rights bitmask: attenuation is done by handing out a narrower CapObject wrapper, not a flag-reduced copy of the same cap. Wrapper correctness is therefore part of the trust base.
  • Owner. docs/capability-model.md, docs/proposals/session-bound-invocation-context-proposal.md, docs/security/trust-boundaries.md, docs/backlog/stage-6-capability-semantics.md.
  • Gap. The selected Session-Bound Invocation Context migration has the one-session-per-process proof, privacy-preserving endpoint caller-session metadata, explicit subject-disclosure coverage, chat session-keyed state, and terminal/stdio bridge liveness guards. Remaining cleanup is focused on peer-owned Adventure/shared-service migration, retiring remaining normal user-facing receiver-selector syntax, and final full-gate verification before treating the Tier-1 paper prerequisite as closed.

R3 – Legacy endpoint metadata as transitional service identity

  • State. Legacy endpoint receiver metadata is contained as internal transport/debug state for normal paths. Chat uses session-keyed membership, terminal/stdio bridges enforce live caller-session guards, and delegated relabeling containment plus the historical service-object routing/lifecycle proof have landed. Adventure/shared-service cleanup remains the visible tail of the selected migration.
  • Owner. docs/proposals/session-bound-invocation-context-proposal.md, docs/backlog/stage-6-capability-semantics.md.
  • Gap. Finish Adventure/shared-service migration and final legacy cleanup. Receiver metadata must remain internal transport state or hostile-test fixture, not subject identity or disclosure.

R4 – Resource accounting is fragmented

  • State. Per-process memory, cap-table, and thread quotas exist; ResourceProfile, session quotas, scheduling-context donation, and cross-service donation/fairness are still proposal-shaped.
  • Owner. docs/proposals/resource-accounting-proposal.md, docs/proposals/oom-and-swap-proposal.md, docs/proposals/user-identity-and-policy-proposal.md, docs/proposals/system-monitoring-proposal.md.
  • Gap. CPU accounting/scheduling contexts, log volume accounting, per-service fairness, donation semantics, and unified resource bundles for guest/anonymous/external/service principals are not implemented.

R5 – Copy-transfer SQE replay is repeatable by design

  • State. docs/authority-accounting-transfer-design.md documents that userspace replay of a copy-transfer SQE is repeatable per dispatch attempt, with move-transfer replay failing closed once the source slot is removed/reserved. Exactly-once replay suppression is explicitly future work (security invariant T3).
  • Owner. docs/authority-accounting-transfer-design.md, docs/proposals/security-and-verification-proposal.md.
  • Gap. The (sender_pid, call_id, sqe_seq) plus monotonic transfer-epoch identity needed for exactly-once replay across dispatch attempts is not implemented. Each transferable interface must continue to acknowledge this in its threat model.

R6 – CAP_OP_RELEASE is deferred / queued, not synchronous

  • State. Owned-handle drop in capos-rt queues one local CAP_OP_RELEASE on the ring; process exit performs fallback cleanup. Release does not run before the next ring flush (cap_enter or process exit).
  • Owner. docs/authority-accounting-transfer-design.md, docs/proposals/error-handling-proposal.md, docs/capability-model.md.
  • Gap. Resource-pressure or revocation-sensitive flows must not assume a Drop call has already taken effect at the kernel layer. Time-critical revocation should use CapabilityManager.revoke or epoch revocation rather than relying on Drop.

R7 – Shared memory / zero-copy / shared park are incomplete

  • State. MemoryObject substrate exists; SharedBuffer provenance, file/network/DMA zero-copy paths, and shared park/SharedParkSpace are blocked on mapping provenance / object pinning work.
  • Owner. docs/proposals/storage-and-naming-proposal.md, docs/proposals/networking-proposal.md, docs/architecture/park.md, docs/backlog/runtime-network-shell.md.
  • Gap. Workloads that need true zero-copy IPC, storage, or network pipelines pay a copy/serialization cost until provenance/pinning lands. ParkSpace private waiters on reusable unmapped addresses remain restricted by documentation until VM-unmap stale-key cleanup and tests land.

R8 – Networking lives inside the kernel TCB

  • State. virtio-net, ARP/ICMP, the smoltcp runtime, and the TCP CapObjects currently run in the kernel address space behind capability objects. The Telnet and SSH terminal-host proofs are built on this path.
  • Owner. docs/proposals/networking-proposal.md, docs/dma-isolation-design.md, docs/backlog/runtime-network-shell.md.
  • Gap. Userspace NIC driver and userspace TCP stack are Phase C / future work; until then the network stack temporarily expands the kernel TCB against the long-term service-decomposition direction.

R9 – DMA threat model assumes cooperative virtio

  • State. docs/dma-isolation-design.md records that the first virtio-net smoke uses kernel-owned bounce buffers, does not expose userspace DMA buffers or physical addresses, and explicitly assumes a non-hostile QEMU virtio device. Without an IOMMU a malicious bus master can DMA arbitrary RAM.
  • Owner. docs/dma-isolation-design.md, docs/proposals/networking-proposal.md, docs/proposals/cloud-deployment-proposal.md.
  • Gap. DMAPool / DeviceMmio / Interrupt userspace-driver gating, and IOMMU-backed isolation for production hardware, are not implemented.

R10 – Boot package model embeds all binaries

  • State. tools/mkmanifest embeds every declared binary as a NamedBlob inside manifest.bin. The kernel loads only init; everything else is fetched by init from the in-memory BootPackage.
  • Owner. docs/backlog/hardware-boot-storage.md, docs/proposals/storage-and-naming-proposal.md, docs/trusted-build-inputs.md.
  • Gap. Boot binary ISO layout (separate ELF payloads), package/storage update model, and persistent storage-backed delivery are not yet designed as code; the current scheme is an explicit prototype compromise.

R11 – Pre-auth and post-auth share a shell process

  • State. The shell-led boot flow folds console-login into capos-shell and uses an anonymous-first session that escalates via login/setup. The pre-auth and post-auth code paths run in one userspace process and address space.
  • Owner. docs/proposals/boot-to-shell-proposal.md, docs/proposals/shell-proposal.md, docs/security/trust-boundaries.md, docs/proposals/user-identity-and-policy-proposal.md.
  • Gap. Separation depends on shell/auth implementation quality, not on a process boundary. The future direction (separate login service with minimal authority, restricted launchers, WebShell/SshGateway) is proposal-shaped. Remote and non-loopback shells must remain blocked until pre-auth and post-auth authority are process-isolated or a shared-process proof is accepted.

R16 – Remote shell ingress is demo/prototype only

  • State. Telnet is a plaintext loopback-only QEMU demo. SSH has SSH-shaped prerequisites, fixture authentication proofs, dev key material, policy classification, and restricted-shell launcher coverage, but no production encrypted SSH transport, durable key/account storage, full OpenSSH-compatible userauth/channel handling, channel binding, or complete audit/storage gates.
  • Owner. docs/proposals/ssh-shell-proposal.md, docs/backlog/runtime-network-shell.md, WORKPLAN.md, docs/build-run-test.md.
  • Gap. Production/non-loopback shell exposure is blocked on SSH transport, key, account, audit, storage, session-bound delegation, and pre-auth/post-auth isolation gates.

R12 – Verification coverage is partial, not full proof

  • State. Bounded Kani gate (make kani-lib/make kani-lib-full), Loom ring model, Miri lib tests, proptest, fuzz harnesses, panic-surface inventory, and CI dependency policy exist. Coverage is not whole-system and not seL4-style functional refinement.
  • Owner. docs/proposals/security-and-verification-proposal.md, docs/security/verification-workflow.md, docs/panic-surface-inventory.md, docs/backlog/security-verification.md.
  • Gap. Public/external claims must distinguish “bounded model checked” from “fully verified”. Promote new properties into Kani/Loom only when the invariant is concrete and bounded.

R13 – Trusted build inputs are partly pinned

  • State. Limine (commit + artifact SHA-256), capnp 1.2.0 source tarball, CUE 0.16.0, mdBook/mdbook-mermaid, Typst 0.14.2, Cargo lockfiles, and the Kani toolchain bundle are pinned. The Rust nightly toolchain is a floating channel without a date or hash pin; xorriso, qemu-system-x86_64, and OVMF firmware are observed-not-pinned.
  • Owner. docs/trusted-build-inputs.md.
  • Gap. Reproducible-build pinning for the Rust nightly (date/hash), host ISO/firmware tools, and final ISO/payload checksums is unfinished; the document already lists these as open S.10.x gaps.

R14 – User identity / policy is proposal-shaped

  • State. Anonymous/operator sessions, password setup/login, broker-issued shell bundles, and redacted audit records exist. Durable accounts, ABAC/MAC context, OIDC/passkeys, disk-backed account stores, and resource bundles are proposal-shaped.
  • Owner. docs/proposals/user-identity-and-policy-proposal.md, docs/backlog/local-users-management.md, docs/proposals/oidc-and-oauth2-proposal.md, docs/proposals/certificates-and-tls-proposal.md, docs/proposals/cryptography-and-key-management-proposal.md.
  • Gap. Until durable identity / persistence / passkey paths land, capOS is not a complete multi-user OS. Demo claims must scope to the proven anonymous + operator + manifest-seeded local accounts model.

R15 – App exception serialization depends on result-buffer capacity

  • State. Application-level exceptions are serialized into the caller’s result buffer; if the target cannot be identified, invocation fails earlier with transport errors. Truncation/transport failures are documented.
  • Owner. docs/proposals/error-handling-proposal.md, docs/capability-model.md.
  • Gap. Service UX/debuggability can degrade for malformed or small-buffer clients. No remediation is required in code today, but each service contract should document its expected result-buffer capacity.

Open Design Questions

The following questions came up in external review. Each row gives the current best answer observed in the tree, the canonical tracker to update, and an explicit status.

Q1 – Cap’n Proto ABI compatibility policy

  • Current answer. Schema interface IDs / method IDs / struct layout are reviewed for stability via make generated-code-check and tools/check-generated-capnp.sh. There is no published ABI-compatibility guarantee window yet.
  • Tracker. docs/proposals/error-handling-proposal.md, docs/trusted-build-inputs.md, schema/capos.capnp.
  • Status. Open. A formal ABI-evolution policy (interface-id stability, reserved-field semantics, deprecation window) needs to be authored before external consumers depend on the schema.

Q2 – Ring v2 backward compatibility

  • Current answer. docs/proposals/ring-v2-smp-proposal.md treats per-thread ring ownership as the full-SMP target and frames it as an evolution that may need ABI changes; WORKPLAN.md calls runtime ring reactor work the compatibility bridge.
  • Tracker. docs/proposals/ring-v2-smp-proposal.md, docs/backlog/smp-phase-c.md.
  • Status. Open. Whether Ring v2 is backward-compatible with the process-wide ring or an explicit ABI break has not been decided.

Q3 – Which capabilities are copy-transferable vs move-only vs non-transferable

  • Current answer. docs/authority-accounting-transfer-design.md defines copy/move/none transfer modes and the accounting/rollback rules. Per-interface transfer mode is encoded on the schema-defined CapObject.
  • Tracker. docs/authority-accounting-transfer-design.md, schema/capos.capnp.
  • Status. Partial. The mode is enforced per object, but the user-visible matrix (which named caps are copy/move/none) is not consolidated in one document.

Q4 – Copy-transfer replay: feature or compromise

  • Current answer. Repeatable copy-transfer replay is documented as the current accepted semantics. Exactly-once replay suppression is future work. See R5.
  • Tracker. docs/authority-accounting-transfer-design.md.
  • Status. Decided as “current semantics, future tightening optional”.

Q5 – When legacy endpoint identity is replaced and what migrates

  • Current answer. docs/backlog/session-bound-invocation-context.md decomposes the selected migration: one immutable session context per process, privacy-preserving endpoint caller-session metadata, chat/adventure/stdio session-keyed migration, and legacy endpoint-identity cleanup. The old service-object identity plan is superseded.
  • Tracker. docs/proposals/session-bound-invocation-context-proposal.md, docs/backlog/session-bound-invocation-context.md, docs/backlog/stage-6-capability-semantics.md.
  • Status. Selected milestone. See R3.

Q6 – Minimum production TCB target

  • Current answer. docs/proposals/security-and-verification-proposal.md now enumerates the current demo/proof TCB and the target production TCB. Current proofs still trust kernel networking, init/supervisors, broker/session services, harnesses, and QEMU virtio. The target production TCB removes ordinary apps and shell children but still includes minimal init/supervisor, credential/session/broker/key/audit services, production device managers, and ABI/schema/build-signature inputs.
  • Tracker. docs/security/trust-boundaries.md, docs/proposals/userspace-authority-broker-proposal.md, docs/proposals/boot-to-shell-proposal.md.
  • Status. Partially answered. The TCB statement exists; reducing the actual implementation to that target and proving the non-loopback shell gates remains open.

Q7 – Revocation strategy

  • Current answer. Generation/epoch revocation exists for endpoint-backed caps; CapabilityManager.revoke cleans up endpoint-backed service objects by object behavior. Revocation trees, leases, and supervisor-owned-cap patterns are proposal-shaped.
  • Tracker. docs/proposals/service-architecture-proposal.md, docs/proposals/session-bound-invocation-context-proposal.md, docs/capability-model.md.
  • Status. Open. The chosen revocation primitive set (epochs vs trees vs leases vs explicit-revoke methods per object) needs an explicit decision.

Q8 – Boundary between kernel and service-level resource accounting

  • Current answer. Memory frame grants and cap-table slots are kernel accounting; storage/network buffer accounting is proposed at the service layer. The boundary is not yet implementation-driven.
  • Tracker. docs/proposals/resource-accounting-proposal.md, docs/proposals/storage-and-naming-proposal.md, docs/proposals/networking-proposal.md.
  • Status. Open.

Q9 – CPU accounting and scheduling contexts

  • Current answer. Round-robin is the current scheduler. Donation, priority inheritance, fixed budgets, and explicit scheduling capabilities are proposed but not implemented.
  • Tracker. docs/proposals/smp-proposal.md, docs/proposals/resource-accounting-proposal.md, docs/architecture/scheduling.md.
  • Status. Open. Pick a CPU accounting model before Multi-Process SMP Concurrency or In-Process Threading Scalability milestones land.

Q10 – IOMMU requirement for userspace networking

  • Current answer. docs/dma-isolation-design.md keeps the kernel-owned bounce-buffer model for QEMU virtio and explicitly defers an IOMMU requirement to the userspace driver gate.
  • Tracker. docs/dma-isolation-design.md, docs/proposals/networking-proposal.md, docs/proposals/cloud-deployment-proposal.md.
  • Status. Open. The production answer (IOMMU-required vs continued bounce buffers vs both) depends on platform support and will be decided at the userspace driver gate.

Q11 – Capability persistence model

  • Current answer. All capabilities are runtime-only today; sealed/stored caps and namespace-mediated reconstitution are storage-proposal scope.
  • Tracker. docs/proposals/storage-and-naming-proposal.md, docs/proposals/volume-encryption-proposal.md, docs/paper/plan.md (paper-scoped persistence Tier-1 prerequisite).
  • Status. Open.

Q12 – Least-privilege shell command invocation

  • Current answer. capos-shell runs commands using broker-issued bundles; the broker, not the shell, is the policy decision point. RestrictedShellLauncher keeps remote shell launches off raw spawn authority.
  • Tracker. docs/proposals/shell-proposal.md, docs/proposals/userspace-authority-broker-proposal.md, docs/proposals/boot-to-shell-proposal.md.
  • Status. Direction agreed, complete migration to broker-only authority for every shell-driven invocation is open.

Q13 – Formal properties to prove

  • Current answer. Existing bounded proofs cover cap-table non-forgery, frame-bitmap invariants, transfer rollback, and ring producer-consumer invariants. seL4-style full functional refinement is explicitly out of scope.
  • Tracker. docs/proposals/security-and-verification-proposal.md, docs/security/verification-workflow.md, docs/proposals/formal-mac-mic-proposal.md.
  • Status. Partially answered. A definitive list of “what we will keep proving” vs “what we will keep testing” should be added when the next Kani/Loom obligation set is concrete.

Q14 – Threat model coverage

  • Current answer. docs/proposals/security-and-verification-proposal.md now contains a threat actor matrix for local physical attackers, malicious DMA devices, malicious boot manifests, compromised init/supervisors, compromised narrow services, hostile network peers, and malicious build dependencies.
  • Tracker. docs/security/trust-boundaries.md, docs/proposals/security-and-verification-proposal.md, docs/dma-isolation-design.md, docs/trusted-build-inputs.md.
  • Status. Answered at design level. Remaining work is implementation/proof for the gates listed in REVIEW_FINDINGS.md.

Q15 – Language runtimes integration model

  • Current answer. capos-rt is the canonical no_std Rust runtime; Go, Lua, WASI, and POSIX adapters are proposal-shaped and currently framed as separate runtimes consuming generated capnp clients, not as a shared C/WASI ABI layer.
  • Tracker. docs/proposals/userspace-binaries-proposal.md, docs/proposals/go-runtime-proposal.md, docs/proposals/lua-scripting-proposal.md.
  • Status. Open. A common ABI layer vs per-runtime generated clients has not been decided; the current default is per-runtime clients.