Research: Capability-Based and Microkernel Operating Systems
Survey of existing systems to inform capOS design decisions across IPC, scheduling, capability model, persistence, VFS, and language support.
This survey records the cross-system design consequences; the research index lists the individual deep-dive reports. Read the consequences below first; open individual reports only when their design context is relevant.
Design consequences for capOS
- Keep the flat generation-tagged capability table; seL4-style CNode hierarchy is not needed until delegation patterns demand it.
- Treat the typed Cap’n Proto interface as the permission boundary; avoid a parallel rights-bit system that would drift from schema semantics.
- Continue the ring transport plus direct-handoff IPC path, with shared memory
reserved for bulk data once
SharedBuffer/MemoryObjectexists. - Treat seL4-style endpoint badges as historical receiver metadata, not as the active service identity model; use move/copy transfer descriptors, object-epoch revocation, and session-bound invocation context to make authority delegation explicit and reviewable.
- Model session lifetime as revocable liveness state plus grant leases, not as generic capability expiry. EROS/CapTP-style revocation-by-indirection and Genode-style session closure are better precedents than refreshing every old reference in place.
- Keep persistence explicit through Store/Namespace capabilities; do not adopt EROS-style transparent global checkpointing as a kernel baseline.
- Push POSIX compatibility and VFS behavior into libraries and services rather than adding a kernel global filesystem namespace.
- Add resource donation, scheduling-context donation, notification objects, and runtime/thread primitives only when the corresponding service or runtime path needs them.
- Use Pingora-style lifecycle frameworks only above the capability transport: userspace service libraries can provide phase hooks, per-request context, readiness, graceful shutdown, retry policy, and observability, while kernel interfaces remain narrow typed capabilities with explicit authority.
Individual deep-dive reports:
- seL4 – formal verification, CNode/CSpace, IPC fastpath, MCS scheduling
- Fuchsia/Zircon – handles with rights, channels, VMARs/VMOs, ports, FIDL vs Cap’n Proto
- Plan 9 / Inferno – per-process namespaces, 9P protocol, file-based vs capability-based interfaces
- EROS / CapROS / Coyotos – persistent capabilities, single-level store, checkpoint/restart
- Genode – session routing, VFS plugins, POSIX compat, resource trading, Sculpt OS
- LLVM target customization – target triples, TLS models, Go runtime requirements
- Linux sandboxes and virtualization for workloads – Linux namespaces, cgroup v2, seccomp, Landlock, bubblewrap, nsjail, systemd-nspawn, OCI runtimes and images, User-Mode Linux, gVisor, QEMU/KVM, Firecracker, Kata Containers, and capOS auto full-nohz interaction grounding for generic Linux workload execution, familiar user environments, and agent-initiated jobs
- Cap’n Proto error handling – protocol, schema, and Rust crate error behavior used by the capOS error model
- Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web – Cloudflare Workers, workerd, Durable Objects, Workers RPC, Cap’n Web, and production Cap’n Proto/KJ lessons for capOS remote-capability design
- Spritely, OCapN, and CapTP – object capability network protocols, netlayers, locators, sturdyrefs, Syrup, promise pipelining, distributed GC, and third-party handoffs
- Browser engines, document engines, and agent browsers – mainstream browser engine portability, cap-native document-engine substrate options, automation protocols, Donut Browser-style profile orchestration, and implications for visual and agent/shell browser capabilities
- OS error handling – error patterns in capability systems and microkernels used by the capOS error model
- IX-on-capOS hosting – clean integration of IX package/build model via MicroPython control plane, native template rendering, Store/Namespace, and build services
- Out-of-kernel scheduling – whether scheduler policy can move to user space, and which dispatch/enforcement mechanisms must stay in kernel
- Completion rings and threaded runtimes –
completion ownership,
io_uring, futex, and IOCP precedents for capOS’s full-SMP ring/threading ABI - x2APIC and APIC virtualization – x2APIC backend direction, QEMU/KVM validation constraints, and why the current xAPIC MMIO LAPIC path should remain the Phase C foundation
- IOMMU remapping – primary-source Intel VT-d, AMD-Vi, and QEMU grounding for future real DMA remapping work, while current capOS remains diagnostics-only with direct DMA blocked
- Cloud DMA provider evidence inventory – official AWS/Azure/GCP device-surface facts, the evidence-matrix schema, the live guest-probe checklist, and the fail-closed classification rules the cloud DMA backend decision consumes
- Future scheduler architecture – Linux CFS/EEVDF, SCHED_DEADLINE, sched_ext, FreeBSD ULE, seL4 MCS, ghOSt, scheduler activations, Shenango, Caladan, Shinjuku, and Arachne lessons for capOS per-CPU queues, CPU accounting, fair scheduling, scheduling contexts, CPU isolation leases, realtime islands, and user-space scheduler policy
- NO_HZ, SQPOLL, and realtime scheduling – Linux NO_HZ, clocksource/clockevent, CPU isolation/housekeeping, io_uring SQPOLL, SCHED_DEADLINE, PREEMPT_RT, and seL4 MCS grounding for capOS tickless idle, SQPOLL nohz, scheduling contexts, and realtime islands
- HPC parallel patterns – Berkeley dwarfs, NAS Parallel Benchmarks, HPL/LINPACK, HPCG, Graph500, MPI collectives, and OpenMP loop/task/reduction grounding for future single-node and multi-node parallel benchmark coverage
- Scientific agent-lab software stack – PARI/GP, SageMath, GAP, Singular, OSCAR, SymPy, SciPy, R, Octave, JupyterLab, Z3, cvc5, HiGHS, SCIP, OR-Tools, JuMP, CVXPY, Lean/mathlib, Rocq, Isabelle, Agda, Spack, Guix-HPC, Nix, and Apptainer grounding for a future capOS scientific standard package and LLM agent research lab
- Pingora – phase-oriented service framework design, operational lifecycle, pooling/retry lessons, and why capOS should borrow the userspace library shape without importing Pingora’s HTTP or process model. The concrete capOS follow-up is capos-service, starting with terminal/networking lifecycle rather than HTTP.
- Multimedia pipeline latency – PipeWire and JACK lessons for a capOS media graph optimized for the minimal possible guaranteed-stable stack latency, explicit latency ranges, admitted realtime islands, and xrun/deadline telemetry
- Realtime multimodal agent APIs – OpenAI Realtime, Google AI Gemini Live API, and Vertex AI Live API implications for capOS voice agent-shell, realtime media sessions, tool-call gating, and provider adapters
- Hosted agent harnesses – OpenClaw-like harness controls, hosted agent swarms, LLM-maintained wiki memory, schema-guided reasoning, MCP/A2A-style adapters, and implications for capability-scoped capOS agent services
- Game mechanics prior art – Stardew Valley, EVE Online, and Evil Islands mechanics translated into capability-shaped Aurelian Frontier calendar, market, construction, and combat tasks
- Robotics realtime control – ROS 2, micro-ROS, ros2_control, seL4 MCS, PREEMPT_RT, Xenomai, Orocos, Nav2, PX4, ArduPilot, Autoware, and OPC UA lessons for using capOS as a robot brain with explicit actuator authority and admitted realtime islands
Cross-Cutting Analysis
1. Capability Table Design
All surveyed systems store capabilities as process-local references to kernel objects. The key design variable is how capabilities are organized.
| System | Structure | Lookup | Delegation | Revocation |
|---|---|---|---|---|
| seL4 | Tree of CNodes (power-of-2 arrays with guard bits) | O(depth) | Subtree (grant CNode cap) | CDT (derivation tree), transitive |
| Zircon | Flat per-process handle table | O(1) | Transfer through channels (move) | Close handle; refcount; no propagation |
| EROS | 32-slot nodes forming trees | O(depth) | Node key passing | Forwarder keys (O(1) rescind) |
| Genode | Kernel-enforced capability references | O(1) | Parent-mediated session routing | Session close |
| capOS | Flat table with generation-tagged CapId, hold-edge metadata, and Arc<dyn CapObject> backing | O(1) | Manifest exports plus copy/move transfer descriptors through Endpoint IPC | Local release/process exit, object-epoch revocation for child-local grants, and target session liveness/grant-lease checks |
Recommendation for capOS: Keep the flat table. It is simpler than seL4’s CNode tree and sufficient for capOS’s use cases. Augment each entry with:
- Hold-edge metadata – transfer scope, disclosure scope, object id/epoch, and any legacy receiver metadata needed for transport compatibility.
- Generation counter (from Zircon) – upper bits of CapId detect stale references after a slot is reused. (Implemented.)
- Epoch (from EROS) – per-object revocation epoch. Incrementing the epoch invalidates all outstanding references. O(1) revoke, O(1) check.
- Session/grant lease reference (from Genode/EROS/CapTP-style lifecycle lessons) – a future pointer to mutable liveness or grant state so logout, renewal, and revocation do not require scanning all cap tables or relabeling a running process.
Not adopted: per-entry rights bitmask. Zircon and seL4 use rights bitmasks
(READ/WRITE/EXECUTE) because their handle/syscall interfaces are untyped.
capOS uses Cap’n Proto typed interfaces where the schema defines what methods
exist. Method-level access control is the interface itself – to restrict what
a caller can do, grant a narrower capability (a wrapper CapObject that
exposes fewer methods). A parallel rights system would create an impedance
mismatch: generic flags (READ/WRITE) mapped arbitrarily onto typed methods.
Meta-rights for the capability reference itself (TRANSFER/DUPLICATE) may be
added when Stage 6 IPC needs them. See Capability Model
for the full rationale.
2. IPC Design
IPC is the most performance-critical kernel mechanism. Every capability invocation across processes goes through it.
| System | Model | Latency (round-trip) | Bulk data | Async |
|---|---|---|---|---|
| seL4 | Synchronous endpoint, direct context switch | ~240 cycles (ARM), ~400 cycles (x86) | Shared memory (explicit) | Notification objects (bitmask signal/wait) |
| Zircon | Channels (async message queue, 64KiB + 64 handles) | ~3000-5000 cycles | VMOs (shared memory) | Ports (signal-based notification) |
| EROS | Synchronous domain call | ~2x L4 | Through address space nodes | None (synchronous only) |
| Plan 9 | 9P over pipes (kernel-mediated) | ~5000+ cycles | Large reads/writes (iounit) | None (blocking per-fid) |
| Genode | RPC objects with session routing | Varies by kernel (uses seL4/NOVA/Linux underneath) | Shared-memory dataspaces | Signal capabilities |
Recommendation for capOS: Continue the dual-path IPC design:
Fast synchronous path (seL4-inspired, for RPC):
- When process A calls a capability in process B and B is blocked waiting, perform a direct context switch (A -> kernel -> B, no unrelated scheduler pick). The current single-CPU direct handoff is implemented.
- Future fastpath work can transfer small messages (<64 bytes) through registers during the switch instead of copying through ring buffers.
Async submission/completion rings (io_uring-inspired, for batching):
- SQ/CQ in shared memory for batched capability invocations. This is the current transport for CALL/RECV/RETURN/RELEASE/NOP.
- Support SQE chaining for Cap’n Proto promise pipelining.
- Use Spritely/OCapN CapTP as the prior-art shape for remote capability sessions, third-party handoffs, answer namespaces, and distributed reference-release accounting, but do not treat current OCapN drafts as a frozen capOS ABI.
- Signal/notification delivery through CQ entries (from Zircon ports).
- User-queued CQ entries for userspace event loop integration.
Bulk data (Zircon/Genode-inspired):
SharedBuffercapability for zero-copy data transfer between processes.- Capnp messages for control plane; shared memory for data plane.
- Critical for file I/O, networking, and GPU rendering.
3. Memory Management Capabilities
Zircon’s VMO/VMAR model is the most mature capability-based memory design. The Go runtime proposal shows why these primitives are essential.
VirtualMemory capability (baseline implemented; still central for Go and advanced allocators):
interface VirtualMemory {
map @0 (hint :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
unmap @1 (addr :UInt64, size :UInt64) -> ();
protect @2 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
}
MemoryObject capability (needed for IPC bulk data, shared libraries).
Zircon calls this concept a VMO (Virtual Memory Object); capOS uses the name
SharedBuffer – see docs/proposals/storage-and-naming-proposal.md for the canonical
interface definition.
interface MemoryObject {
read @0 (offset :UInt64, count :UInt64) -> (data :Data);
write @1 (offset :UInt64, data :Data) -> ();
getSize @2 () -> (size :UInt64);
createChild @3 (offset :UInt64, size :UInt64, options :UInt32)
-> (child :MemoryObject);
}
4. Scheduling
| System | Model | Priority inversion solution | Temporal isolation |
|---|---|---|---|
| seL4 (MCS) | Scheduling Contexts (budget/period/priority) + Reply Objects | SC donation through IPC (caller’s SC transfers to callee) | Yes (budget enforcement per SC) |
| Zircon | Fair scheduler with profiles (deadline, capacity, period) | Kernel-managed priority inheritance | Profiles provide some isolation |
| Genode | Delegated to underlying kernel (seL4/NOVA/Linux) | Depends on kernel | Depends on kernel |
| Out-of-kernel policy | Kernel dispatch/enforcement + user-space policy service | Scheduling-context donation through IPC | Kernel-enforced budgets, user-chosen policy |
| User-space runtimes | M:N work stealing, fibers, async tasks over kernel threads | Requires futexes, runtime cooperation, and OS-visible blocking events | Usually runtime-local only |
Recommendation for capOS: Start with round-robin (already done). When implementing priority scheduling:
- Add scheduling context donation for synchronous IPC: when process A calls process B, B inherits A’s priority and budget. Prevents inversion through the capability graph.
- Support passive servers (from seL4 MCS): servers without their own scheduling context that only run when called, using the caller’s budget. Natural fit for capOS’s service architecture.
- Add temporal isolation (budget/period per scheduling context) for the cloud deployment scenario.
For moving scheduler policy out of the kernel, see Out-of-kernel scheduling. The key finding is a split between kernel dispatch/enforcement and user-space policy: dispatch, budget enforcement, and emergency fallback remain privileged, while admission control, budgets, priorities, CPU masks, and SQPOLL/core grants can be represented as policy managed by a scheduler service. Thread creation, thread handles, scheduling contexts, and park authority should be capability-based from the start; the remaining research task is measurement: compare generic capnp/ring calls against compact capability-authorized park-shaped operations before deciding the park hot-path encoding.
5. Persistence
| System | Model | Consistency | Application effort |
|---|---|---|---|
| EROS/CapROS | Transparent global checkpoint (single-level store) | Strong (global snapshot) | None (automatic) |
| Plan 9 | User-mode file servers with explicit writes | Per-file server | Full (explicit save/load) |
| Genode | Application-level (services manage own persistence) | Per-component | Full |
| capOS (planned) | Content-addressed Store + Namespace caps | Per-service | Full (explicit capnp serialize) |
Recommendation for capOS: Three phases, as informed by EROS:
- Explicit persistence (current plan) – services serialize state to the Store capability as capnp messages. Simple, gives services control.
- Opt-in Checkpoint capability – kernel captures process state (registers, memory, cap table) as capnp messages stored in the Store. Enables process migration and crash recovery for services that opt in.
- Coordinated checkpointing – a coordinator service orchestrates consistent snapshots across multiple services.
Persistent capability references (from EROS + Cap’n Proto):
struct PersistentCapRef {
interfaceId @0 :UInt64;
objectId @1 :UInt64;
permissions @2 :UInt32;
epoch @3 :UInt64;
}
Do NOT implement EROS-style transparent global persistence. The kernel complexity is enormous, debuggability is poor, and Cap’n Proto’s zero-copy serialization already provides near-equivalent benefits for explicit persistence.
6. Namespace and VFS
Plan 9’s per-process namespace is the closest analog to capOS’s per-process
capability table. The key insight: Plan 9’s bind/mount with union
semantics provides composability that capOS’s current Namespace design lacks.
Recommendation: Extend Namespace with union composition:
enum UnionMode { replace @0; before @1; after @2; }
interface Namespace {
resolve @0 (name :Text) -> (hash :Data);
bind @1 (name :Text, hash :Data) -> ();
list @2 () -> (names :List(Text));
sub @3 (prefix :Text) -> (ns :Namespace);
union @4 (other :Namespace, mode :UnionMode) -> (merged :Namespace);
}
VFS as a library (from Genode): libcapos-posix should be an in-process
library that translates POSIX calls to capability invocations. Each POSIX
process receives a declarative mount table (capnp struct) mapping paths to
capabilities. No VFS server needed.
FileServer capability (from Plan 9): For resources that are naturally
file-like (config trees, debug introspection, /proc-style interfaces),
provide a FileServer interface. Not universal (as in Plan 9) but available
where the file metaphor fits.
7. Resource Accounting
Genode’s session quota model addresses a gap in capOS: without resource accounting, a malicious client can exhaust a server’s memory by creating many sessions.
Recommendation: Session-creating capability methods should accept a resource donation parameter:
interface NetworkManager {
createTcpSocket @0 (bufferPages :UInt32) -> (socket :TcpSocket);
}
The client donates buffer memory as part of the session creation. The server allocates from donated resources, not its own.
8. Language Support Roadmap
From the LLVM research, the recommended order:
| Step | What | Blocks |
|---|---|---|
| 1 | Custom target JSON (x86_64-unknown-capos) | Done for booted userspace crates |
| 2 | VirtualMemory capability | Done for baseline map/unmap/protect; Go allocator glue remains |
| 3 | TLS support (PT_TLS parsing, FS base save/restore) | Done for static ELF processes and current-process ThreadControl; per-thread TLS remains |
| 4 | Park authority capability + measured ABI | Go threads, pthreads |
| 5 | Timer capability (monotonic clock) | Done for monotonic now/sleep; wall-clock and event timers remain future work |
| 6 | Go Phase 1: minimal GOOS=capos (single-threaded) | Runtime capability checkpoint done; Go fork remains |
| 7 | Kernel threading | Go GOMAXPROCS>1 |
| 8 | C toolchain + libcapos | C programs, musl |
| 9 | Go Phase 2: multi-threaded + concurrent GC | Go network services |
| 10 | Go Phase 3: network poller | net/http on capOS |
Key decisions:
- Keep
x86_64-unknown-nonefor kernel,x86_64-unknown-caposfor userspace. - Use
local-execTLS model (static linking, no dynamic linker). - Implement park as capability-authorized from the start. Because it operates on memory addresses and must be fast, measure generic capnp/ring calls against a compact capability-authorized operation before fixing the ABI.
- Go can start with cooperative-only preemption (no signals).
Recommendations by Roadmap Stage
Stage 5: Scheduling
| Source | Recommendation | Priority |
|---|---|---|
| Zircon | Generation counter in CapId (stale reference detection) | Done |
| seL4 | Add notification objects (lightweight bitmask signal/wait) | Medium |
| LLVM | Custom target JSON for userspace (x86_64-unknown-capos) | Done |
| LLVM | Per-thread TLS state for Go/threading | Medium |
Stage 6: IPC and Capability Transfer
| Source | Recommendation | Priority |
|---|---|---|
| seL4 | Direct-switch IPC for synchronous cross-process calls | Done baseline |
| seL4 | Badge field on capability entries for server-visible caller identity | Historical / rejected as service identity; see Rejected: Endpoint Badges as Service Identity |
| Zircon | Move semantics for capability transfer through IPC | Done |
| Zircon | MemoryObject capability (shared memory for bulk data) | Done baseline |
| EROS | Epoch-based revocation (O(1) revoke, O(1) check) | High |
| Zircon | Sideband capability-transfer descriptors and result-cap records | Done baseline |
| Genode | SharedBuffer capability for data-plane transfers | High |
| Plan 9 | Promise pipelining (SQE chaining in async rings) | Medium |
| Genode | Session quotas / resource donation on session creation | Medium |
| seL4 | Scheduling context donation through IPC | Medium |
| Plan 9 | Namespace union composition (before/after/replace) | Low |
Post-Stage 6 / Future
| Source | Recommendation | Priority |
|---|---|---|
| seL4 | MCS scheduling (passive servers, temporal isolation) | When needed |
| EROS | Opt-in Checkpoint capability for process persistence | When needed |
| Genode | Dynamic manifest reconfiguration at runtime | When needed |
| Plan 9 | exportfs-pattern capability proxy for network transparency | When needed |
| EROS | PersistentCapRef struct in capnp for storing capability graphs | When needed |
| seL4 | Rust-native formal verification (track Verus/Prusti) | Long-term |
Design Decisions Validated
Several capOS design choices are validated by this research:
-
Cap’n Proto as the universal wire format. Superior to FIDL (random access, zero-copy, promise pipelining, persistence-ready). The right choice. See Zircon Section 5.
-
Flat capability table. Simpler than seL4’s CNode tree, sufficient for capOS. Only add complexity (CNode-like hierarchy) if delegation patterns demand it. See seL4 Section 4.
-
No ambient authority. Every surveyed capability OS confirms this is essential. EROS proved confinement. seL4 proved integrity. capOS has this by design.
-
Explicit persistence over transparent. EROS’s single-level store is elegant but the kernel complexity is enormous. Cap’n Proto zero-copy gives most of the benefits. See EROS, CapROS, Coyotos Section 6.
-
io_uring-inspired async rings. Better than Zircon’s port model for capOS (operation-based > notification-based). See Zircon Section 4.
-
VFS as library, not kernel feature. Genode’s approach, matched by capOS’s planned
libcapos-posix. See Genode Section 3. -
No fork(). Genode has operated without fork() for 15+ years, proving it unnecessary. See Genode Section 4.
Design Gaps Identified
-
Bulk data path is only a substrate. Copying capnp messages through the kernel works for control but not for file/network/GPU data. MemoryObject now provides the mapped-frame substrate; service-facing SharedBuffer APIs remain future Stage 6+ work.
-
Resource accounting is partially unified. The authority-accounting design exists, and VirtualMemory plus FrameAllocator/MemoryObject frame grants now charge the process
ResourceLedger::frame_grant_pagescounter. Future shared-buffer, DMA, log-volume, and CPU-budget resources still need the same treatment. -
No notification primitive. seL4 notifications (lightweight bitmask signal/wait) are needed for interrupt delivery and event notification without full capnp message overhead.
-
No per-thread TLS object yet. Static ELF TLS, context-switch FS-base state, and current-process ThreadControl exist, but future user threads still need independently settable FS bases per thread.
References
See individual deep-dive reports for full reference lists. Key primary sources:
- Klein et al., “seL4: Formal Verification of an OS Kernel,” SOSP 2009
- Lyons et al., “Scheduling-context capabilities,” EuroSys 2018
- Shapiro et al., “EROS: A Fast Capability System,” SOSP 1999
- Shapiro & Weber, “Verifying the EROS Confinement Mechanism,” IEEE S&P 2000
- Pike et al., “The Use of Name Spaces in Plan 9,” OSR 1993
- Feske, “Genode Foundations” (genode.org/documentation)
- Fuchsia Zircon kernel documentation (fuchsia.dev)