Research: Capability-Based and Microkernel Operating Systems
Survey of existing systems to inform capOS design decisions across IPC, scheduling, capability model, persistence, VFS, and language support.
Design consequences for capOS
- Keep the flat generation-tagged capability table; seL4-style CNode hierarchy is not needed until delegation patterns demand it.
- Treat the typed Cap’n Proto interface as the permission boundary; avoid a parallel rights-bit system that would drift from schema semantics.
- Continue the ring transport plus direct-handoff IPC path, with shared memory reserved for bulk data once SharedBuffer/MemoryObject exists.
- Use badge metadata, move/copy transfer descriptors, and future epoch revocation to make authority delegation explicit and reviewable.
- Keep persistence explicit through Store/Namespace capabilities; do not adopt EROS-style transparent global checkpointing as a kernel baseline.
- Push POSIX compatibility and VFS behavior into libraries and services rather than adding a kernel global filesystem namespace.
- Add resource donation, scheduling-context donation, notification objects, and runtime/thread primitives only when the corresponding service or runtime path needs them.
Individual deep-dive reports:
- seL4 – formal verification, CNode/CSpace, IPC fastpath, MCS scheduling
- Fuchsia/Zircon – handles with rights, channels, VMARs/VMOs, ports, FIDL vs Cap’n Proto
- Plan 9 / Inferno – per-process namespaces, 9P protocol, file-based vs capability-based interfaces
- EROS / CapROS / Coyotos – persistent capabilities, single-level store, checkpoint/restart
- Genode – session routing, VFS plugins, POSIX compat, resource trading, Sculpt OS
- LLVM target customization – target triples, TLS models, Go runtime requirements
- Cap’n Proto error handling – protocol, schema, and Rust crate error behavior used by the capOS error model
- OS error handling – error patterns in capability systems and microkernels used by the capOS error model
- IX-on-capOS hosting – clean integration of IX package/build model via MicroPython control plane, native template rendering, Store/Namespace, and build services
- Out-of-kernel scheduling – whether scheduler policy can move to user space, and which dispatch/enforcement mechanisms must stay in kernel
Cross-Cutting Analysis
1. Capability Table Design
All surveyed systems store capabilities as process-local references to kernel objects. The key design variable is how capabilities are organized.
| System | Structure | Lookup | Delegation | Revocation |
|---|---|---|---|---|
| seL4 | Tree of CNodes (power-of-2 arrays with guard bits) | O(depth) | Subtree (grant CNode cap) | CDT (derivation tree), transitive |
| Zircon | Flat per-process handle table | O(1) | Transfer through channels (move) | Close handle; refcount; no propagation |
| EROS | 32-slot nodes forming trees | O(depth) | Node key passing | Forwarder keys (O(1) rescind) |
| Genode | Kernel-enforced capability references | O(1) | Parent-mediated session routing | Session close |
| capOS | Flat table with generation-tagged CapId, hold-edge metadata, and Arc<dyn CapObject> backing | O(1) | Manifest exports plus copy/move transfer descriptors through Endpoint IPC | Local release/process exit; epoch revocation not yet |
Recommendation for capOS: Keep the flat table. It is simpler than seL4’s CNode tree and sufficient for capOS’s use cases. Augment each entry with:
- Badge (from seL4) – u64 value delivered to the server on invocation, allowing a server to distinguish callers without separate capability objects.
- Generation counter (from Zircon) – upper bits of CapId detect stale references after a slot is reused. (Implemented.)
- Epoch (from EROS) – per-object revocation epoch. Incrementing the epoch invalidates all outstanding references. O(1) revoke, O(1) check.
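The three mechanisms compose naturally in a single table entry. The following Rust sketch is illustrative only; the names (`CapId`, `Entry`, `CapTable`) and the 16-bit generation split are assumptions, not capOS's actual layout:

```rust
const GEN_BITS: u32 = 16;
const INDEX_MASK: u64 = (1u64 << (64 - GEN_BITS)) - 1;

/// CapId packs a generation counter in the upper bits and a slot index below.
#[derive(Clone, Copy, PartialEq, Debug)]
struct CapId(u64);

impl CapId {
    fn new(index: u64, generation: u64) -> Self {
        CapId((generation << (64 - GEN_BITS)) | (index & INDEX_MASK))
    }
    fn index(self) -> usize { (self.0 & INDEX_MASK) as usize }
    fn generation(self) -> u64 { self.0 >> (64 - GEN_BITS) }
}

struct Entry {
    generation: u64, // bumped each time the slot is reused
    badge: u64,      // delivered to the server on invocation
    epoch: u64,      // snapshot of the object's epoch at grant time
}

struct Object {
    epoch: u64, // incremented to revoke all outstanding references
}

struct CapTable {
    slots: Vec<Option<Entry>>,
}

impl CapTable {
    /// O(1) lookup: a stale generation or a revoked epoch both fail.
    fn lookup(&self, id: CapId, obj: &Object) -> Option<&Entry> {
        let entry = self.slots.get(id.index())?.as_ref()?;
        if entry.generation != id.generation() { return None; } // slot reused
        if entry.epoch != obj.epoch { return None; }            // revoked
        Some(entry)
    }
}
```

Both failure modes cost one comparison each, which is what makes epoch revocation O(1) per check with no propagation walk.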
Not adopted: per-entry rights bitmask. Zircon and seL4 use rights bitmasks
(READ/WRITE/EXECUTE) because their handle/syscall interfaces are untyped.
capOS uses Cap’n Proto typed interfaces where the schema defines what methods
exist. Method-level access control is the interface itself – to restrict what
a caller can do, grant a narrower capability (a wrapper CapObject that
exposes fewer methods). A parallel rights system would create an impedance
mismatch: generic flags (READ/WRITE) mapped arbitrarily onto typed methods.
Meta-rights for the capability reference itself (TRANSFER/DUPLICATE) may be
added when Stage 6 IPC needs them. See capability-model.md
for the full rationale.
2. IPC Design
IPC is the most performance-critical kernel mechanism. Every capability invocation across processes goes through it.
| System | Model | Latency (round-trip) | Bulk data | Async |
|---|---|---|---|---|
| seL4 | Synchronous endpoint, direct context switch | ~240 cycles (ARM), ~400 cycles (x86) | Shared memory (explicit) | Notification objects (bitmask signal/wait) |
| Zircon | Channels (async message queue, 64KiB + 64 handles) | ~3000-5000 cycles | VMOs (shared memory) | Ports (signal-based notification) |
| EROS | Synchronous domain call | ~2x L4 | Through address space nodes | None (synchronous only) |
| Plan 9 | 9P over pipes (kernel-mediated) | ~5000+ cycles | Large reads/writes (iounit) | None (blocking per-fid) |
| Genode | RPC objects with session routing | Varies by kernel (uses seL4/NOVA/Linux underneath) | Shared-memory dataspaces | Signal capabilities |
Recommendation for capOS: Continue the dual-path IPC design:
Fast synchronous path (seL4-inspired, for RPC):
- When process A calls a capability in process B and B is blocked waiting, perform a direct context switch (A -> kernel -> B, no unrelated scheduler pick). The current single-CPU direct handoff is implemented.
- Future fastpath work can transfer small messages (<64 bytes) through registers during the switch instead of copying through ring buffers.
Async submission/completion rings (io_uring-inspired, for batching):
- SQ/CQ in shared memory for batched capability invocations. This is the current transport for CALL/RECV/RETURN/RELEASE/NOP.
- Support SQE chaining for Cap’n Proto promise pipelining.
- Signal/notification delivery through CQ entries (from Zircon ports).
- User-queued CQ entries for userspace event loop integration.
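The submission/completion split can be pictured with a minimal single-producer/single-consumer ring. This is a toy model of the shape described above, not the capOS ABI: field names, entry sizes, and the `FLAG_LINK` chaining bit are assumptions, and the shared-memory atomics are omitted:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
#[repr(u8)]
enum Opcode { Nop = 0, Call = 1, Recv = 2, Return = 3, Release = 4 }

const FLAG_LINK: u8 = 1 << 0; // chain next SQE (promise pipelining)

/// Submission queue entry: one capability invocation.
#[derive(Clone, Copy)]
struct Sqe {
    opcode: Opcode,
    flags: u8,
    cap: u64,       // generation-tagged CapId being invoked
    user_data: u64, // echoed back in the matching CQE
}

/// Completion queue entry; also the vehicle for signal delivery.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Cqe {
    user_data: u64,
    status: i32,
}

/// SPSC ring indices as they would live in shared memory.
struct Ring<T: Copy> {
    entries: Vec<Option<T>>,
    head: usize, // consumer position
    tail: usize, // producer position
}

impl<T: Copy> Ring<T> {
    fn new(size: usize) -> Self {
        Ring { entries: vec![None; size], head: 0, tail: 0 }
    }
    fn push(&mut self, e: T) -> bool {
        if self.tail - self.head == self.entries.len() { return false; } // full
        let n = self.entries.len();
        self.entries[self.tail % n] = Some(e);
        self.tail += 1;
        true
    }
    fn pop(&mut self) -> Option<T> {
        if self.head == self.tail { return None; } // empty
        let n = self.entries.len();
        let e = self.entries[self.head % n].take();
        self.head += 1;
        e
    }
}
```

`user_data` correlation is what lets one CQ serve RPC replies, chained pipeline results, and Zircon-style signal notifications through the same drain loop.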
Bulk data (Zircon/Genode-inspired):
- SharedBuffer capability for zero-copy data transfer between processes.
- Capnp messages for control plane; shared memory for data plane.
- Critical for file I/O, networking, and GPU rendering.
3. Memory Management Capabilities
Zircon’s VMO/VMAR model is the most mature capability-based memory design. The Go runtime proposal shows why these primitives are essential.
VirtualMemory capability (baseline implemented; still central for Go and advanced allocators):
    interface VirtualMemory {
      map @0 (hint :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
      unmap @1 (addr :UInt64, size :UInt64) -> ();
      protect @2 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
    }
MemoryObject capability (needed for IPC bulk data, shared libraries).
Zircon calls this concept a VMO (Virtual Memory Object); capOS uses the name
SharedBuffer – see docs/proposals/storage-and-naming-proposal.md for the canonical
interface definition.
    interface MemoryObject {
      read @0 (offset :UInt64, count :UInt64) -> (data :Data);
      write @1 (offset :UInt64, data :Data) -> ();
      getSize @2 () -> (size :UInt64);
      createChild @3 (offset :UInt64, size :UInt64, options :UInt32)
          -> (child :MemoryObject);
    }
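The `createChild` operation is the interesting part: a child is a window into its parent's pages, which is how authority gets narrowed without rights bits. The toy Rust model below illustrates that semantics under stated simplifications: a `Vec<u8>` stands in for physical pages, and options/COW behavior are omitted:

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Toy MemoryObject: shared backing plus a (offset, size) window.
#[derive(Clone)]
struct MemoryObject {
    backing: Rc<RefCell<Vec<u8>>>,
    offset: usize,
    size: usize,
}

impl MemoryObject {
    fn new(size: usize) -> Self {
        MemoryObject { backing: Rc::new(RefCell::new(vec![0; size])), offset: 0, size }
    }
    fn write(&self, offset: usize, data: &[u8]) {
        assert!(offset + data.len() <= self.size, "out of bounds");
        let start = self.offset + offset;
        self.backing.borrow_mut()[start..start + data.len()].copy_from_slice(data);
    }
    fn read(&self, offset: usize, count: usize) -> Vec<u8> {
        assert!(offset + count <= self.size, "out of bounds");
        let start = self.offset + offset;
        self.backing.borrow()[start..start + count].to_vec()
    }
    /// A child shares the parent's backing but can only reach its window --
    /// the narrowing pattern used instead of per-entry rights bits.
    fn create_child(&self, offset: usize, size: usize) -> MemoryObject {
        assert!(offset + size <= self.size, "child exceeds parent");
        MemoryObject { backing: self.backing.clone(), offset: self.offset + offset, size }
    }
    fn get_size(&self) -> usize { self.size }
}
```

A server handed only a child capability can corrupt at most its own window, which is the property the bulk-data path relies on.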
4. Scheduling
| System | Model | Priority inversion solution | Temporal isolation |
|---|---|---|---|
| seL4 (MCS) | Scheduling Contexts (budget/period/priority) + Reply Objects | SC donation through IPC (caller’s SC transfers to callee) | Yes (budget enforcement per SC) |
| Zircon | Fair scheduler with profiles (deadline, capacity, period) | Kernel-managed priority inheritance | Profiles provide some isolation |
| Genode | Delegated to underlying kernel (seL4/NOVA/Linux) | Depends on kernel | Depends on kernel |
| Out-of-kernel policy | Kernel dispatch/enforcement + user-space policy service | Scheduling-context donation through IPC | Kernel-enforced budgets, user-chosen policy |
| User-space runtimes | M:N work stealing, fibers, async tasks over kernel threads | Requires futexes, runtime cooperation, and OS-visible blocking events | Usually runtime-local only |
Recommendation for capOS: Start with round-robin (already done). When implementing priority scheduling:
- Add scheduling context donation for synchronous IPC: when process A calls process B, B inherits A’s priority and budget. Prevents inversion through the capability graph.
- Support passive servers (from seL4 MCS): servers without their own scheduling context that only run when called, using the caller’s budget. Natural fit for capOS’s service architecture.
- Add temporal isolation (budget/period per scheduling context) for the cloud deployment scenario.
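Donation for passive servers reduces to moving one piece of state across the IPC boundary. A hedged sketch of the seL4 MCS pattern described above, with illustrative names (`SchedContext`, `Thread`) that are not capOS's:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct SchedContext {
    budget_us: u64,
    period_us: u64,
    priority: u8,
}

struct Thread {
    name: &'static str,
    // Passive servers carry no scheduling context of their own.
    sc: Option<SchedContext>,
}

/// On a synchronous call, the caller's SC moves to the callee, so the callee
/// runs on the caller's budget and priority (no inversion through the
/// capability graph).
fn ipc_call(caller: &mut Thread, callee: &mut Thread) {
    assert!(callee.sc.is_none(), "donation sketch assumes a passive server");
    callee.sc = caller.sc.take();
}

/// On reply, the SC moves back and the server returns to passive.
fn ipc_reply(caller: &mut Thread, callee: &mut Thread) {
    caller.sc = callee.sc.take();
}
```

Because the server never accumulates its own budget, a misbehaving client can only burn its own time slice, which is the temporal-isolation property the cloud scenario needs.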
For moving scheduler policy out of the kernel, see Out-of-kernel scheduling. The key finding is a split between kernel dispatch/enforcement and user-space policy: dispatch, budget enforcement, and emergency fallback remain privileged, while admission control, budgets, priorities, CPU masks, and SQPOLL/core grants can be represented as policy managed by a scheduler service. Thread creation, thread handles, scheduling contexts, and futex authority should be capability-based from the start; the remaining research task is measurement: compare generic capnp/ring calls against compact capability-authorized futex-shaped operations before deciding the futex hot-path encoding.
5. Persistence
| System | Model | Consistency | Application effort |
|---|---|---|---|
| EROS/CapROS | Transparent global checkpoint (single-level store) | Strong (global snapshot) | None (automatic) |
| Plan 9 | User-mode file servers with explicit writes | Per-file server | Full (explicit save/load) |
| Genode | Application-level (services manage own persistence) | Per-component | Full |
| capOS (planned) | Content-addressed Store + Namespace caps | Per-service | Full (explicit capnp serialize) |
Recommendation for capOS: Three phases, as informed by EROS:
- Explicit persistence (current plan) – services serialize state to the Store capability as capnp messages. Simple, gives services control.
- Opt-in Checkpoint capability – kernel captures process state (registers, memory, cap table) as capnp messages stored in the Store. Enables process migration and crash recovery for services that opt in.
- Coordinated checkpointing – a coordinator service orchestrates consistent snapshots across multiple services.
Persistent capability references (from EROS + Cap’n Proto):
    struct PersistentCapRef {
      interfaceId @0 :UInt64;
      objectId @1 :UInt64;
      permissions @2 :UInt32;
      epoch @3 :UInt64;
    }
Do NOT implement EROS-style transparent global persistence. The kernel complexity is enormous, debuggability is poor, and Cap’n Proto’s zero-copy serialization already provides near-equivalent benefits for explicit persistence.
6. Namespace and VFS
Plan 9’s per-process namespace is the closest analog to capOS’s per-process
capability table. The key insight: Plan 9’s bind/mount with union
semantics provides composability that capOS’s current Namespace design lacks.
Recommendation: Extend Namespace with union composition:
    enum UnionMode { replace @0; before @1; after @2; }

    interface Namespace {
      resolve @0 (name :Text) -> (hash :Data);
      bind @1 (name :Text, hash :Data) -> ();
      list @2 () -> (names :List(Text));
      sub @3 (prefix :Text) -> (ns :Namespace);
      union @4 (other :Namespace, mode :UnionMode) -> (merged :Namespace);
    }
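The resolution order implied by `UnionMode` can be modeled as an ordered stack of layers consulted front to back, first hit wins. In this sketch plain `HashMap`s stand in for the real Namespace capability, and all names are illustrative:

```rust
use std::collections::HashMap;

type Hash = [u8; 32]; // content hash, as returned by resolve()

#[derive(Clone)]
struct Layer(HashMap<String, Hash>);

/// A merged namespace is an ordered list of layers; the front layer shadows
/// the ones behind it (Plan 9 union-directory semantics).
struct UnionNs {
    layers: Vec<Layer>,
}

enum UnionMode { Replace, Before, After }

impl UnionNs {
    fn resolve(&self, name: &str) -> Option<Hash> {
        self.layers.iter().find_map(|l| l.0.get(name).copied())
    }
    fn union(&self, other: &Layer, mode: UnionMode) -> UnionNs {
        let layers = match mode {
            UnionMode::Replace => vec![other.clone()],
            UnionMode::Before => {
                // `other` shadows the existing layers
                let mut v = vec![other.clone()];
                v.extend(self.layers.iter().cloned());
                v
            }
            UnionMode::After => {
                // existing layers shadow `other`
                let mut v = self.layers.clone();
                v.push(other.clone());
                v
            }
        };
        UnionNs { layers }
    }
}
```

`before` is the Plan 9 idiom for overriding a binary or config name; `after` supplies fallback defaults without disturbing existing bindings.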
VFS as a library (from Genode): libcapos-posix should be an in-process
library that translates POSIX calls to capability invocations. Each POSIX
process receives a declarative mount table (capnp struct) mapping paths to
capabilities. No VFS server needed.
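The core of such a library is longest-prefix matching a POSIX path against the mount table to find the backing capability and the path remainder. A minimal sketch, assuming a flat table of prefix-to-CapId rows (the `Mount` struct and `resolve` helper are hypothetical, not libcapos-posix API):

```rust
/// One row of the per-process mount table (would arrive as a capnp struct).
struct Mount {
    prefix: &'static str,
    cap: u64, // CapId of the capability serving this subtree
}

/// Longest-prefix match: "/data/logs/x" under mounts "/" and "/data"
/// resolves to the "/data" capability with remainder "logs/x".
fn resolve<'a>(table: &[Mount], path: &'a str) -> Option<(u64, &'a str)> {
    table
        .iter()
        .filter(|m| {
            // exact match, or prefix followed by a path separator
            path == m.prefix
                || (path.starts_with(m.prefix)
                    && (m.prefix.ends_with('/')
                        || path.as_bytes()[m.prefix.len()] == b'/'))
        })
        .max_by_key(|m| m.prefix.len())
        .map(|m| (m.cap, path[m.prefix.len()..].trim_start_matches('/')))
}
```

Everything after the match is a capability invocation on the resolved object, so no central VFS server sits on the data path.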
FileServer capability (from Plan 9): For resources that are naturally
file-like (config trees, debug introspection, /proc-style interfaces),
provide a FileServer interface. Not universal (as in Plan 9) but available
where the file metaphor fits.
7. Resource Accounting
Genode’s session quota model addresses a gap in capOS: without resource accounting, a malicious client can exhaust a server’s memory by creating many sessions.
Recommendation: Session-creating capability methods should accept a resource donation parameter:
    interface NetworkManager {
      createTcpSocket @0 (bufferPages :UInt32) -> (socket :TcpSocket);
    }
The client donates buffer memory as part of the session creation. The server allocates from donated resources, not its own.
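On the server side this reduces to per-session quota bookkeeping: every allocation draws from the donated amount and fails rather than dipping into the server's own heap. A minimal sketch with illustrative names (`Session`, `alloc_pages`):

```rust
/// Per-session ledger of donated resources.
struct Session {
    donated_pages: u32, // pages the client handed over at session creation
    used_pages: u32,
}

impl Session {
    fn new(donated_pages: u32) -> Self {
        Session { donated_pages, used_pages: 0 }
    }
    /// Allocate from the donated quota only; a greedy client exhausts its
    /// own donation, never the server's memory.
    fn alloc_pages(&mut self, n: u32) -> Result<(), &'static str> {
        if self.used_pages + n > self.donated_pages {
            return Err("session quota exhausted");
        }
        self.used_pages += n;
        Ok(())
    }
    fn free_pages(&mut self, n: u32) {
        self.used_pages = self.used_pages.saturating_sub(n);
    }
}
```

Genode additionally lets a server request a quota upgrade mid-session; the same ledger extends to that by raising `donated_pages` when the client tops up.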
8. Language Support Roadmap
From the LLVM research, the recommended order:
| Step | What | Blocks |
|---|---|---|
| 1 | Custom target JSON (x86_64-unknown-capos) | Optional build hygiene; not required by current no_std runtime |
| 2 | VirtualMemory capability | Done for baseline map/unmap/protect; Go allocator glue remains |
| 3 | TLS support (PT_TLS parsing, FS base save/restore) | Done for static ELF processes; runtime-controlled FS base remains |
| 4 | Futex authority capability + measured ABI | Go threads, pthreads |
| 5 | Timer capability (monotonic clock) | Go scheduler |
| 6 | Go Phase 1: minimal GOOS=capos (single-threaded) | CUE on capOS |
| 7 | Kernel threading | Go GOMAXPROCS>1 |
| 8 | C toolchain + libcapos | C programs, musl |
| 9 | Go Phase 2: multi-threaded + concurrent GC | Go network services |
| 10 | Go Phase 3: network poller | net/http on capOS |
Key decisions:
- Keep x86_64-unknown-none for the kernel and x86_64-unknown-capos for userspace.
- Use the local-exec TLS model (static linking, no dynamic linker).
- Implement futex as capability-authorized from the start. Because it operates on memory addresses and must be fast, measure generic capnp/ring calls against a compact capability-authorized operation before fixing the ABI.
- Go can start with cooperative-only preemption (no signals).
Recommendations by Roadmap Stage
Stage 5: Scheduling
| Source | Recommendation | Priority |
|---|---|---|
| Zircon | Generation counter in CapId (stale reference detection) | Done |
| seL4 | Add notification objects (lightweight bitmask signal/wait) | Medium |
| LLVM | Custom target JSON for userspace (x86_64-unknown-capos) | Medium |
| LLVM | Runtime-controlled FS-base operation for Go/threading | Medium |
Stage 6: IPC and Capability Transfer
| Source | Recommendation | Priority |
|---|---|---|
| seL4 | Direct-switch IPC for synchronous cross-process calls | Done baseline |
| seL4 | Badge field on capability entries (server identifies callers) | Done |
| Zircon | Move semantics for capability transfer through IPC | Done |
| Zircon | MemoryObject capability (shared memory for bulk data) | High |
| EROS | Epoch-based revocation (O(1) revoke, O(1) check) | High |
| Zircon | Sideband capability-transfer descriptors and result-cap records | Done baseline |
| Genode | SharedBuffer capability for data-plane transfers | High |
| Plan 9 | Promise pipelining (SQE chaining in async rings) | Medium |
| Genode | Session quotas / resource donation on session creation | Medium |
| seL4 | Scheduling context donation through IPC | Medium |
| Plan 9 | Namespace union composition (before/after/replace) | Low |
Post-Stage 6 / Future
| Source | Recommendation | Priority |
|---|---|---|
| seL4 | MCS scheduling (passive servers, temporal isolation) | When needed |
| EROS | Opt-in Checkpoint capability for process persistence | When needed |
| Genode | Dynamic manifest reconfiguration at runtime | When needed |
| Plan 9 | exportfs-pattern capability proxy for network transparency | When needed |
| EROS | PersistentCapRef struct in capnp for storing capability graphs | When needed |
| seL4 | Rust-native formal verification (track Verus/Prusti) | Long-term |
Design Decisions Validated
Several capOS design choices are validated by this research:
- Cap’n Proto as the universal wire format. Superior to FIDL (random access, zero-copy, promise pipelining, persistence-ready). The right choice. See zircon.md Section 5.
- Flat capability table. Simpler than seL4’s CNode tree, sufficient for capOS. Only add complexity (CNode-like hierarchy) if delegation patterns demand it. See sel4.md Section 4.
- No ambient authority. Every surveyed capability OS confirms this is essential. EROS proved confinement. seL4 proved integrity. capOS has this by design.
- Explicit persistence over transparent. EROS’s single-level store is elegant but the kernel complexity is enormous. Cap’n Proto zero-copy gives most of the benefits. See eros-capros-coyotos.md Section 6.
- io_uring-inspired async rings. Better than Zircon’s port model for capOS (operation-based > notification-based). See zircon.md Section 4.
- VFS as library, not kernel feature. Genode’s approach, matched by capOS’s planned libcapos-posix. See genode.md Section 3.
- No fork(). Genode has operated without fork() for 15+ years, proving it unnecessary. See genode.md Section 4.
Design Gaps Identified
- No bulk data path. Copying capnp messages through the kernel works for control but not for file/network/GPU data. A SharedBuffer or MemoryObject capability is essential for Stage 6+.
- Resource accounting is fragmented. The authority-accounting design exists and current code has several local ledgers, but VirtualMemory, FrameAllocator, and process/resource quotas are not yet unified.
- No notification primitive. seL4 notifications (lightweight bitmask signal/wait) are needed for interrupt delivery and event notification without full capnp message overhead.
- No runtime-controlled FS-base/thread TLS API. Static ELF TLS and context-switch FS-base state exist, but Go and future user threads still need a way to set the FS base per thread.
References
See individual deep-dive reports for full reference lists. Key primary sources:
- Klein et al., “seL4: Formal Verification of an OS Kernel,” SOSP 2009
- Lyons et al., “Scheduling-context capabilities,” EuroSys 2018
- Shapiro et al., “EROS: A Fast Capability System,” SOSP 1999
- Shapiro & Weber, “Verifying the EROS Confinement Mechanism,” IEEE S&P 2000
- Pike et al., “The Use of Name Spaces in Plan 9,” OSR 1993
- Feske, “Genode Foundations” (genode.org/documentation)
- Fuchsia Zircon kernel documentation (fuchsia.dev)