Research: Capability-Based and Microkernel Operating Systems

Survey of existing systems to inform capOS design decisions across IPC, scheduling, capability model, persistence, VFS, and language support.

Design consequences for capOS

  • Keep the flat generation-tagged capability table; seL4-style CNode hierarchy is not needed until delegation patterns demand it.
  • Treat the typed Cap’n Proto interface as the permission boundary; avoid a parallel rights-bit system that would drift from schema semantics.
  • Continue the ring transport plus direct-handoff IPC path, with shared memory reserved for bulk data once SharedBuffer/MemoryObject exists.
  • Use badge metadata, move/copy transfer descriptors, and future epoch revocation to make authority delegation explicit and reviewable.
  • Keep persistence explicit through Store/Namespace capabilities; do not adopt EROS-style transparent global checkpointing as a kernel baseline.
  • Push POSIX compatibility and VFS behavior into libraries and services rather than adding a kernel global filesystem namespace.
  • Add resource donation, scheduling-context donation, notification objects, and runtime/thread primitives only when the corresponding service or runtime path needs them.

Individual deep-dive reports:

  • seL4 – formal verification, CNode/CSpace, IPC fastpath, MCS scheduling
  • Fuchsia/Zircon – handles with rights, channels, VMARs/VMOs, ports, FIDL vs Cap’n Proto
  • Plan 9 / Inferno – per-process namespaces, 9P protocol, file-based vs capability-based interfaces
  • EROS / CapROS / Coyotos – persistent capabilities, single-level store, checkpoint/restart
  • Genode – session routing, VFS plugins, POSIX compat, resource trading, Sculpt OS
  • LLVM target customization – target triples, TLS models, Go runtime requirements
  • Cap’n Proto error handling – protocol, schema, and Rust crate error behavior used by the capOS error model
  • OS error handling – error patterns in capability systems and microkernels used by the capOS error model
  • IX-on-capOS hosting – clean integration of IX package/build model via MicroPython control plane, native template rendering, Store/Namespace, and build services
  • Out-of-kernel scheduling – whether scheduler policy can move to user space, and which dispatch/enforcement mechanisms must stay in kernel

Cross-Cutting Analysis

1. Capability Table Design

All surveyed systems store capabilities as process-local references to kernel objects. The key design variable is how capabilities are organized.

| System | Structure | Lookup | Delegation | Revocation |
|---|---|---|---|---|
| seL4 | Tree of CNodes (power-of-2 arrays with guard bits) | O(depth) | Subtree (grant CNode cap) | CDT (derivation tree), transitive |
| Zircon | Flat per-process handle table | O(1) | Transfer through channels (move) | Close handle; refcount; no propagation |
| EROS | 32-slot nodes forming trees | O(depth) | Node key passing | Forwarder keys (O(1) rescind) |
| Genode | Kernel-enforced capability references | O(1) | Parent-mediated session routing | Session close |
| capOS | Flat table with generation-tagged CapId, hold-edge metadata, and Arc<dyn CapObject> backing | O(1) | Manifest exports plus copy/move transfer descriptors through Endpoint IPC | Local release/process exit; epoch revocation not yet implemented |

Recommendation for capOS: Keep the flat table. It is simpler than seL4’s CNode tree and sufficient for capOS’s use cases. Augment each entry with:

  1. Badge (from seL4) – u64 value delivered to the server on invocation, allowing a server to distinguish callers without separate capability objects.
  2. Generation counter (from Zircon) – upper bits of CapId detect stale references after a slot is reused. (Implemented.)
  3. Epoch (from EROS) – per-object revocation epoch. Incrementing the epoch invalidates all outstanding references. O(1) revoke, O(1) check.
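A minimal sketch of how these three mechanisms compose in a single table entry. All names here (`CapId`, `CapEntry`, `CapTable`) are illustrative, not the capOS API, and the object epoch is stored per entry for simplicity rather than read from a shared per-object counter:

```rust
// Illustrative capability table with generation tags, badges, and epochs.
// Not the capOS implementation; a sketch of the three augmentations above.

const GEN_BITS: u32 = 16;
const INDEX_MASK: u64 = (1u64 << (64 - GEN_BITS)) - 1;

#[derive(Clone, Copy, PartialEq, Debug)]
struct CapId(u64);

impl CapId {
    fn new(index: u64, generation: u64) -> Self {
        CapId((generation << (64 - GEN_BITS)) | (index & INDEX_MASK))
    }
    fn index(self) -> usize { (self.0 & INDEX_MASK) as usize }
    fn generation(self) -> u64 { self.0 >> (64 - GEN_BITS) }
}

struct CapEntry {
    generation: u64,     // bumped each time the slot is reused
    badge: u64,          // delivered to the server on invocation
    epoch_at_grant: u64, // epoch observed when the capability was granted
    object_epoch: u64,   // current epoch of the backing object (simplified: stored inline)
    live: bool,
}

struct CapTable { slots: Vec<CapEntry> }

impl CapTable {
    // O(1) lookup: slot index, then two O(1) validity checks.
    fn lookup(&self, id: CapId) -> Option<&CapEntry> {
        let e = self.slots.get(id.index())?;
        // A stale CapId (slot reused) or a bumped epoch (revoked) both fail.
        if !e.live || e.generation != id.generation() || e.epoch_at_grant != e.object_epoch {
            return None;
        }
        Some(e)
    }
}

fn main() {
    let mut table = CapTable { slots: vec![CapEntry {
        generation: 3, badge: 42, epoch_at_grant: 0, object_epoch: 0, live: true,
    }]};
    let id = CapId::new(0, 3);
    assert!(table.lookup(id).is_some());

    // O(1) revocation: bump the object's epoch; every outstanding reference dies.
    table.slots[0].object_epoch += 1;
    assert!(table.lookup(id).is_none());

    // Stale reference after slot reuse: generation mismatch.
    let stale = CapId::new(0, 2);
    assert!(table.lookup(stale).is_none());
}
```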

Not adopted: per-entry rights bitmask. Zircon and seL4 use rights bitmasks (READ/WRITE/EXECUTE) because their handle/syscall interfaces are untyped. capOS uses Cap’n Proto typed interfaces where the schema defines what methods exist. Method-level access control is the interface itself – to restrict what a caller can do, grant a narrower capability (a wrapper CapObject that exposes fewer methods). A parallel rights system would create an impedance mismatch: generic flags (READ/WRITE) mapped arbitrarily onto typed methods. Meta-rights for the capability reference itself (TRANSFER/DUPLICATE) may be added when Stage 6 IPC needs them. See capability-model.md for the full rationale.
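To make the wrapper idea concrete, here is a hedged Rust sketch of narrowing authority by handing out a trait object that exposes fewer methods. `KvRead`, `KvFull`, and `MemKv` are invented names for illustration, not capOS interfaces:

```rust
// Sketch: restricting what a caller can do by granting a narrower
// capability (a smaller interface), not by attaching rights bits.
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

trait KvRead { fn get(&self, key: &str) -> Option<String>; }
trait KvFull: KvRead { fn put(&self, key: &str, value: &str); }

struct MemKv(Mutex<HashMap<String, String>>);

impl KvRead for MemKv {
    fn get(&self, key: &str) -> Option<String> {
        self.0.lock().unwrap().get(key).cloned()
    }
}
impl KvFull for MemKv {
    fn put(&self, key: &str, value: &str) {
        self.0.lock().unwrap().insert(key.into(), value.into());
    }
}

// Granting the narrower capability is just handing out the narrower
// trait object: the method set itself is the permission boundary.
fn narrow(full: Arc<MemKv>) -> Arc<dyn KvRead> { full }

fn main() {
    let full = Arc::new(MemKv(Mutex::new(HashMap::new())));
    full.put("k", "v");
    let ro: Arc<dyn KvRead> = narrow(full.clone());
    assert_eq!(ro.get("k").as_deref(), Some("v"));
    // `ro` has no `put` method at all: nothing to check at invocation time.
}
```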

2. IPC Design

IPC is the most performance-critical kernel mechanism. Every capability invocation across processes goes through it.

| System | Model | Latency (round-trip) | Bulk data | Async |
|---|---|---|---|---|
| seL4 | Synchronous endpoint, direct context switch | ~240 cycles (ARM), ~400 cycles (x86) | Shared memory (explicit) | Notification objects (bitmask signal/wait) |
| Zircon | Channels (async message queue, 64 KiB + 64 handles) | ~3000-5000 cycles | VMOs (shared memory) | Ports (signal-based notification) |
| EROS | Synchronous domain call | ~2x L4 | Through address-space nodes | None (synchronous only) |
| Plan 9 | 9P over pipes (kernel-mediated) | ~5000+ cycles | Large reads/writes (iounit) | None (blocking per fid) |
| Genode | RPC objects with session routing | Varies by kernel (uses seL4/NOVA/Linux underneath) | Shared-memory dataspaces | Signal capabilities |

Recommendation for capOS: Continue the dual-path IPC design:

Fast synchronous path (seL4-inspired, for RPC):

  • When process A calls a capability in process B and B is blocked waiting, perform a direct context switch (A -> kernel -> B, no unrelated scheduler pick). The current single-CPU direct handoff is implemented.
  • Future fastpath work can transfer small messages (<64 bytes) through registers during the switch instead of copying through ring buffers.

Async submission/completion rings (io_uring-inspired, for batching):

  • SQ/CQ in shared memory for batched capability invocations. This is the current transport for CALL/RECV/RETURN/RELEASE/NOP.
  • Support SQE chaining for Cap’n Proto promise pipelining.
  • Signal/notification delivery through CQ entries (from Zircon ports).
  • User-queued CQ entries for userspace event loop integration.
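The SQ/CQ flow above can be sketched with a toy in-memory ring pair. The opcodes, entry layouts, and ring mechanics here are made up for the example; they are not the capOS ABI, and a real implementation would use shared memory with atomic head/tail indices:

```rust
// Toy submission/completion ring pair in the io_uring style.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Op { Call, Recv, Return, Release, Nop }

#[derive(Clone, Copy)]
struct Sqe { op: Op, cap_id: u64, user_data: u64 }

#[derive(Clone, Copy)]
struct Cqe { user_data: u64, status: i32 }

struct Ring<T: Copy> {
    entries: Vec<Option<T>>,
    head: usize, // consumer position (monotonic)
    tail: usize, // producer position (monotonic)
}

impl<T: Copy> Ring<T> {
    fn new(size: usize) -> Self {
        Ring { entries: vec![None; size], head: 0, tail: 0 }
    }
    fn push(&mut self, e: T) -> bool {
        if self.tail - self.head == self.entries.len() { return false; } // full
        let len = self.entries.len();
        self.entries[self.tail % len] = Some(e);
        self.tail += 1;
        true
    }
    fn pop(&mut self) -> Option<T> {
        if self.head == self.tail { return None; } // empty
        let len = self.entries.len();
        let e = self.entries[self.head % len].take();
        self.head += 1;
        e
    }
}

fn main() {
    let mut sq = Ring::new(8);
    let mut cq = Ring::new(8);
    // Userspace submits a batched CALL; the kernel drains the SQ...
    assert!(sq.push(Sqe { op: Op::Call, cap_id: 1, user_data: 0xAB }));
    let sqe = sq.pop().unwrap();
    // ...and posts a completion carrying the caller's user_data back.
    cq.push(Cqe { user_data: sqe.user_data, status: 0 });
    assert_eq!(cq.pop().unwrap().user_data, 0xAB);
}
```

The `user_data` field is what lets a userspace event loop match completions to submissions without any ordering guarantee between them.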

Bulk data (Zircon/Genode-inspired):

  • SharedBuffer capability for zero-copy data transfer between processes.
  • Capnp messages for control plane; shared memory for data plane.
  • Critical for file I/O, networking, and GPU rendering.

3. Memory Management Capabilities

Zircon’s VMO/VMAR model is the most mature capability-based memory design. The Go runtime proposal shows why these primitives are essential.

VirtualMemory capability (baseline implemented; still central for Go and advanced allocators):

interface VirtualMemory {
    map @0 (hint :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
    unmap @1 (addr :UInt64, size :UInt64) -> ();
    protect @2 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
}

MemoryObject capability (needed for IPC bulk data, shared libraries). Zircon calls this concept a VMO (Virtual Memory Object); capOS uses the name SharedBuffer – see docs/proposals/storage-and-naming-proposal.md for the canonical interface definition.

interface MemoryObject {
    read @0 (offset :UInt64, count :UInt64) -> (data :Data);
    write @1 (offset :UInt64, data :Data) -> ();
    getSize @2 () -> (size :UInt64);
    createChild @3 (offset :UInt64, size :UInt64, options :UInt32)
        -> (child :MemoryObject);
}

4. Scheduling

| System | Model | Priority-inversion solution | Temporal isolation |
|---|---|---|---|
| seL4 (MCS) | Scheduling contexts (budget/period/priority) + reply objects | SC donation through IPC (caller’s SC transfers to callee) | Yes (budget enforcement per SC) |
| Zircon | Fair scheduler with profiles (deadline, capacity, period) | Kernel-managed priority inheritance | Profiles provide some isolation |
| Genode | Delegated to underlying kernel (seL4/NOVA/Linux) | Depends on kernel | Depends on kernel |
| Out-of-kernel policy | Kernel dispatch/enforcement + user-space policy service | Scheduling-context donation through IPC | Kernel-enforced budgets, user-chosen policy |
| User-space runtimes | M:N work stealing, fibers, async tasks over kernel threads | Requires futexes, runtime cooperation, and OS-visible blocking events | Usually runtime-local only |

Recommendation for capOS: Start with round-robin (already done). When implementing priority scheduling:

  1. Add scheduling context donation for synchronous IPC: when process A calls process B, B inherits A’s priority and budget. Prevents inversion through the capability graph.
  2. Support passive servers (from seL4 MCS): servers without their own scheduling context that only run when called, using the caller’s budget. Natural fit for capOS’s service architecture.
  3. Add temporal isolation (budget/period per scheduling context) for the cloud deployment scenario.
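Scheduling-context donation and passive servers can be sketched as follows, assuming a simple priority/budget pair per context. The names and the donation mechanics are illustrative, loosely following the seL4 MCS model described above:

```rust
// Sketch of scheduling-context (SC) donation on synchronous IPC.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SchedContext { priority: u8, budget_us: u64 }

struct Thread {
    own: Option<SchedContext>,     // None => passive server
    donated: Option<SchedContext>, // present while serving a call
}

impl Thread {
    // The context a thread actually runs on: its donor's during a call,
    // else its own. A passive server with no caller is simply not runnable.
    fn effective(&self) -> Option<SchedContext> {
        self.donated.or(self.own)
    }
}

fn call(caller: &Thread, server: &mut Thread) {
    // On the IPC send, the caller's context travels with the call,
    // so the server runs at the caller's priority and on its budget.
    server.donated = caller.effective();
}

fn reply(server: &mut Thread) {
    // On reply, the donated context returns to the caller.
    server.donated = None;
}

fn main() {
    let client = Thread { own: Some(SchedContext { priority: 10, budget_us: 500 }), donated: None };
    let mut server = Thread { own: None, donated: None }; // passive server
    assert_eq!(server.effective(), None); // not runnable until called
    call(&client, &mut server);
    assert_eq!(server.effective().unwrap().priority, 10); // runs on client's SC
    reply(&mut server);
    assert_eq!(server.effective(), None);
}
```

Because the server never runs on its own budget, a low-priority client cannot starve a high-priority one by keeping the server busy: each call is accounted to its caller.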

For moving scheduler policy out of the kernel, see Out-of-kernel scheduling. The key finding is a split between kernel dispatch/enforcement and user-space policy: dispatch, budget enforcement, and emergency fallback remain privileged, while admission control, budgets, priorities, CPU masks, and SQPOLL/core grants can be represented as policy managed by a scheduler service. Thread creation, thread handles, scheduling contexts, and futex authority should be capability-based from the start; the remaining research task is measurement: compare generic capnp/ring calls against compact capability-authorized futex-shaped operations before deciding the futex hot-path encoding.

5. Persistence

| System | Model | Consistency | Application effort |
|---|---|---|---|
| EROS/CapROS | Transparent global checkpoint (single-level store) | Strong (global snapshot) | None (automatic) |
| Plan 9 | User-mode file servers with explicit writes | Per file server | Full (explicit save/load) |
| Genode | Application-level (services manage own persistence) | Per component | Full |
| capOS (planned) | Content-addressed Store + Namespace caps | Per service | Full (explicit capnp serialize) |

Recommendation for capOS: Three phases, as informed by EROS:

  1. Explicit persistence (current plan) – services serialize state to the Store capability as capnp messages. Simple, gives services control.
  2. Opt-in Checkpoint capability – kernel captures process state (registers, memory, cap table) as capnp messages stored in the Store. Enables process migration and crash recovery for services that opt in.
  3. Coordinated checkpointing – a coordinator service orchestrates consistent snapshots across multiple services.

Persistent capability references (from EROS + Cap’n Proto):

struct PersistentCapRef {
    interfaceId @0 :UInt64;
    objectId @1 :UInt64;
    permissions @2 :UInt32;
    epoch @3 :UInt64;
}

Do NOT implement EROS-style transparent global persistence. The kernel complexity is enormous, debuggability is poor, and Cap’n Proto’s zero-copy serialization already provides near-equivalent benefits for explicit persistence.

6. Namespace and VFS

Plan 9’s per-process namespace is the closest analog to capOS’s per-process capability table. The key insight: Plan 9’s bind/mount with union semantics provides composability that capOS’s current Namespace design lacks.

Recommendation: Extend Namespace with union composition:

enum UnionMode { replace @0; before @1; after @2; }

interface Namespace {
    resolve @0 (name :Text) -> (hash :Data);
    bind @1 (name :Text, hash :Data) -> ();
    list @2 () -> (names :List(Text));
    sub @3 (prefix :Text) -> (ns :Namespace);
    union @4 (other :Namespace, mode :UnionMode) -> (merged :Namespace);
}
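The union semantics can be modeled as an ordered list of layers searched front to back. This Rust sketch of `bind`/`resolve`/`union` is illustrative only; the real Namespace would return capabilities, not raw hashes:

```rust
// Toy model of Plan 9-style union namespaces: resolution walks an
// ordered layer list and returns the first binding found.
use std::collections::HashMap;

#[derive(Clone, Copy)]
enum UnionMode { Replace, Before, After }

#[derive(Clone, Default)]
struct Namespace { layers: Vec<HashMap<String, Vec<u8>>> }

impl Namespace {
    fn bind(&mut self, name: &str, hash: &[u8]) {
        if self.layers.is_empty() { self.layers.push(HashMap::new()); }
        self.layers[0].insert(name.to_string(), hash.to_vec());
    }
    fn resolve(&self, name: &str) -> Option<&Vec<u8>> {
        self.layers.iter().find_map(|l| l.get(name))
    }
    fn union(&self, other: &Namespace, mode: UnionMode) -> Namespace {
        let layers = match mode {
            UnionMode::Replace => other.layers.clone(),
            UnionMode::Before => { // `other` shadows `self`
                let mut v = other.layers.clone();
                v.extend(self.layers.clone());
                v
            }
            UnionMode::After => { // `self` shadows `other`
                let mut v = self.layers.clone();
                v.extend(other.layers.clone());
                v
            }
        };
        Namespace { layers }
    }
}

fn main() {
    let mut base = Namespace::default();
    base.bind("cfg", b"aaaa");
    let mut overlay = Namespace::default();
    overlay.bind("cfg", b"bbbb");
    let merged = base.union(&overlay, UnionMode::Before);
    assert_eq!(merged.resolve("cfg").unwrap().as_slice(), b"bbbb"); // overlay wins
    let merged = base.union(&overlay, UnionMode::After);
    assert_eq!(merged.resolve("cfg").unwrap().as_slice(), b"aaaa"); // base wins
}
```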

VFS as a library (from Genode): libcapos-posix should be an in-process library that translates POSIX calls to capability invocations. Each POSIX process receives a declarative mount table (capnp struct) mapping paths to capabilities. No VFS server needed.
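One way such a mount table could drive path translation is longest-prefix matching: find the most specific mount covering the path, then resolve the remainder against that capability. `MountEntry` and `lookup` below are assumed names for illustration, not the libcapos-posix API:

```rust
// Illustrative longest-prefix path-to-capability translation.
struct MountEntry { prefix: &'static str, cap_id: u64 }

fn lookup<'a>(table: &'a [MountEntry], path: &str) -> Option<(&'a MountEntry, String)> {
    table.iter()
        // An entry covers the path if it matches exactly or is a parent directory.
        .filter(|e| path == e.prefix
            || path.starts_with(&format!("{}/", e.prefix.trim_end_matches('/'))))
        // The longest matching prefix is the most specific mount.
        .max_by_key(|e| e.prefix.len())
        // Return the entry plus the path remainder to resolve inside it.
        .map(|e| (e, path[e.prefix.len()..].trim_start_matches('/').to_string()))
}

fn main() {
    let table = [
        MountEntry { prefix: "/", cap_id: 1 },
        MountEntry { prefix: "/etc", cap_id: 2 },
    ];
    let (entry, rest) = lookup(&table, "/etc/resolv.conf").unwrap();
    assert_eq!(entry.cap_id, 2);     // most specific mount wins
    assert_eq!(rest, "resolv.conf"); // remainder resolved inside that capability
    let (entry, _) = lookup(&table, "/tmp/x").unwrap();
    assert_eq!(entry.cap_id, 1);     // falls through to the root mount
}
```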

FileServer capability (from Plan 9): For resources that are naturally file-like (config trees, debug introspection, /proc-style interfaces), provide a FileServer interface. Not universal (as in Plan 9) but available where the file metaphor fits.

7. Resource Accounting

Genode’s session quota model addresses a gap in capOS: without resource accounting, a malicious client can exhaust a server’s memory by creating many sessions.

Recommendation: Session-creating capability methods should accept a resource donation parameter:

interface NetworkManager {
    createTcpSocket @0 (bufferPages :UInt32) -> (socket :TcpSocket);
}

The client donates buffer memory as part of the session creation. The server allocates from donated resources, not its own.
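A hedged sketch of the accounting this implies on the server side, with invented names and a simple page-count ledger standing in for real donated memory:

```rust
// Sketch of session creation funded by client-donated buffer pages.
struct Session { buffer_pages: u32 }

struct Server { donated_pages: u32, sessions: Vec<Session> }

impl Server {
    // The server allocates session buffers only from pages the client
    // donated, so a client that opens many sessions pays for them itself.
    fn create_session(&mut self, donated: u32, needed: u32) -> Result<usize, &'static str> {
        if donated < needed {
            return Err("insufficient donation");
        }
        self.donated_pages += donated;
        self.sessions.push(Session { buffer_pages: needed });
        Ok(self.sessions.len() - 1)
    }
}

fn main() {
    let mut srv = Server { donated_pages: 0, sessions: Vec::new() };
    assert!(srv.create_session(16, 8).is_ok()); // client funds its own buffers
    assert!(srv.create_session(2, 8).is_err()); // under-funded session is refused
    assert_eq!(srv.donated_pages, 16);
}
```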

8. Language Support Roadmap

From the LLVM research, the recommended order:

| Step | What | Blocks |
|---|---|---|
| 1 | Custom target JSON (x86_64-unknown-capos) | Optional build hygiene; not required by current no_std runtime |
| 2 | VirtualMemory capability | Done for baseline map/unmap/protect; Go allocator glue remains |
| 3 | TLS support (PT_TLS parsing, FS-base save/restore) | Done for static ELF processes; runtime-controlled FS base remains |
| 4 | Futex authority capability + measured ABI | Go threads, pthreads |
| 5 | Timer capability (monotonic clock) | Go scheduler |
| 6 | Go Phase 1: minimal GOOS=capos (single-threaded) | CUE on capOS |
| 7 | Kernel threading | Go GOMAXPROCS > 1 |
| 8 | C toolchain + libcapos | C programs, musl |
| 9 | Go Phase 2: multi-threaded + concurrent GC | Go network services |
| 10 | Go Phase 3: network poller | net/http on capOS |

Key decisions:

  • Keep x86_64-unknown-none for kernel, x86_64-unknown-capos for userspace.
  • Use local-exec TLS model (static linking, no dynamic linker).
  • Implement futex as capability-authorized from the start. Because it operates on memory addresses and must be fast, measure generic capnp/ring calls against a compact capability-authorized operation before fixing the ABI.
  • Go can start with cooperative-only preemption (no signals).

Recommendations by Roadmap Stage

Stage 5: Scheduling

| Source | Recommendation | Priority |
|---|---|---|
| Zircon | Generation counter in CapId (stale reference detection) | Done |
| seL4 | Add notification objects (lightweight bitmask signal/wait) | Medium |
| LLVM | Custom target JSON for userspace (x86_64-unknown-capos) | Medium |
| LLVM | Runtime-controlled FS-base operation for Go/threading | Medium |

Stage 6: IPC and Capability Transfer

| Source | Recommendation | Priority |
|---|---|---|
| seL4 | Direct-switch IPC for synchronous cross-process calls | Done (baseline) |
| seL4 | Badge field on capability entries (server identifies callers) | Done |
| Zircon | Move semantics for capability transfer through IPC | Done |
| Zircon | MemoryObject capability (shared memory for bulk data) | High |
| EROS | Epoch-based revocation (O(1) revoke, O(1) check) | High |
| Zircon | Sideband capability-transfer descriptors and result-cap records | Done (baseline) |
| Genode | SharedBuffer capability for data-plane transfers | High |
| Cap’n Proto | Promise pipelining (SQE chaining in async rings) | Medium |
| Genode | Session quotas / resource donation on session creation | Medium |
| seL4 | Scheduling-context donation through IPC | Medium |
| Plan 9 | Namespace union composition (before/after/replace) | Low |

Post-Stage 6 / Future

| Source | Recommendation | Priority |
|---|---|---|
| seL4 | MCS scheduling (passive servers, temporal isolation) | When needed |
| EROS | Opt-in Checkpoint capability for process persistence | When needed |
| Genode | Dynamic manifest reconfiguration at runtime | When needed |
| Plan 9 | exportfs-pattern capability proxy for network transparency | When needed |
| EROS | PersistentCapRef struct in capnp for storing capability graphs | When needed |
| seL4 | Rust-native formal verification (track Verus/Prusti) | Long-term |

Design Decisions Validated

Several capOS design choices are validated by this research:

  1. Cap’n Proto as the universal wire format. Superior to FIDL (random access, zero-copy, promise pipelining, persistence-ready). The right choice. See zircon.md Section 5.

  2. Flat capability table. Simpler than seL4’s CNode tree, sufficient for capOS. Only add complexity (CNode-like hierarchy) if delegation patterns demand it. See sel4.md Section 4.

  3. No ambient authority. Every surveyed capability OS confirms this is essential. EROS proved confinement. seL4 proved integrity. capOS has this by design.

  4. Explicit persistence over transparent. EROS’s single-level store is elegant but the kernel complexity is enormous. Cap’n Proto zero-copy gives most of the benefits. See eros-capros-coyotos.md Section 6.

  5. io_uring-inspired async rings. Better than Zircon’s port model for capOS (operation-based > notification-based). See zircon.md Section 4.

  6. VFS as library, not kernel feature. Genode’s approach, matched by capOS’s planned libcapos-posix. See genode.md Section 3.

  7. No fork(). Genode has operated without fork() for 15+ years, proving it unnecessary. See genode.md Section 4.

Design Gaps Identified

  1. No bulk data path. Copying capnp messages through the kernel works for control but not for file/network/GPU data. SharedBuffer or MemoryObject capability is essential for Stage 6+.

  2. Resource accounting is fragmented. The authority-accounting design exists and current code has several local ledgers, but VirtualMemory, FrameAllocator, and process/resource quotas are not yet unified.

  3. No notification primitive. seL4 notifications (lightweight bitmask signal/wait) are needed for interrupt delivery and event notification without full capnp message overhead.

  4. No runtime-controlled FS-base/thread TLS API. Static ELF TLS and context-switch FS-base state exist, but Go and future user threads still need a way to set FS base per thread.


References

See individual deep-dive reports for full reference lists. Key primary sources:

  • Klein et al., “seL4: Formal Verification of an OS Kernel,” SOSP 2009
  • Lyons et al., “Scheduling-context capabilities,” EuroSys 2018
  • Shapiro et al., “EROS: A Fast Capability System,” SOSP 1999
  • Shapiro & Weber, “Verifying the EROS Confinement Mechanism,” IEEE S&P 2000
  • Pike et al., “The Use of Name Spaces in Plan 9,” OSR 1993
  • Feske, “Genode Foundations” (genode.org/documentation)
  • Fuchsia Zircon kernel documentation (fuchsia.dev)