seL4 Deep Dive: Lessons for capOS

Research notes on seL4’s design, covering formal verification, capability model, IPC, scheduling, and applicability to capOS.

Primary sources: “seL4: Formal Verification of an OS Kernel” (Klein et al., SOSP 2009), seL4 Reference Manual (v12.x / v13.x), “The seL4 Microkernel – An Introduction” (whitepaper, 2020), “Towards a Verified, General-Purpose Operating System Kernel” (Klein et al., 2008), “Principled Approach to Kernel Design for MCS” (Lyons et al., 2018), seL4 source code and API documentation.


1. Formal Verification Approach

What seL4 Proves

seL4 is the first general-purpose OS kernel with a machine-checked proof of functional correctness. The verification chain establishes:

  1. Functional correctness: The C implementation of the kernel refines (faithfully implements) an abstract specification written in Isabelle/HOL. Every possible execution of the C code corresponds to an allowed behavior in the abstract spec. This is not “absence of some bug class” – it is a complete behavioral equivalence between spec and code.

  2. Integrity (access control): The kernel enforces capability-based access control. A process cannot access a kernel object unless it holds a capability to it. This is proven as a consequence of functional correctness: the spec defines access rules, and the implementation provably follows them.

  3. Confidentiality (information flow): In the verified configuration, information cannot flow between security domains except through explicitly authorized channels. This proves noninterference at the kernel level.

  4. Binary correctness: The proof chain extends from the abstract spec through a Haskell executable model, then to the C implementation, and finally to the compiled ARM binary (via translation validation against the compiler’s output). On ARM, the compiled binary is proven to behave as the C source specifies.

The Verification Chain

Abstract Specification (Isabelle/HOL)
    |
    | refinement proof
    v
Executable Specification (Haskell)
    |
    | refinement proof
    v
C Implementation (10,000 lines of C)
    |
    | translation validation / CompCert
    v
ARM Binary

Each refinement step proves that the lower-level implementation is a correct realization of the higher-level spec. The Haskell model serves as an “executable spec” – it’s precise enough to run but abstract enough to reason about.

Properties Verified

  • No null pointer dereferences – a consequence of functional correctness.
  • No buffer overflows – all array accesses are proven in-bounds.
  • No arithmetic overflow – all integer operations are proven safe.
  • No use-after-free – memory management correctness is proven.
  • No memory leaks (in the kernel) – all allocated memory is accounted for.
  • No undefined behavior – the C code is proven to avoid all UB.
  • Capability enforcement – objects are only accessible through valid capabilities, and capabilities cannot be forged.
  • Authority confinement – proven that authority does not leak beyond what capabilities allow.

Practical Implications

What verification buys you:

  • Eliminates all implementation bugs in the verified code. Not “most bugs” or “common bug classes” – literally all of them, for the verified configuration.
  • The security properties (integrity, confidentiality) hold absolutely, not probabilistically.
  • Makes the kernel trustworthy as a separation kernel / isolation boundary.

What verification does NOT cover:

  • The specification itself could be wrong (it could specify the wrong behavior). Verification proves “code matches spec,” not “spec is correct.”
  • Hardware must behave as modeled. The proof assumes a correct CPU, correct memory, no physical attacks. DMA from malicious devices can break isolation unless an IOMMU is used (and IOMMU management is proven correct).
  • Only the verified configuration is covered. seL4 has unverified configurations (e.g., SMP, RISC-V, certain platform features). Using unverified features voids the proof.
  • Performance-critical code paths (like the IPC fastpath) were initially outside the verification boundary, though significant progress has been made on verifying them.
  • The bootloader and hardware initialization code are outside the proof boundary.
  • Compiler correctness: on x86, the proof trusts GCC. On ARM, binary verification closes this gap.

Design Constraints Imposed by Verification

The requirement of formal verification has profoundly shaped seL4’s design:

  1. Small kernel: ~10,000 lines of C. Every line must be verified, so the kernel is as small as possible. Drivers, file systems, networking – everything lives in user space.

  2. No dynamic memory allocation in the kernel: The kernel does not have a general-purpose heap. All kernel memory is pre-allocated and managed through typed capabilities (Untyped memory). This eliminates an entire class of verification complexity (heap reasoning).

  3. No concurrency in the kernel: The kernel itself is single-threaded, running with interrupts disabled. SMP is handled either by running independent kernel instances on each core with explicit message passing between them (the “clustered multikernel” approach), or by a big kernel lock (the current SMP configuration, which is NOT covered by the verification proof).

  4. C implementation: Written in a restricted subset of C that is amenable to Isabelle/HOL reasoning. No function pointers (mostly), no complex pointer arithmetic, no compiler-specific extensions. This makes the code more rigid than typical C but provable.

  5. Fixed system call set: The kernel API is small and fixed. Adding a new syscall requires extending the proofs – a major effort.

  6. Platform-specific verification: The proof is per-platform. ARM was verified first; x86 verification came later with additional effort. Each new platform requires new proofs.


2. Capability Transfer Model

Core Concepts

seL4’s capability model descends from the EROS/KeyKOS tradition but with significant innovations driven by formal verification requirements.

Kernel Objects: Everything the kernel manages is an object: TCBs (thread control blocks), endpoints (IPC channels), CNodes (capability storage), page tables, frames, address spaces (VSpaces), untyped memory, and more. The kernel tracks the exact type and state of every object.

Capabilities: A capability is a reference to a kernel object combined with access rights. Capabilities are stored in kernel memory, never directly accessible to user space. User space refers to capabilities by position in its capability space.

CSpaces, CNodes, and CSlots

CSlot (Capability Slot): A single storage location that can hold one capability. A CSlot is either empty or contains a capability (object pointer + access rights + badge).

CNode (Capability Node): A kernel object that is a power-of-two-sized array of CSlots. A CNode with 2^n slots has a “guard” and a “radix” of n. CNodes are the building blocks of the capability addressing tree.

CSpace (Capability Space): The complete capability namespace of a thread. A CSpace is a tree of CNodes, rooted at the thread’s CSpace root (a CNode pointed to by the TCB). Capability lookup traverses this tree.

Thread's TCB
  |
  +-- CSpace Root (CNode, 2^8 = 256 slots)
        |
        +-- slot 0: cap to Endpoint A
        +-- slot 1: cap to Frame X
        +-- slot 2: cap to another CNode (2^4 = 16 slots)
        |     |
        |     +-- slot 0: cap to Endpoint B
        |     +-- slot 1: empty
        |     +-- ...
        +-- slot 3: empty
        +-- ...

Capability Addressing (CPtr and Depth)

A CPtr (Capability Pointer) is a word-sized integer used to name a capability within a thread’s CSpace. It is NOT a memory pointer – it is an index that the kernel resolves by walking the CNode tree.

Resolution works bit-by-bit from the most significant end:

  1. Start at the CSpace root CNode.
  2. The CNode’s guard is compared against the corresponding bits of the CPtr. If they don’t match, the lookup fails. Guards allow sparse addressing without allocating huge CNode arrays.
  3. The next radix bits of the CPtr are used as an index into the CNode array.
  4. If the slot contains a CNode capability, recurse: consume the next bits of the CPtr to walk deeper.
  5. If the slot contains any other capability, the lookup is complete.
  6. The depth parameter in the syscall tells the kernel how many bits of the CPtr to consume. This disambiguates between “stop at this CNode cap” and “descend into this CNode.”

Example: A CPtr of 0x4B with a two-level CSpace:

  • Root CNode: guard = 0, radix = 4 (16 slots)
  • Bits [7:4] = 0x4 -> index into root CNode slot 4
  • Slot 4 contains a CNode cap: guard = 0, radix = 4 (16 slots)
  • Bits [3:0] = 0xB -> index into second-level CNode slot 11
  • Slot 11 contains an Endpoint cap -> lookup complete
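
The resolution steps above can be sketched in Rust. The types here (Cap, CNode, resolve, example_cspace) are illustrative stand-ins, not seL4’s actual kernel structures:

```rust
// Sketch of seL4-style CPtr resolution over a CNode tree (illustrative types).

#[derive(Clone)]
enum Cap {
    Empty,
    Endpoint(u64), // stand-in payload (e.g. a badge)
    CNode(Box<CNode>),
}

#[derive(Clone)]
struct CNode {
    guard: u64,
    guard_bits: u32,
    radix: u32, // log2 of the slot count
    slots: Vec<Cap>,
}

fn resolve(root: &CNode, cptr: u64, depth: u32) -> Result<Cap, &'static str> {
    let mut node = root;
    let mut remaining = depth;
    loop {
        let level_bits = node.guard_bits + node.radix;
        if level_bits > remaining {
            return Err("depth exhausted");
        }
        // 1. Compare the guard against the top bits of the remaining CPtr.
        if node.guard_bits > 0 {
            let guard = (cptr >> (remaining - node.guard_bits))
                & ((1u64 << node.guard_bits) - 1);
            if guard != node.guard {
                return Err("guard mismatch");
            }
        }
        // 2. The next `radix` bits index into the slot array.
        let index =
            ((cptr >> (remaining - level_bits)) & ((1u64 << node.radix) - 1)) as usize;
        remaining -= level_bits;
        match &node.slots[index] {
            // 3. A CNode cap with bits left to consume: descend.
            Cap::CNode(child) if remaining > 0 => node = child.as_ref(),
            Cap::Empty => return Err("empty slot"),
            // 4. Any other cap (or a CNode cap at exactly `depth` bits): done.
            cap => return Ok(cap.clone()),
        }
    }
}

/// Build the two-level example from the text: root (radix 4), slot 4 holds a
/// CNode (radix 4) whose slot 11 holds an endpoint cap.
fn example_cspace() -> CNode {
    let mut second = CNode { guard: 0, guard_bits: 0, radix: 4, slots: vec![Cap::Empty; 16] };
    second.slots[11] = Cap::Endpoint(0);
    let mut root = CNode { guard: 0, guard_bits: 0, radix: 4, slots: vec![Cap::Empty; 16] };
    root.slots[4] = Cap::CNode(Box::new(second));
    root
}
```

Resolving CPtr 0x4B at depth 8 against this CSpace consumes 4 bits per level and lands on the endpoint in slot 11 of the second-level CNode.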

Flat Table vs. Hierarchical CSpace

seL4’s hierarchical CSpace has significant implications:

Advantages of hierarchical:

  • Sparse capability spaces without wasting memory. A process can have a huge CPtr range with only a few CNodes allocated.
  • Subtree delegation: a parent can give a child a CNode cap that grants access to a subset of capabilities. The child can manage its own subtree without affecting the parent’s.
  • Guards compress address bits, allowing efficient encoding of large capability identifiers.

Disadvantages of hierarchical:

  • Lookup is slower than a flat array index – multiple memory indirections per resolution.
  • More complex kernel code (and more complex verification).
  • User space must explicitly manage CNode allocation and CSpace layout.

capOS comparison: capOS uses a flat Vec<Option<Arc<dyn CapObject>>> indexed by CapId (u32). The shared Arc lets a single kernel capability back multiple per-process slots, which is what makes cross-process IPC work when another service resolves its CapRef via CapSource::Service. The flat layout is simpler and faster for lookup (single array index), but cannot support sparse addressing or subtree delegation. For capOS’s research goals, the flat approach is adequate initially. If capOS needs hierarchical delegation later (e.g., a supervisor delegating a subset of caps to a child without copying), it could add a level of indirection without adopting seL4’s full tree model.

Capability Operations

seL4 provides these operations on capabilities:

Copy: Duplicate a capability from one CSlot to another. Both the source and destination must be in the caller’s CSpace (or the caller must have CNode caps to the relevant CNodes). The new cap has the same authority as the original, minus any rights the caller chooses to strip.

Mint: Like Copy, but also sets a badge on the new capability. A badge is a word-sized integer embedded in the capability that is delivered to the receiver when the capability is used. Badges allow a server to distinguish which client is calling – each client gets a differently-badged cap to the same endpoint, and the server sees the badge on each incoming message.

Move: Transfer a capability from one CSlot to another. The source slot becomes empty. This is an atomic transfer of authority.

Mutate: Move + modify rights or badge in one operation.

Delete: Remove a capability from a CSlot, making it empty.

Revoke: Delete a capability AND all capabilities derived from it. This is the most powerful operation – it allows a parent to withdraw authority it granted to children, transitively.
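
The non-transitive operations can be sketched over hypothetical slot types (Delete is simply emptying a slot; Revoke needs the derivation tracking described next and is omitted here):

```rust
// Sketch of Copy / Mint / Move semantics over capability slots.
// These types are illustrative stand-ins, not seL4's internal representation.

#[derive(Clone, PartialEq, Debug)]
struct Capability {
    object: u32, // stand-in for a kernel object reference
    rights: u8,  // rights bitmask, e.g. READ | WRITE
    badge: u64,
}

type Slot = Option<Capability>;

/// Copy: duplicate, optionally stripping rights (never adding them).
fn copy(src: &Slot, rights_mask: u8) -> Slot {
    src.clone().map(|mut c| { c.rights &= rights_mask; c })
}

/// Mint: a copy that also stamps a badge onto the new capability.
fn mint(src: &Slot, rights_mask: u8, badge: u64) -> Slot {
    copy(src, rights_mask).map(|mut c| { c.badge = badge; c })
}

/// Move: atomic transfer of authority; the source slot becomes empty.
fn move_cap(src: &mut Slot) -> Slot {
    src.take()
}
```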

Capability Derivation and the CDT

seL4 tracks a Capability Derivation Tree (CDT) – a tree recording which capability was derived from which. When capability A is copied or minted to produce capability B, B becomes a child of A in the CDT.

Revoke(A) deletes all descendants of A in the CDT but leaves A itself. This gives the holder of A the power to revoke all authority derived from their own authority.

The CDT is critical for clean revocation but adds significant kernel complexity. It requires maintaining a tree structure across all capability copies throughout the system.

Untyped Memory and Retype

One of seL4’s most distinctive features is that the kernel never allocates memory on its own. All physical memory is initially represented as Untyped Memory capabilities. To create any kernel object (endpoint, CNode, TCB, page frame, etc.), user space must invoke the Untyped_Retype operation on an untyped cap, which carves out a portion of the untyped memory and creates a new typed object.

This means:

  • User space (specifically, the root task or a memory manager) controls all memory allocation.
  • The kernel has zero internal allocation – all memory it uses comes from retyped untypeds.
  • Memory exhaustion is impossible in the kernel – if a syscall needs memory, user space must have provided it in advance via retype.
  • Revoke on an untyped cap destroys ALL objects created from it, reclaiming the memory. This is the mechanism for wholesale cleanup.
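
The retype discipline can be illustrated with a watermark model, which is roughly how seL4 carves objects out of an untyped region (the types here are simplified stand-ins, not the kernel’s):

```rust
/// Illustrative watermark model of an untyped memory region.
struct Untyped {
    size: usize,      // bytes in the region
    watermark: usize, // bytes already consumed by retyped objects
}

impl Untyped {
    fn new(size: usize) -> Self {
        Untyped { size, watermark: 0 }
    }

    /// Retype: carve out `obj_size` bytes, aligned to `obj_size`
    /// (assumed to be a power of two). Returns the object's offset.
    fn retype(&mut self, obj_size: usize) -> Result<usize, &'static str> {
        let start = (self.watermark + obj_size - 1) & !(obj_size - 1); // align up
        if start + obj_size > self.size {
            return Err("untyped exhausted"); // kernel never allocates more itself
        }
        self.watermark = start + obj_size;
        Ok(start)
    }

    /// Revoke: destroy all derived objects and reclaim the whole region.
    fn revoke(&mut self) {
        self.watermark = 0;
    }
}
```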

3. IPC Fastpath

Overview

seL4’s IPC is synchronous and endpoint-based. An endpoint is a rendezvous point: the sender blocks until a receiver is ready, or vice versa. There is no buffering in the kernel (unlike Mach ports or Linux pipes).

The IPC fastpath is a highly optimized code path for the common case of a short synchronous call/reply between two threads. It is one of seL4’s signature performance features.

How the Fastpath Works

When thread A calls seL4_Call(endpoint_cap, msg):

  1. Capability lookup: Resolve the CPtr to find the endpoint cap. In the fastpath, this is optimized to handle the common case of a direct CSlot lookup (single-level CSpace, no guard traversal needed).

  2. Receiver check: Is there a thread waiting on this endpoint? If yes, the fastpath applies. If no (receiver isn’t ready), fall to the slowpath which queues the sender.

  3. Direct context switch: Instead of the normal path (save sender registers -> return to scheduler -> pick receiver -> restore receiver registers), the fastpath performs a direct register transfer:

    • Save the sender’s register state into its TCB.
    • Copy the message registers (a small number, typically 4-8 words) from the sender’s physical registers directly into the receiver’s TCB (or leave them in registers if possible).
    • Load the receiver’s page table root (vspace) into CR3/TTBR.
    • Switch to the receiver’s kernel stack.
    • Restore the receiver’s register state.
    • Return to user mode as the receiver.

    This is a direct context switch – the kernel goes directly from the sender to the receiver without passing through the scheduler. The IPC operation IS the context switch.

  4. Reply cap: The sender’s reply cap is set up so the receiver can reply. In the classic (non-MCS) model, a one-shot reply capability is placed in the receiver’s TCB. The receiver calls seL4_Reply(msg) to send the response directly back.

Performance Characteristics

seL4 IPC is among the fastest measured:

  • ARM (Cortex-A9): ~240 cycles for a Call+Reply round-trip (including two privilege transitions, a full context switch, and message transfer).
  • x86-64: ~380-500 cycles for a Call+Reply round-trip depending on hardware generation.
  • Message size: The fastpath handles small messages (fits in registers). Longer messages require copying from IPC buffer pages and take the slowpath.

For comparison:

  • Linux pipe IPC: ~5,000-10,000 cycles for a round-trip.
  • Mach IPC (macOS XNU): ~3,000-5,000 cycles.
  • L4/Pistachio: ~700-1,000 cycles (seL4 improved on this).

Fastpath Constraints

The fastpath is only taken when ALL of these conditions hold:

  1. The operation is seL4_Call or seL4_ReplyRecv (the two most common IPC operations).
  2. The message fits in message registers (no extra caps, no long messages that require the IPC buffer).
  3. The capability lookup is “simple” – single-level CSpace, direct slot lookup, no guard bits to check.
  4. There IS a thread waiting at the endpoint (no need to block the sender).
  5. The receiver is at sufficient priority (in the non-MCS configuration, higher priority than any other runnable thread – or in MCS, the scheduling context can be donated).
  6. No capability transfer is happening in this message.
  7. Certain bookkeeping conditions are met (no pending operations on either thread, no debug traps, etc.).

When any condition fails, the kernel falls through to the slowpath, which handles the general case correctly but with more overhead (~5-10x slower than the fastpath).

Direct Switch Mechanics

The key insight is: when thread A calls thread B synchronously, A is going to block until B replies. There is no scheduling decision to make – the only correct action is to run B immediately. So the kernel skips the scheduler entirely:

Thread A (running)          Kernel              Thread B (blocked on recv)
    |                         |                        |
    | seL4_Call(ep, msg) ---> |                        |
    |                         | [fastpath]             |
    |                         | Save A's regs          |
    |                         | Copy msg A -> B        |
    |                         | Switch page tables     |
    |                         | Restore B's regs       |
    |                         | ---------------------->|
    |                         |                        | [running, processes msg]
    |                         |                        |
    |                         | <--- seL4_Reply(reply)  |
    |                         | [fastpath again]       |
    |                         | Save B's regs          |
    |                         | Copy reply B -> A      |
    |                         | Switch page tables     |
    |                         | Restore A's regs       |
    | <-----------------------|                        |
    | [running, has reply]    |                        |

The entire round-trip involves exactly two kernel entries and two context switches, with no scheduler invocation.

Implications

  1. RPC is the natural IPC pattern: seL4’s IPC is optimized for the client-server call/reply pattern. Fire-and-forget or multicast patterns require different mechanisms (notifications, shared memory).

  2. Notifications: For async signaling (like interrupts or events), seL4 provides notification objects – a lightweight word-sized bitmask that can be signaled and waited on without message transfer. These are separate from endpoints.

  3. Shared memory for bulk transfer: IPC messages are small (register-sized). For large data transfers, the standard pattern is: set up shared memory, then use IPC to synchronize. This is explicit – the kernel doesn’t transparently copy large buffers.


4. CNode/CSpace Architecture in Detail

CNode Structure

A CNode object is a contiguous array of CSlots in kernel memory. The size is always a power of two. The kernel metadata for a CNode includes:

  • Radix bits: log2 of the number of slots (e.g., radix=8 means 256 slots).
  • Guard value: a bit pattern that must match the CPtr during resolution.
  • Guard bits: the number of bits in the guard.

The total bits consumed during resolution of one CNode level is: guard_bits + radix_bits.

Multi-Level Resolution Example

Consider a two-level CSpace:

Root CNode: guard=0 (0 bits), radix=8 (256 slots)
  Slot 5 -> CNode B: guard=0x3 (2 bits), radix=6 (64 slots)
    Slot 42 -> Endpoint X

To reach Endpoint X with a 16-bit CPtr at depth 16:

  • CPtr = 0b 00000101 11 101010
  • Root CNode consumes 8 bits: 00000101 = 5 -> Slot 5 (CNode B cap)
  • CNode B guard: next 2 bits = 11 -> matches guard 0x3 -> OK
  • CNode B radix: next 6 bits = 101010 = 42 -> Slot 42 (Endpoint X)
  • Total bits consumed: 8 + 2 + 6 = 16 = depth -> resolution complete
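
The bit accounting above can be checked mechanically. This helper hard-codes the field boundaries of this particular layout (8 root radix bits, then CNode B’s 2 guard bits and 6 radix bits):

```rust
/// Split a 16-bit CPtr using this example's layout.
fn split_cptr(cptr: u32) -> (u32, u32, u32) {
    let root_index = (cptr >> 8) & 0xFF; // bits [15:8]: root CNode index
    let guard = (cptr >> 6) & 0b11;      // bits [7:6]:  CNode B guard
    let leaf_index = cptr & 0x3F;        // bits [5:0]:  CNode B index
    (root_index, guard, leaf_index)
}
```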

CSpace Layout Strategies

Flat: One large root CNode with radix=N, no sub-CNodes. Simple, fast lookup (one level). Wastes memory if the CPtr space is sparse.

Two-level: Small root CNode pointing to sub-CNodes. Common for processes that need moderate capability counts.

Deep: Many levels. Useful for delegation: a supervisor gives a child a cap to a sub-CNode, and the child manages its own CSpace subtree below that point.

Comparison with capOS’s Flat Table

Aspect              | seL4 CSpace                      | capOS CapTable
--------------------|----------------------------------|--------------------------------------
Structure           | Tree of CNodes                   | Flat Vec<Option<Arc<dyn CapObject>>>
Lookup cost         | O(depth) memory indirections     | O(1) array index
Sparse support      | Yes (guards + tree)              | No (dense array, holes via free list)
Subtree delegation  | Yes (grant CNode cap)            | No
Memory overhead     | CNode objects are power-of-2     | Exact-sized Vec
Complexity          | High (bit-level CPtr resolution) | Low
Capability identity | Position in CSpace               | CapId (u32 index)
Verification burden | Very high                        | N/A (Rust safety)

5. MCS (Mixed-Criticality Systems) Scheduling

Background

The original seL4 scheduling model is a simple priority-preemptive scheduler with 256 priority levels and round-robin within each level. This model has a known flaw: priority inversion through IPC. When a high-priority thread calls a low-priority server, the reply might be delayed indefinitely by medium-priority threads preempting the server. The classic solution (priority inheritance) is complex to verify and doesn’t compose well.

The MCS extensions redesign scheduling to solve this and provide temporal isolation.

Key Concepts

Scheduling Context (SC): A new kernel object that represents the “right to execute on a CPU.” An SC holds:

  • A budget (microseconds of CPU time per period)
  • A period
  • A priority
  • Remaining budget in the current period

A thread must have a bound SC to be runnable. Without an SC, a thread cannot execute regardless of its priority.

Reply Object: In the MCS model, the one-shot reply capability from classic seL4 is replaced by an explicit Reply kernel object. When thread A calls thread B:

  1. A’s scheduling context is donated to B.
  2. A reply object is created to hold A’s return path.
  3. B now runs on A’s scheduling context (A’s priority and budget).
  4. When B replies, the SC returns to A.

This solves priority inversion: the server (B) inherits the caller’s priority and budget automatically.

Passive servers: A server thread can exist without its own SC. It only becomes runnable when a client donates an SC via the Call operation. When it replies, it becomes passive again. This is powerful:

  • No CPU time is “reserved” for idle servers.
  • The server executes on the client’s budget – the client pays for the work it requests.
  • Multiple clients can call the same passive server; each brings its own SC.

Temporal Isolation

MCS SCs provide temporal fault isolation:

  • Each SC has a fixed budget/period. A thread cannot exceed its budget in any period. When the budget expires, the thread is descheduled until the next period begins.
  • This is enforced by hardware timer interrupts – the kernel programs the timer to fire when the current SC’s budget expires.
  • A misbehaving (or compromised) component cannot starve other components because its SC bounds its CPU consumption.
  • This works even across IPC: if client A calls server B with A’s SC, the combined execution of A+B is bounded by A’s budget.
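
A simplified sketch of the budget accounting (field names are assumptions; real seL4 MCS uses sporadic-server replenishments, which are more involved than a whole-period refill):

```rust
/// Sketch of an MCS scheduling context's budget bookkeeping.
struct SchedContext {
    budget_us: u64,   // allowed execution time per period
    period_us: u64,   // replenishment period
    consumed_us: u64, // time used so far in the current period
}

impl SchedContext {
    /// Charge `us` of execution to this context. Returns false once the
    /// budget is exhausted: the thread must be descheduled until refill.
    fn charge(&mut self, us: u64) -> bool {
        self.consumed_us += us;
        self.consumed_us <= self.budget_us
    }

    /// Timer interrupt at the start of a new period: refill the budget.
    fn refill(&mut self) {
        self.consumed_us = 0;
    }
}
```

Donation means the server’s work is charged against the caller’s SchedContext, so the A+B bound above falls out of the same accounting.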

Comparison with capOS’s Scheduler

capOS currently has a round-robin scheduler (kernel/src/sched.rs) with no priority levels and no temporal isolation:

struct Scheduler {
    processes: BTreeMap<Pid, Process>,
    run_queue: VecDeque<Pid>,
    current: Option<Pid>,
}

Timer preemption, cap_enter blocking waits, Endpoint IPC, and a baseline direct IPC handoff are implemented. The MCS model is relevant for the next scheduling step because the same priority inversion problem arises when a high-priority client calls a low-priority server through a capability.


6. Relevance to capOS

6.1 Formal Verification

Applicability: Low in the near term. seL4’s verification is done in Isabelle/HOL over C code, which doesn’t transfer to Rust. However, the constraints that verification imposed are valuable design guidance:

  • Minimal kernel: seL4’s ~10K lines of C demonstrate how little code a microkernel actually needs. capOS should resist adding kernel features and instead move them to user space.
  • No kernel heap allocation on the critical path: seL4’s “untyped memory” approach where user space provides all memory is worth studying. capOS has removed the earlier allocation-heavy synchronous ring dispatch path, but it still uses owned kernel objects and preallocated scratch rather than a user-supplied untyped-memory model.
  • No kernel concurrency: seL4 avoids kernel-level concurrency entirely (SMP uses separate kernel instances or a big lock). capOS currently uses spin::Mutex around the scheduler and capability tables. The seL4 approach suggests this is acceptable until/unless per-CPU kernel instances are needed.

Rust alternative: Rust’s type system provides memory safety guarantees that overlap with some of seL4’s verified properties (no buffer overflows, no use-after-free, no null dereference in safe code). This is not a substitute for functional correctness proofs, but it significantly raises the bar compared to unverified C. Ongoing research in Rust formal verification (e.g., Prusti, Creusot, Verus) may eventually enable seL4-style proofs over Rust kernels.

6.2 Capability Model

CNode tree vs. flat table: capOS’s flat CapTable is the right choice for now. seL4’s CNode tree exists to support delegation (granting a subtree of your CSpace to a child) and sparse addressing. capOS’s current model gives each process its own independent flat table and now supports manifest-provided caps plus explicit copy/move transfer descriptors through Endpoint IPC. If capOS later needs fine-grained delegation (a parent granting access to a subset of its caps without copying), it can add a level of indirection:

Option A: Proxy capability objects that forward to the parent's table
Option B: A two-level table (small root array -> larger sub-arrays)
Option C: Shared capability objects with refcounting
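
Option A could look like the following sketch (trait and names are hypothetical, loosely modeled on the dyn CapObject trait mentioned above):

```rust
use std::sync::{Arc, Mutex};

// Hypothetical stand-ins for capOS's capability object trait and an endpoint.
trait CapObject {
    fn invoke(&self) -> &'static str;
}

struct Endpoint;
impl CapObject for Endpoint {
    fn invoke(&self) -> &'static str { "ok" }
}

/// A revocable proxy: forwards to the real object until the creator cuts
/// the link. Holders keep the proxy, never the underlying capability.
struct ProxyCap {
    target: Mutex<Option<Arc<dyn CapObject>>>,
}

impl ProxyCap {
    fn new(target: Arc<dyn CapObject>) -> Self {
        ProxyCap { target: Mutex::new(Some(target)) }
    }

    fn invoke(&self) -> Result<&'static str, &'static str> {
        match &*self.target.lock().unwrap() {
            Some(t) => Ok(t.invoke()),
            None => Err("revoked"),
        }
    }

    /// Only the creator (who retains the ProxyCap handle) can do this.
    fn revoke(&self) {
        *self.target.lock().unwrap() = None;
    }
}
```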

Badge/Mint pattern: seL4’s badge mechanism is directly applicable to capOS. When multiple clients share a capability to the same server endpoint, the server needs to distinguish callers. In seL4, each client gets a differently-badged copy of the endpoint cap. The badge is delivered with each message.

capOS has implemented this by adding badge metadata to capability references and hold edges. Endpoint CALL delivery reports the invoked hold badge to the receiver, and copy/move transfer preserves badge metadata.

Current ring SQEs carry cap id and method id separately. The cap table stores badge and transfer-mode metadata alongside the object reference:

struct CapEntry {
    object: Arc<dyn CapObject>,
    badge: u64,
    transfer_mode: CapTransferMode,
}

Revocation (CDT): seL4’s Capability Derivation Tree is its most complex internal structure. For capOS, full CDT-style transitive revocation is probably overkill initially. The service-architecture proposal already identifies simpler alternatives:

  • Generation counters: Each capability has a generation number. Bumping the generation invalidates all references without traversing a tree.
  • Proxy caps: A proxy object that can be invalidated by its creator. Callers hold the proxy, not the real capability.
  • Process-lifetime revocation: When a process dies, all caps it held are automatically invalidated (seL4 does this too, but the CDT allows more fine-grained revocation within a living process).
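
The generation-counter scheme can be sketched as follows (hypothetical types, not capOS’s actual code): one atomic bump invalidates every outstanding reference, with no derivation tree to walk.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;

/// One entry backing a capability. Bumping `generation` revokes all holders.
struct CapObjectSlot {
    generation: AtomicU32,
}

/// A reference handed to a holder: the slot plus the generation at grant time.
#[derive(Clone)]
struct CapRef {
    slot: Arc<CapObjectSlot>,
    generation: u32,
}

impl CapRef {
    fn grant(slot: &Arc<CapObjectSlot>) -> Self {
        CapRef {
            slot: Arc::clone(slot),
            generation: slot.generation.load(Ordering::Acquire),
        }
    }

    /// Checked on every use: a stale generation means the cap was revoked.
    fn is_valid(&self) -> bool {
        self.slot.generation.load(Ordering::Acquire) == self.generation
    }
}

/// Revoke every outstanding reference to this slot in O(1).
fn revoke_all(slot: &CapObjectSlot) {
    slot.generation.fetch_add(1, Ordering::AcqRel);
}
```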

Untyped memory: seL4’s “no kernel allocation” model via untyped memory is elegant but probably too heavyweight for capOS’s current stage. The key takeaway is the principle: user space should control resource allocation as much as possible. capOS’s FrameAllocator capability already moves frame allocation authority into the capability model.

6.3 IPC Design

This is the most directly actionable area for capOS’s Stage 6.

seL4’s model (synchronous rendezvous + direct switch) vs. capOS’s model (async rings + Cap’n Proto wire format):

Aspect                 | seL4                               | capOS
-----------------------|------------------------------------|------------------------------------------
IPC primitive          | Synchronous endpoint               | Async submission/completion rings
Message format         | Untyped words in registers         | Cap’n Proto serialized messages
Bulk transfer          | Shared memory (explicit)           | TBD (copy in kernel or shared memory)
Message size           | Small (register-sized, ~4-8 words) | Variable (up to 64KB currently)
Scheduling integration | Direct switch (caller -> callee)   | Baseline direct IPC handoff implemented
Batching               | No (one message per syscall)       | Yes (io_uring-style ring)

Key lessons from seL4’s IPC for capOS:

  1. Direct switch for synchronous RPC: Even with async rings, capOS needs a synchronous fast path. The baseline single-CPU direct IPC handoff is implemented for the case where process A calls an Endpoint and process B is blocked waiting in RECV. Future work is register payload transfer and measured fastpath tuning.

  2. Register-based message transfer for small messages: seL4 avoids copying message bytes through kernel buffers for small messages by transferring them through registers during the context switch. capOS currently moves serialized payloads through ring buffers and bounded kernel scratch. For cross-process IPC, minimizing copies is critical. Options:

    • Small messages (<64 bytes) could be transferred in registers during direct switch.
    • Large messages could use shared memory regions (mapped into both address spaces) with IPC used only for synchronization.
    • The io_uring-style rings are already shared memory – the submission and completion ring buffers could potentially be mapped into both the caller’s and callee’s address spaces for zero-copy IPC.
  3. Separate mechanisms for sync and async: seL4 uses endpoints for synchronous IPC and notification objects for async signaling. capOS’s io_uring approach inherently supports batched async operations, but the common case of a simple RPC call-and-wait should have a fast synchronous path too. The two mechanisms complement each other.

  4. Notifications for interrupts and events: seL4’s notification objects (lightweight bitmask signal/wait) map well to capOS’s interrupt delivery model. When a hardware interrupt fires, the kernel signals a notification object, and the driver thread waiting on that notification wakes up. This is cleaner than delivering interrupts as full IPC messages.

The Cap’n Proto dimension: capOS’s use of Cap’n Proto wire format for capability messages is a significant divergence from seL4’s untyped word arrays. Tradeoffs:

  • Pro: Type safety, schema evolution, language-neutral interfaces, built-in serialization/deserialization, native support for capability references in messages (Cap’n Proto has a “capability table” concept in its RPC protocol).
  • Con: Serialization overhead. Even Cap’n Proto’s zero-copy format requires pointer validation and bounds checking that seL4’s raw register transfer does not. For very hot IPC paths, this overhead may be significant.
  • Mitigation: For the hot path, capOS could define a “small message” format that bypasses full capnp serialization – just a few raw words, similar to seL4’s register message. Fall back to full capnp for larger or more complex messages.
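
One possible shape for such a split, with an assumed 4-word small-message limit (everything here is a design sketch, not existing capOS code):

```rust
const SMALL_MSG_WORDS: usize = 4;

/// Two message encodings: raw words for the hot path, full serialization
/// (e.g. capnp bytes) for everything else.
enum Msg {
    Small([u64; SMALL_MSG_WORDS], usize), // raw words + count; register-friendly
    Full(Vec<u8>),                        // serialized payload
}

/// Pick the encoding: raw words when the payload fits, otherwise invoke the
/// (caller-supplied) serializer.
fn encode(words: &[u64], serialize: impl FnOnce() -> Vec<u8>) -> Msg {
    if words.len() <= SMALL_MSG_WORDS {
        let mut buf = [0u64; SMALL_MSG_WORDS];
        buf[..words.len()].copy_from_slice(words);
        Msg::Small(buf, words.len())
    } else {
        Msg::Full(serialize())
    }
}
```

The point of the split is that the serializer closure is never run on the fast path; the kernel (or IPC stub) can branch on the variant before touching any buffer.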

6.4 MCS Scheduling

Priority donation via IPC: Directly relevant when capOS implements cross-process capability calls. If process A (high priority) calls a capability in process B (low priority), B needs to run at A’s priority to avoid inversion. The seL4 MCS approach of “donating” the scheduling context with the IPC message is clean and composable.

For capOS, the io_uring model complicates this slightly: if submissions are batched, which submitter’s priority should the server inherit? Options:

  • Inherit the highest priority among pending submissions.
  • Each submission carries its own priority/scheduling context.
  • Use the synchronous fast-path (with donation) for priority-sensitive calls, and the async ring for bulk/background operations.

Passive servers: The MCS concept of servers that only consume CPU when called (by borrowing the caller’s scheduling context) maps well to capOS’s capability-based services. A network stack server that only runs when a client sends a request, consuming the client’s CPU budget, is a natural fit for capOS’s service architecture.

Temporal isolation: Budget/period enforcement prevents denial-of-service between capability holders. Even if process A holds a capability to process B, A cannot cause B to consume unbounded CPU time – B’s execution on behalf of A is bounded by A’s scheduling context budget. This is worth considering for capOS’s roadmap, especially for the cloud deployment scenario where isolation is critical.
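Budget/period enforcement can be sketched as follows, with time in abstract ticks. The `SchedContext` type and its fields are hypothetical stand-ins for an MCS-style scheduling context: a thread charging against the context is preempted once the budget for the current period is exhausted, and the budget refills when a new period begins.

```rust
/// Sketch of a scheduling context with budget/period enforcement.
struct SchedContext {
    budget: u64,       // ticks allowed per period
    period: u64,       // length of the replenishment window, in ticks
    remaining: u64,    // ticks left in the current window
    window_start: u64, // tick at which the current window began
}

impl SchedContext {
    fn new(budget: u64, period: u64) -> Self {
        SchedContext { budget, period, remaining: budget, window_start: 0 }
    }

    /// Charge `ticks` of execution at time `now`. Returns false when the
    /// context is out of budget: the thread must be preempted until the
    /// next replenishment, bounding what any caller can consume.
    fn charge(&mut self, now: u64, ticks: u64) -> bool {
        if now >= self.window_start + self.period {
            // A new period has started: align the window and replenish.
            self.window_start = now - (now - self.window_start) % self.period;
            self.remaining = self.budget;
        }
        if ticks > self.remaining {
            self.remaining = 0;
            return false;
        }
        self.remaining -= ticks;
        true
    }
}

fn main() {
    // 3 ticks of budget every 10-tick period.
    let mut sc = SchedContext::new(3, 10);
    assert!(sc.charge(0, 2));  // 1 tick left in this window
    assert!(!sc.charge(1, 2)); // over budget: preempt
    assert!(sc.charge(10, 3)); // new period at t=10: budget refilled
}
```

In the donation picture, this context travels with the IPC: B runs against A's `SchedContext`, so A's budget, not B's goodwill, bounds the work A can trigger.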

6.5 Specific Recommendations for capOS

Near-term (Stages 5-6):

  1. Badge field on cap holds: Done. Manifest CapRef badge metadata is carried into cap-table hold edges, delivered to Endpoint receivers, and preserved across copy/move transfer.

  2. Implement direct-switch IPC for synchronous calls: Baseline done for Endpoint receivers blocked in RECV. Remaining work is the measured fastpath shape, especially small-message register transfer.

  3. Keep the flat CapTable: seL4’s CNode tree complexity is justified by formal verification constraints and subtree delegation. capOS’s flat table is simpler and sufficient. Add proxy/wrapper capabilities for delegation rather than restructuring the table.

  4. Add notification objects: A lightweight signaling primitive (word-sized bitmask, signal/wait operations) for interrupt delivery and event notification. Much cheaper than sending a full capnp message for “wake up, there’s work to do.”

Medium-term (post-Stage 6):

  1. Scheduling context donation: When implementing priority scheduling, attach a scheduling context to IPC calls so servers inherit caller priority. This prevents priority inversion through the capability graph.

  2. Capability rights attenuation: Add a rights mask to capability references so a parent can grant a cap with reduced permissions (e.g., read-only access to a read-write capability). seL4’s rights bits are: Read, Write, Grant (can pass the cap to others), GrantReply (can pass reply cap only).
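Rights attenuation is naturally monotone when the derived rights are the bitwise AND of the parent's rights and the requested mask. A minimal sketch, with seL4's four rights modeled as bits (the `CapRef` type and `attenuate` name are illustrative, not capOS API):

```rust
// seL4's rights, modeled as bitflags.
const READ: u8 = 0b0001;
const WRITE: u8 = 0b0010;
const GRANT: u8 = 0b0100;       // may pass the cap to others
const GRANT_REPLY: u8 = 0b1000; // may pass the reply cap only

/// Hypothetical capability reference: an object index plus a rights mask.
#[derive(Clone, Copy, PartialEq, Debug)]
struct CapRef {
    object: u64,
    rights: u8,
}

impl CapRef {
    /// Derive a copy with rights intersected. The child can never hold
    /// more rights than the parent, so attenuation composes safely.
    fn attenuate(&self, mask: u8) -> CapRef {
        CapRef { object: self.object, rights: self.rights & mask }
    }
}

fn main() {
    let rw = CapRef { object: 42, rights: READ | WRITE | GRANT };

    // Grant a read-only view of the same object.
    let ro = rw.attenuate(READ);
    assert_eq!(ro.rights, READ);
    assert_eq!(ro.object, rw.object);

    // Attenuation cannot re-add rights the parent lacks.
    let none = ro.attenuate(WRITE | GRANT_REPLY);
    assert_eq!(none.rights, 0);
}
```

Because `&` is idempotent and order-independent, chains of delegation compose without the kernel tracking who attenuated what, which keeps this compatible with the flat CapTable recommendation above.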

  3. Revocation via generation/epoch counters: Generation-tagged CapIds catch stale slot reuse today. Object-wide epoch revocation remains future work and is simpler than a derivation tree.
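The generation-tagged mechanism already in place works roughly like the following sketch: a `CapId` packs a slot index and the generation at which it was minted, and lookups reject any id whose generation no longer matches the slot's. Type and method names here are illustrative, not the actual capOS implementation.

```rust
/// Hypothetical handle: slot index plus the generation at mint time.
#[derive(Clone, Copy, PartialEq, Debug)]
struct CapId {
    slot: usize,
    generation: u32,
}

struct Slot {
    generation: u32,
    object: Option<u64>, // Some(object id) when occupied
}

struct CapTable {
    slots: Vec<Slot>,
}

impl CapTable {
    /// Resolve a handle; stale handles (slot reused since mint) fail.
    fn lookup(&self, id: CapId) -> Option<u64> {
        let slot = self.slots.get(id.slot)?;
        if slot.generation != id.generation {
            return None; // slot was freed and reallocated: reject
        }
        slot.object
    }

    /// Free and reuse a slot, bumping the generation so that every
    /// previously minted CapId for this slot goes stale at once.
    fn reuse(&mut self, slot: usize, new_object: u64) -> CapId {
        let s = &mut self.slots[slot];
        s.generation += 1;
        s.object = Some(new_object);
        CapId { slot, generation: s.generation }
    }
}

fn main() {
    let mut table = CapTable {
        slots: vec![Slot { generation: 1, object: Some(100) }],
    };
    let old = CapId { slot: 0, generation: 1 };
    assert_eq!(table.lookup(old), Some(100));

    let fresh = table.reuse(0, 200);
    assert_eq!(table.lookup(old), None); // stale generation rejected
    assert_eq!(table.lookup(fresh), Some(200));
}
```

Object-wide epoch revocation would extend the same comparison to a counter on the kernel object itself, invalidating all outstanding handles in one bump rather than walking a CDT-style derivation tree.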

Long-term (research directions):

  1. Zero-copy IPC via shared memory: For bulk data transfer between processes, map shared memory regions (Cap’n Proto segments) into both address spaces. Use IPC only for synchronization and capability transfer. This combines seL4’s “shared memory + IPC sync” pattern with capOS’s Cap’n Proto wire format.

  2. Rust-native verification: Track developments in Verus, Prusti, and other Rust verification tools. capOS’s Rust implementation is better positioned for future formal verification than a C implementation would be, given the type system guarantees already present.

  3. Untyped memory model: Consider moving kernel object allocation entirely into capability-gated operations (like seL4’s Retype). User space provides memory for all kernel objects, ensuring the kernel never runs out of memory on its own. This is a significant architectural change but aligns with the “everything is a capability” principle.


Summary Table

| seL4 Feature | Maturity | capOS Equivalent | Recommended Action |
|---|---|---|---|
| Functional correctness proof | Production | None (Rust type safety) | Track Rust verification tools |
| CNode/CSpace tree | Production | Flat CapTable | Keep flat |
| Capability badge/mint | Production | Hold-edge badge | Done (baseline) |
| Revocation (CDT) | Production | Generation-tagged CapId; no epoch yet | Use epoch revocation instead of CDT |
| Untyped memory / Retype | Production | FrameAllocator cap | Consider for hardening phase |
| Synchronous IPC endpoints | Production | Endpoint CALL/RECV/RETURN | Done (baseline) |
| IPC fastpath (direct switch) | Production | Direct IPC handoff | Done (baseline); tune register payload later |
| Notification objects | Production | None | Implement as lightweight signal primitive |
| MCS Scheduling Contexts | Production | Round-robin scheduler | Implement SC donation for IPC |
| Passive servers | Production | None | Natural fit with cap-based services |
| Temporal isolation | Production | None | Consider for cloud deployment |

References

  1. Klein, G., et al. “seL4: Formal Verification of an OS Kernel.” SOSP 2009.
  2. seL4 Reference Manual, versions 12.1.0 and 13.0.0.
  3. “The seL4 Microkernel – An Introduction.” seL4 Foundation Whitepaper, 2020.
  4. Lyons, A., et al. “Scheduling-context capabilities: A principled, light-weight operating-system mechanism for managing time.” EuroSys 2018.
  5. Heiser, G., & Elphinstone, K. “L4 Microkernels: The Lessons from 20 Years of Research and Deployment.” ACM Transactions on Computer Systems, 2016.
  6. seL4 source code: https://github.com/seL4/seL4
  7. seL4 API documentation: https://docs.sel4.systems/