# seL4 Deep Dive: Lessons for capOS

Research notes on seL4's design, covering formal verification, capability
model, IPC, scheduling, and applicability to capOS.

Primary sources: "seL4: Formal Verification of an OS Kernel" (Klein et al.,
SOSP 2009), seL4 Reference Manual (v12.x / v13.x), "The seL4 Microkernel --
An Introduction" (whitepaper, 2020), "Towards a Verified, General-Purpose
Operating System Kernel" (Klein et al., 2008), "Principled Approach to Kernel
Design for MCS" (Lyons et al., 2018), seL4 source code and API documentation.

---

## 1. Formal Verification Approach

### What seL4 Proves

seL4 is the first general-purpose OS kernel with a machine-checked proof of
functional correctness. The verification chain establishes:

1. **Functional correctness**: The C implementation of the kernel refines
   (faithfully implements) an abstract specification written in Isabelle/HOL.
   Every possible execution of the C code corresponds to an allowed behavior
   in the abstract spec. This is not "absence of some bug class" -- it is a
   complete behavioral equivalence between spec and code.

2. **Integrity (access control)**: The kernel enforces capability-based access
   control. A process cannot access a kernel object unless it holds a
   capability to it. This is proven as a consequence of functional
   correctness: the spec defines access rules, and the implementation provably
   follows them.

3. **Confidentiality (information flow)**: In the verified configuration,
   information cannot flow between security domains except through explicitly
   authorized channels. This proves noninterference at the kernel level.

4. **Binary correctness**: The proof chain extends from the abstract spec
   through a Haskell executable model, then to the C implementation, and
   finally to the compiled ARM binary (via the verified CAmkES/CompCert
   chain or translation validation against GCC output). On ARM, the
   compiled binary is proven to behave as the C source specifies.

### The Verification Chain

```
Abstract Specification (Isabelle/HOL)
    |
    | refinement proof
    v
Executable Specification (Haskell)
    |
    | refinement proof
    v
C Implementation (10,000 lines of C)
    |
    | translation validation / CompCert
    v
ARM Binary
```

Each refinement step proves that the lower-level implementation is a correct
realization of the higher-level spec. The Haskell model serves as an
"executable spec" -- it's precise enough to run but abstract enough to reason
about.

### Properties Verified

- **No null pointer dereferences** -- a consequence of functional correctness.
- **No buffer overflows** -- all array accesses are proven in-bounds.
- **No arithmetic overflow** -- all integer operations are proven safe.
- **No use-after-free** -- memory management correctness is proven.
- **No memory leaks** (in the kernel) -- all allocated memory is accounted for.
- **No undefined behavior** -- the C code is proven to avoid all UB.
- **Capability enforcement** -- objects are only accessible through valid
  capabilities, and capabilities cannot be forged.
- **Authority confinement** -- proven that authority does not leak beyond
  what capabilities allow.

### Practical Implications

**What verification buys you:**
- Eliminates *all* implementation bugs in the verified code. Not "most bugs"
  or "common bug classes" -- literally all of them, for the verified
  configuration.
- The security properties (integrity, confidentiality) hold absolutely, not
  probabilistically.
- Makes the kernel trustworthy as a separation kernel / isolation boundary.

**What verification does NOT cover:**
- The specification itself could be wrong (it could specify the wrong
  behavior). Verification proves "code matches spec," not "spec is correct."
- Hardware must behave as modeled. The proof assumes a correct CPU, correct
  memory, no physical attacks. DMA from malicious devices can break
  isolation unless an IOMMU is used (and IOMMU management is proven correct).
- Only the verified configuration is covered. seL4 has unverified
  configurations (e.g., SMP, RISC-V, certain platform features). Using
  unverified features voids the proof.
- Performance-critical code paths (like the IPC fastpath) were initially
  outside the verification boundary, though significant progress has been
  made on verifying them.
- The bootloader and hardware initialization code are outside the proof
  boundary.
- Compiler correctness: on x86, the proof trusts GCC. On ARM, binary
  verification closes this gap.

### Design Constraints Imposed by Verification

The requirement of formal verification has profoundly shaped seL4's design:

1. **Small kernel**: ~10,000 lines of C. Every line must be verified, so
   the kernel is as small as possible. Drivers, file systems, networking --
   everything lives in user space.

2. **No dynamic memory allocation in the kernel**: The kernel does not have
   a general-purpose heap. All kernel memory is pre-allocated and managed
   through typed capabilities (Untyped memory). This eliminates an entire
   class of verification complexity (heap reasoning).

3. **No concurrency in the kernel**: seL4 runs the kernel as a single-
   threaded "big lock" model (interrupts disabled in kernel mode). SMP is
   handled by running independent kernel instances on each core with
   explicit message passing between them (the "clustered multikernel"
   approach), or by using a big kernel lock (the current SMP approach, which
   is NOT covered by the verification proof).

4. **C implementation**: Written in a restricted subset of C that is amenable
   to Isabelle/HOL reasoning. No function pointers (mostly), no complex
   pointer arithmetic, no compiler-specific extensions. This makes the code
   more rigid than typical C but provable.

5. **Fixed system call set**: The kernel API is small and fixed. Adding a
   new syscall requires extending the proofs -- a major effort.

6. **Platform-specific verification**: The proof is per-platform. ARM was
   verified first; x86 verification came later with additional effort.
   Each new platform requires new proofs.

---

## 2. Capability Transfer Model

### Core Concepts

seL4's capability model descends from the EROS/KeyKOS tradition but with
significant innovations driven by formal verification requirements.

**Kernel Objects**: Everything the kernel manages is an object: TCBs (thread
control blocks), endpoints (IPC channels), CNodes (capability storage),
page tables, frames, address spaces (VSpaces), untyped memory, and more.
The kernel tracks the exact type and state of every object.

**Capabilities**: A capability is a reference to a kernel object combined
with access rights. Capabilities are stored in kernel memory, never directly
accessible to user space. User space refers to capabilities by position in
its capability space.

### CSpaces, CNodes, and CSlots

**CSlot (Capability Slot)**: A single storage location that can hold one
capability. A CSlot is either empty or contains a capability (object pointer
+ access rights + badge).

**CNode (Capability Node)**: A kernel object that is a power-of-two-sized
array of CSlots. A CNode with `2^n` slots has a "guard" and a "radix" of
`n`. CNodes are the building blocks of the capability addressing tree.

**CSpace (Capability Space)**: The complete capability namespace of a thread.
A CSpace is a *tree* of CNodes, rooted at the thread's CSpace root (a CNode
pointed to by the TCB). Capability lookup traverses this tree.

```
Thread's TCB
  |
  +-- CSpace Root (CNode, 2^8 = 256 slots)
        |
        +-- slot 0: cap to Endpoint A
        +-- slot 1: cap to Frame X
        +-- slot 2: cap to another CNode (2^4 = 16 slots)
        |     |
        |     +-- slot 0: cap to Endpoint B
        |     +-- slot 1: empty
        |     +-- ...
        +-- slot 3: empty
        +-- ...
```

### Capability Addressing (CPtr and Depth)

A **CPtr (Capability Pointer)** is a word-sized integer used to name a
capability within a thread's CSpace. It is NOT a memory pointer -- it is an
index that the kernel resolves by walking the CNode tree.

Resolution works bit-by-bit from the most significant end:

1. Start at the CSpace root CNode.
2. The CNode's **guard** is compared against the corresponding bits of the
   CPtr. If they don't match, the lookup fails. Guards allow sparse
   addressing without allocating huge CNode arrays.
3. The next `radix` bits of the CPtr are used as an index into the CNode
   array.
4. If the slot contains a CNode capability, recurse: consume the next bits
   of the CPtr to walk deeper.
5. If the slot contains any other capability, the lookup is complete.
6. The **depth** parameter in the syscall tells the kernel how many bits
   of the CPtr to consume. This disambiguates between "stop at this CNode
   cap" and "descend into this CNode."

Example: A CPtr of `0x4B` with a two-level CSpace:
- Root CNode: guard = 0, radix = 4 (16 slots)
- Bits [7:4] = 0x4 -> index into root CNode slot 4
- Slot 4 contains a CNode cap: guard = 0, radix = 4 (16 slots)
- Bits [3:0] = 0xB -> index into second-level CNode slot 11
- Slot 11 contains an Endpoint cap -> lookup complete

### Flat Table vs. Hierarchical CSpace

seL4's hierarchical CSpace has significant implications:

**Advantages of hierarchical:**
- Sparse capability spaces without wasting memory. A process can have a
  huge CPtr range with only a few CNodes allocated.
- Subtree delegation: a parent can give a child a CNode cap that grants
  access to a subset of capabilities. The child can manage its own subtree
  without affecting the parent's.
- Guards compress address bits, allowing efficient encoding of large
  capability identifiers.

**Disadvantages of hierarchical:**
- Lookup is slower than a flat array index -- multiple memory indirections
  per resolution.
- More complex kernel code (and more complex verification).
- User space must explicitly manage CNode allocation and CSpace layout.

**capOS comparison**: capOS uses a flat `Vec<Option<Arc<dyn CapObject>>>`
indexed by `CapId` (u32). The shared `Arc` lets a single kernel capability
back multiple per-process slots, which is what makes cross-process IPC work
when another service resolves its `CapRef` via `CapSource::Service`. The flat
layout is simpler and faster for lookup (single array index), but cannot
support sparse addressing or subtree delegation.
For capOS's research goals, the flat approach is adequate initially. If
capOS needs hierarchical delegation later (e.g., a supervisor delegating
a subset of caps to a child without copying), it could add a level of
indirection without adopting seL4's full tree model.

### Capability Operations

seL4 provides these operations on capabilities:

**Copy**: Duplicate a capability from one CSlot to another. Both the source
and destination must be in the caller's CSpace (or the caller must have
CNode caps to the relevant CNodes). The new cap has the same authority as
the original, minus any rights the caller chooses to strip.

**Mint**: Like Copy, but also sets a **badge** on the new capability. A badge
is a word-sized integer embedded in the capability that is delivered to the
receiver when the capability is used. Badges allow a server to distinguish
which client is calling -- each client gets a differently-badged cap to the
same endpoint, and the server sees the badge on each incoming message.

**Move**: Transfer a capability from one CSlot to another. The source slot
becomes empty. This is an atomic transfer of authority.

**Mutate**: Move + modify rights or badge in one operation.

**Delete**: Remove a capability from a CSlot, making it empty.

**Revoke**: Delete a capability AND all capabilities derived from it. This
is the most powerful operation -- it allows a parent to withdraw authority
it granted to children, transitively.

### Capability Derivation and the CDT

seL4 tracks a **Capability Derivation Tree (CDT)** -- a tree recording which
capability was derived from which. When capability A is copied or minted to
produce capability B, B becomes a child of A in the CDT.

`Revoke(A)` deletes all descendants of A in the CDT but leaves A itself.
This gives the holder of A the power to revoke all authority derived from
their own authority.

The CDT is critical for clean revocation but adds significant kernel
complexity. It requires maintaining a tree structure across all capability
copies throughout the system.

### Untyped Memory and Retype

One of seL4's most distinctive features is that **the kernel never allocates
memory on its own**. All physical memory is initially represented as
**Untyped Memory** capabilities. To create any kernel object (endpoint, CNode,
TCB, page frame, etc.), user space must invoke the `Untyped_Retype` operation
on an untyped cap, which carves out a portion of the untyped memory and
creates a new typed object.

This means:
- User space (specifically, the root task or a memory manager) controls all
  memory allocation.
- The kernel has zero internal allocation -- all memory it uses comes from
  retyped untypeds.
- Memory exhaustion is impossible in the kernel -- if a syscall needs memory,
  user space must have provided it in advance via retype.
- Revoke on an untyped cap destroys ALL objects created from it, reclaiming
  the memory. This is the mechanism for wholesale cleanup.

---

## 3. IPC Fastpath

### Overview

seL4's IPC is synchronous and endpoint-based. An endpoint is a rendezvous
point: the sender blocks until a receiver is ready, or vice versa. There is
no buffering in the kernel (unlike Mach ports or Linux pipes).

The IPC **fastpath** is a highly optimized code path for the common case of
a short synchronous call/reply between two threads. It is one of seL4's
signature performance features.

### How the Fastpath Works

When thread A calls `seL4_Call(endpoint_cap, msg)`:

1. **Capability lookup**: Resolve the CPtr to find the endpoint cap. In the
   fastpath, this is optimized to handle the common case of a direct CSlot
   lookup (single-level CSpace, no guard traversal needed).

2. **Receiver check**: Is there a thread waiting on this endpoint? If yes,
   the fastpath applies. If no (receiver isn't ready), fall to the slowpath
   which queues the sender.

3. **Direct context switch**: Instead of the normal path (save sender
   registers -> return to scheduler -> pick receiver -> restore receiver
   registers), the fastpath performs a **direct register transfer**:
   - Save the sender's register state into its TCB.
   - Copy the message registers (a small number, typically 4-8 words) from
     the sender's physical registers directly into the receiver's TCB (or
     leave them in registers if possible).
   - Load the receiver's page table root (vspace) into CR3/TTBR.
   - Switch to the receiver's kernel stack.
   - Restore the receiver's register state.
   - Return to user mode as the receiver.

   This is a **direct context switch** -- the kernel goes directly from
   the sender to the receiver without passing through the scheduler. The
   IPC operation IS the context switch.

4. **Reply cap**: The sender's reply cap is set up so the receiver can
   reply. In the classic (non-MCS) model, a one-shot reply capability is
   placed in the receiver's TCB. The receiver calls `seL4_Reply(msg)` to
   send the response directly back.

### Performance Characteristics

seL4 IPC is among the fastest measured:

- **ARM (Cortex-A9)**: ~240 cycles for a Call+Reply round-trip (including
  two privilege transitions, a full context switch, and message transfer).
- **x86-64**: ~380-500 cycles for a Call+Reply round-trip depending on
  hardware generation.
- **Message size**: The fastpath handles small messages (fits in registers).
  Longer messages require copying from IPC buffer pages and take the slowpath.

For comparison:
- Linux `pipe` IPC: ~5,000-10,000 cycles for a round-trip.
- Mach IPC (macOS XNU): ~3,000-5,000 cycles.
- L4/Pistachio: ~700-1,000 cycles (seL4 improved on this).

### Fastpath Constraints

The fastpath is only taken when ALL of these conditions hold:

1. The operation is `seL4_Call` or `seL4_ReplyRecv` (the two most common
   IPC operations).
2. The message fits in message registers (no extra caps, no long messages
   that require the IPC buffer).
3. The capability lookup is "simple" -- single-level CSpace, direct slot
   lookup, no guard bits to check.
4. There IS a thread waiting at the endpoint (no need to block the sender).
5. The receiver is at sufficient priority (in the non-MCS configuration,
   higher priority than any other runnable thread -- or in MCS, the
   scheduling context can be donated).
6. No capability transfer is happening in this message.
7. Certain bookkeeping conditions are met (no pending operations on either
   thread, no debug traps, etc.).

When any condition fails, the kernel falls through to the **slowpath**,
which handles the general case correctly but with more overhead (~5-10x
slower than the fastpath).

### Direct Switch Mechanics

The key insight is: **when thread A calls thread B synchronously, A is
going to block until B replies.** There is no scheduling decision to make --
the only correct action is to run B immediately. So the kernel skips the
scheduler entirely:

```
Thread A (running)          Kernel              Thread B (blocked on recv)
    |                         |                        |
    | seL4_Call(ep, msg) ---> |                        |
    |                         | [fastpath]             |
    |                         | Save A's regs          |
    |                         | Copy msg A -> B        |
    |                         | Switch page tables     |
    |                         | Restore B's regs       |
    |                         | ---------------------->|
    |                         |                        | [running, processes msg]
    |                         |                        |
    |                         | <--- seL4_Reply(reply)  |
    |                         | [fastpath again]       |
    |                         | Save B's regs          |
    |                         | Copy reply B -> A      |
    |                         | Switch page tables     |
    |                         | Restore A's regs       |
    | <-----------------------|                        |
    | [running, has reply]    |                        |
```

The entire round-trip involves exactly two kernel entries and two context
switches, with no scheduler invocation.

### Implications

1. **RPC is the natural IPC pattern**: seL4's IPC is optimized for the
   client-server call/reply pattern. Fire-and-forget or multicast patterns
   require different mechanisms (notifications, shared memory).

2. **Notifications**: For async signaling (like interrupts or events),
   seL4 provides **notification objects** -- a lightweight word-sized
   bitmask that can be signaled and waited on without message transfer.
   These are separate from endpoints.

3. **Shared memory for bulk transfer**: IPC messages are small (register-
   sized). For large data transfers, the standard pattern is: set up shared
   memory, then use IPC to synchronize. This is explicit -- the kernel
   doesn't transparently copy large buffers.

---

## 4. CNode/CSpace Architecture in Detail

### CNode Structure

A CNode object is a contiguous array of CSlots in kernel memory. The size
is always a power of two. The kernel metadata for a CNode includes:

- **Radix bits**: log2 of the number of slots (e.g., radix=8 means 256 slots).
- **Guard value**: a bit pattern that must match the CPtr during resolution.
- **Guard bits**: the number of bits in the guard.

The total bits consumed during resolution of one CNode level is:
`guard_bits + radix_bits`.

### Multi-Level Resolution Example

Consider a two-level CSpace:

```
Root CNode: guard=0 (0 bits), radix=8 (256 slots)
  Slot 5 -> CNode B: guard=0x3 (2 bits), radix=6 (64 slots)
    Slot 42 -> Endpoint X
```

To reach Endpoint X with a 16-bit CPtr at depth 16:
- CPtr = 0b 00000101 11 101010
- Root CNode consumes 8 bits: 00000101 = 5 -> Slot 5 (CNode B cap)
- CNode B guard: next 2 bits = 11 -> matches guard 0x3 -> OK
- CNode B radix: next 6 bits = 101010 = 42 -> Slot 42 (Endpoint X)
- Total bits consumed: 8 + 2 + 6 = 16 = depth -> resolution complete

### CSpace Layout Strategies

**Flat**: One large root CNode with radix=N, no sub-CNodes. Simple, fast
lookup (one level). Wastes memory if the CPtr space is sparse.

**Two-level**: Small root CNode pointing to sub-CNodes. Common for
processes that need moderate capability counts.

**Deep**: Many levels. Useful for delegation: a supervisor gives a child a
cap to a sub-CNode, and the child manages its own CSpace subtree below that
point.

### Comparison with capOS's Flat Table

| Aspect | seL4 CSpace | capOS CapTable |
|--------|-------------|----------------|
| Structure | Tree of CNodes | `Flat Vec<Option<Arc<dyn CapObject>>>` |
| Lookup cost | O(depth) memory indirections | O(1) array index |
| Sparse support | Yes (guards + tree) | No (dense array, holes via free list) |
| Subtree delegation | Yes (grant CNode cap) | No |
| Memory overhead | CNode objects are power-of-2 | Exact-sized Vec |
| Complexity | High (bit-level CPtr resolution) | Low |
| Capability identity | Position in CSpace | CapId (u32 index) |
| Verification burden | Very high | N/A (Rust safety) |

---

## 5. MCS (Mixed-Criticality Systems) Scheduling

### Background

The original seL4 scheduling model is a simple priority-preemptive scheduler
with 256 priority levels and round-robin within each level. This model has a
known flaw: **priority inversion through IPC**. When a high-priority thread
calls a low-priority server, the reply might be delayed indefinitely by
medium-priority threads preempting the server. The classic solution (priority
inheritance) is complex to verify and doesn't compose well.

The MCS extensions redesign scheduling to solve this and provide temporal
isolation.

### Key Concepts

**Scheduling Context (SC)**: A new kernel object that represents the "right
to execute on a CPU." An SC holds:
- A budget (microseconds of CPU time per period)
- A period
- A priority
- Remaining budget in the current period

A thread must have a bound SC to be runnable. Without an SC, a thread cannot
execute regardless of its priority.

**Reply Object**: In the MCS model, the one-shot reply capability from
classic seL4 is replaced by an explicit Reply kernel object. When thread A
calls thread B:
1. A's scheduling context is **donated** to B.
2. A reply object is created to hold A's return path.
3. B now runs on A's scheduling context (A's priority and budget).
4. When B replies, the SC returns to A.

This solves priority inversion: the server (B) inherits the caller's
priority and budget automatically.

**Passive servers**: A server thread can exist without its own SC. It only
becomes runnable when a client donates an SC via the Call operation. When it
replies, it becomes passive again. This is powerful:
- No CPU time is "reserved" for idle servers.
- The server executes on the client's budget -- the client pays for
  the work it requests.
- Multiple clients can call the same passive server; each brings its
  own SC.

### Temporal Isolation

MCS SCs provide **temporal fault isolation**:

- Each SC has a fixed budget/period. A thread cannot exceed its budget
  in any period. When the budget expires, the thread is descheduled until
  the next period begins.
- This is enforced by hardware timer interrupts -- the kernel programs the
  timer to fire when the current SC's budget expires.
- A misbehaving (or compromised) component cannot starve other components
  because its SC bounds its CPU consumption.
- This works even across IPC: if client A calls server B with A's SC,
  the combined execution of A+B is bounded by A's budget.

### Comparison with capOS's Scheduler

capOS currently has a round-robin scheduler (`kernel/src/sched.rs`) with no
priority levels and no temporal isolation:

```rust
struct Scheduler {
    processes: BTreeMap<Pid, Process>,
    run_queue: VecDeque<Pid>,
    current: Option<Pid>,
}
```

Timer preemption, `cap_enter` blocking waits, Endpoint IPC, and a baseline
direct IPC handoff are implemented. The MCS model is relevant for the next
scheduling step because the same priority inversion problem arises when a
high-priority client calls a low-priority server through a capability.

---

## 6. Relevance to capOS

### 6.1 Formal Verification

**Applicability**: Low in the near term. seL4's verification is done in
Isabelle/HOL over C code, which doesn't transfer to Rust. However, the
*constraints* that verification imposed are valuable design guidance:

- **Minimal kernel**: seL4's ~10K lines of C demonstrate how little code
  a microkernel actually needs. capOS should resist adding kernel features
  and instead move them to user space.
- **No kernel heap allocation on the critical path**: seL4's "untyped memory"
  approach where user space provides all memory is worth studying. capOS has
  removed the earlier allocation-heavy synchronous ring dispatch path, but it
  still uses owned kernel objects and preallocated scratch rather than a
  user-supplied untyped-memory model.
- **No kernel concurrency**: seL4 avoids kernel-level concurrency entirely
  (SMP uses separate kernel instances or a big lock). capOS currently uses
  `spin::Mutex` around the scheduler and capability tables. The seL4
  approach suggests this is acceptable until/unless per-CPU kernel instances
  are needed.

**Rust alternative**: Rust's type system provides memory safety guarantees
that overlap with some of seL4's verified properties (no buffer overflows,
no use-after-free, no null dereference in safe code). This is not a
substitute for functional correctness proofs, but it significantly raises
the bar compared to unverified C. Ongoing research in Rust formal
verification (e.g., Prusti, Creusot, Verus) may eventually enable
seL4-style proofs over Rust kernels.

### 6.2 Capability Model

**CNode tree vs. flat table**: capOS's flat `CapTable` is the right choice
for now. seL4's CNode tree exists to support delegation (granting a subtree
of your CSpace to a child) and sparse addressing. capOS's current model
gives each process its own independent flat table and now supports
manifest-provided caps plus explicit copy/move transfer descriptors through
Endpoint IPC. If capOS later needs fine-grained delegation (a parent granting
access to a subset of its caps without copying), it can add a level of
indirection:

```
Option A: Proxy capability objects that forward to the parent's table
Option B: A two-level table (small root array -> larger sub-arrays)
Option C: Shared capability objects with refcounting
```

**Badge/Mint pattern**: seL4's badge mechanism was initially applied to capOS
as endpoint receiver metadata: multiple clients could share one endpoint while
the server saw a word-sized caller tag. capOS implemented that substrate by
adding badge metadata to capability references and hold edges; endpoint CALL
delivery reported the invoked hold badge to the receiver, and copy/move
transfer preserved badge metadata.

That model is now historical. Badge-as-service-identity was rejected after
spawn and shell paths exposed delegated-client relabeling hazards. The active
direction is session-bound invocation context: endpoint metadata may remain as
internal receiver metadata or hostile-test fixture, but normal shared-service
identity should come from process session context, broker-granted service
facets, and privacy-bounded disclosure. See
`docs/proposals/rejected-endpoint-badges-proposal.md` and
`docs/proposals/session-bound-invocation-context-proposal.md`.

Current ring SQEs carry cap id and method id separately. The cap table stores
badge and transfer-mode metadata alongside the object reference:

```rust
struct CapEntry {
    object: Arc<dyn CapObject>,
    badge: u64,
    transfer_mode: CapTransferMode,
}
```

**Revocation (CDT)**: seL4's Capability Derivation Tree is its most complex
internal structure. For capOS, full CDT-style transitive revocation is
probably overkill initially. The service-architecture proposal already
identifies simpler alternatives:
- **Generation counters**: Each capability has a generation number. Bumping
  the generation invalidates all references without traversing a tree.
- **Proxy caps**: A proxy object that can be invalidated by its creator.
  Callers hold the proxy, not the real capability.
- **Process-lifetime revocation**: When a process dies, all caps it held
  are automatically invalidated (seL4 does this too, but the CDT allows
  more fine-grained revocation within a living process).

**Untyped memory**: seL4's "no kernel allocation" model via untyped memory
is elegant but probably too heavyweight for capOS's current stage. The key
takeaway is the *principle*: user space should control resource allocation
as much as possible. capOS's `FrameAllocator` capability already moves frame
allocation authority into the capability model.

### 6.3 IPC Design

This is the most directly actionable area for capOS's Stage 6.

**seL4's model (synchronous rendezvous + direct switch) vs. capOS's model
(async rings + Cap'n Proto wire format):**

| Aspect | seL4 | capOS |
|--------|------|-----------------|
| IPC primitive | Synchronous endpoint | Async submission/completion rings |
| Message format | Untyped words in registers | Cap'n Proto serialized messages |
| Bulk transfer | Shared memory (explicit) | TBD (copy in kernel or shared memory) |
| Message size | Small (register-sized, ~4-8 words) | Variable (up to 64KB currently) |
| Scheduling integration | Direct switch (caller -> callee) | Baseline direct IPC handoff implemented |
| Batching | No (one message per syscall) | Yes (io_uring-style ring) |

**Key lessons from seL4's IPC for capOS:**

1. **Direct switch for synchronous RPC**: Even with async rings, capOS needs a
   synchronous fast path. The baseline single-CPU direct IPC handoff is
   implemented for the case where process A calls an Endpoint and process B is
   blocked waiting in RECV. Future work is register payload transfer and
   measured fastpath tuning.

2. **Register-based message transfer for small messages**: seL4 avoids
   copying message bytes through kernel buffers for small messages by
   transferring them through registers during the context switch. capOS
   currently moves serialized payloads through ring buffers and bounded kernel
   scratch. For cross-process IPC, minimizing copies is critical. Options:
   - Small messages (<64 bytes) could be transferred in registers during
     direct switch.
   - Large messages could use shared memory regions (mapped into both
     address spaces) with IPC used only for synchronization.
   - The io_uring-style rings are already shared memory -- the submission
     and completion ring buffers could potentially be mapped into both
     the caller's and callee's address spaces for zero-copy IPC.

3. **Separate mechanisms for sync and async**: seL4 uses endpoints for
   synchronous IPC and notification objects for async signaling. capOS's
   io_uring approach inherently supports batched async operations, but the
   common case of a simple RPC call-and-wait should have a fast synchronous
   path too. The two mechanisms complement each other.

4. **Notifications for interrupts and events**: seL4's notification objects
   (lightweight bitmask signal/wait) map well to capOS's interrupt delivery
   model. When a hardware interrupt fires, the kernel signals a notification
   object, and the driver thread waiting on that notification wakes up. This
   is cleaner than delivering interrupts as full IPC messages.

**The Cap'n Proto dimension**: capOS's use of Cap'n Proto wire format for
capability messages is a significant divergence from seL4's untyped word
arrays. Tradeoffs:

- **Pro**: Type safety, schema evolution, language-neutral interfaces, built-in
  serialization/deserialization, native support for capability references
  in messages (Cap'n Proto has a "capability table" concept in its RPC
  protocol).
- **Con**: Serialization overhead. Even Cap'n Proto's zero-copy format requires
  pointer validation and bounds checking that seL4's raw register transfer
  does not. For very hot IPC paths, this overhead may be significant.
- **Mitigation**: For the hot path, capOS could define a "small message"
  format that bypasses full capnp serialization -- just a few raw words,
  similar to seL4's register message. Fall back to full capnp for larger
  or more complex messages.

### 6.4 MCS Scheduling

**Priority donation via IPC**: Directly relevant when capOS implements
cross-process capability calls. If process A (high priority) calls a
capability in process B (low priority), B needs to run at A's priority to
avoid inversion. The seL4 MCS approach of "donating" the scheduling context
with the IPC message is clean and composable.

For capOS, the io_uring model complicates this slightly: if submissions are
batched, which submitter's priority should the server inherit? Options:
- Inherit the highest priority among pending submissions.
- Each submission carries its own priority/scheduling context.
- Use the synchronous fast-path (with donation) for priority-sensitive
  calls, and the async ring for bulk/background operations.

**Passive servers**: The MCS concept of servers that only consume CPU when
called (by borrowing the caller's scheduling context) maps well to capOS's
capability-based services. A network stack server that only runs when a
client sends a request, consuming the client's CPU budget, is a natural
fit for capOS's service architecture.

**Temporal isolation**: Budget/period enforcement prevents denial-of-service
between capability holders. Even if process A holds a capability to process
B, A cannot cause B to consume unbounded CPU time -- B's execution on behalf
of A is bounded by A's scheduling context budget. This is worth considering
for capOS's roadmap, especially for the cloud deployment scenario where
isolation is critical.

### 6.5 Specific Recommendations for capOS

**Near-term (Stages 5-6):**

1. **Badge field on cap holds**: Done. Manifest `CapRef` badge metadata is
   carried into cap-table hold edges, delivered to Endpoint receivers, and
   preserved across copy/move transfer.

2. **Implement direct-switch IPC for synchronous calls**: Baseline done for
   Endpoint receivers blocked in RECV. Remaining work is the measured fastpath
   shape, especially small-message register transfer.

3. **Keep the flat CapTable**: seL4's CNode tree complexity is justified by
   formal verification constraints and subtree delegation. capOS's flat
   table is simpler and sufficient. Add proxy/wrapper capabilities for
   delegation rather than restructuring the table.

4. **Add notification objects**: A lightweight signaling primitive (word-
   sized bitmask, signal/wait operations) for interrupt delivery and
   event notification. Much cheaper than sending a full capnp message for
   "wake up, there's work to do."

**Medium-term (post-Stage 6):**

5. **Scheduling context donation**: When implementing priority scheduling,
   attach a scheduling context to IPC calls so servers inherit caller
   priority. This prevents priority inversion through the capability graph.

6. **Capability rights attenuation**: Add a rights mask to capability
   references so a parent can grant a cap with reduced permissions (e.g.,
   read-only access to a read-write capability). seL4's rights bits are:
   Read, Write, Grant (can pass the cap to others), GrantReply (can pass
   reply cap only).

7. **Revocation via generation/epoch counters**: Generation-tagged `CapId`s
   catch stale slot reuse, and object-wide epoch revocation now invalidates
   current child-local grant copies without a seL4-style derivation tree.

**Long-term (research directions):**

8. **Zero-copy IPC via shared memory**: For bulk data transfer between
   processes, map shared memory regions (Cap'n Proto segments) into both
   address spaces. Use IPC only for synchronization and capability transfer.
   This combines seL4's "shared memory + IPC sync" pattern with capOS's
   Cap'n Proto wire format.

9. **Rust-native verification**: Track developments in Verus, Prusti, and
   other Rust verification tools. capOS's Rust implementation is better
   positioned for future formal verification than a C implementation would
   be, given the type system guarantees already present.

10. **Untyped memory model**: Consider moving kernel object allocation
    entirely into capability-gated operations (like seL4's Retype). User
    space provides memory for all kernel objects, ensuring the kernel never
    runs out of memory on its own. This is a significant architectural
    change but aligns with the "everything is a capability" principle.

---

## Summary Table

| seL4 Feature | Maturity | capOS Equivalent | Recommended Action |
|---|---|---|---|
| Functional correctness proof | Production | None (Rust type safety) | Track Rust verification tools |
| CNode/CSpace tree | Production | Flat CapTable | Keep flat |
| Capability badge/mint | Production | Hold-edge badge | Done baseline |
| Revocation (CDT) | Production | Generation-tagged `CapId`; object-epoch revocation for child-local grants | Keep epoch revocation instead of adding CDT |
| Untyped memory / Retype | Production | FrameAllocator cap | Consider for hardening phase |
| Synchronous IPC endpoints | Production | Endpoint CALL/RECV/RETURN | Done baseline |
| IPC fastpath (direct switch) | Production | Direct IPC handoff | Done baseline; tune register payload later |
| Notification objects | Production | None | Implement as lightweight signal primitive |
| MCS Scheduling Contexts | Production | Round-robin scheduler | Implement SC donation for IPC |
| Passive servers | Production | None | Natural fit with cap-based services |
| Temporal isolation | Production | None | Consider for cloud deployment |

---

## References

1. Klein, G., et al. "seL4: Formal Verification of an OS Kernel." SOSP 2009.
2. seL4 Reference Manual, versions 12.1.0 and 13.0.0.
3. "The seL4 Microkernel -- An Introduction." seL4 Foundation Whitepaper, 2020.
4. Lyons, A., et al. "Scheduling-context capabilities: A principled, light-weight
   operating-system mechanism for managing time." EuroSys 2018.
5. Heiser, G., & Elphinstone, K. "L4 Microkernels: The Lessons from 20 Years of
   Research and Deployment." SOSP 2016.
6. seL4 source code: https://github.com/seL4/seL4
7. seL4 API documentation: https://docs.sel4.systems/
