# OS Error Handling in Capability Systems: Research Notes

Research on error handling patterns in capability-based and microkernel
operating systems. Used as input for the capOS error handling proposal.

---

## 1. seL4

### Error Codes

seL4 defines 11 `seL4_Error` values (ten error codes plus `seL4_NoError`) in `errors.h`:

```c
typedef enum {
    seL4_NoError            = 0,
    seL4_InvalidArgument    = 1,
    seL4_InvalidCapability  = 2,
    seL4_IllegalOperation   = 3,
    seL4_RangeError         = 4,
    seL4_AlignmentError     = 5,
    seL4_FailedLookup       = 6,
    seL4_TruncatedMessage   = 7,
    seL4_DeleteFirst        = 8,
    seL4_RevokeFirst        = 9,
    seL4_NotEnoughMemory    = 10,
} seL4_Error;
```

### Error Return Mechanism

- Capability invocations (kernel object operations) return `seL4_Error`
  directly.
- IPC messages use `seL4_MessageInfo_t` with `label`, `length`, `extraCaps`,
  `capsUnwrapped`. The kernel copies `label` unmodified and does not
  interpret it.
- Kernel object invocations via `seL4_Call` return their `seL4_Error` in the
  `label` of the reply message info; message registers (MR0 onward) carry
  any additional error detail.
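
A minimal sketch of how the four `seL4_MessageInfo_t` fields might pack into one word, and how a user server could carry its own status in `label`. The `msginfo_*` names are hypothetical, and the field widths (7-bit `length`, 2-bit `extraCaps`, 3-bit `capsUnwrapped`, `label` in the remaining high bits) are an assumption for illustration, not copied from libsel4.

```c
#include <stdint.h>

/* Hypothetical stand-in for seL4_MessageInfo_t. The field widths are an
 * assumption for illustration; consult libsel4 for the real layout. */
typedef struct { uint64_t word; } msginfo_t;

enum {
    LENGTH_BITS      = 7,
    EXTRA_CAPS_BITS  = 2,
    UNWRAPPED_BITS   = 3,
    EXTRA_CAPS_SHIFT = LENGTH_BITS,
    UNWRAPPED_SHIFT  = EXTRA_CAPS_SHIFT + EXTRA_CAPS_BITS,
    LABEL_SHIFT      = UNWRAPPED_SHIFT + UNWRAPPED_BITS
};

/* Pack the four fields into one word; the small fields are masked to
 * their assumed width. */
msginfo_t msginfo_new(uint64_t label, unsigned caps_unwrapped,
                      unsigned extra_caps, unsigned length)
{
    msginfo_t m;
    m.word = (label << LABEL_SHIFT)
           | ((uint64_t)(caps_unwrapped & 0x7u) << UNWRAPPED_SHIFT)
           | ((uint64_t)(extra_caps & 0x3u) << EXTRA_CAPS_SHIFT)
           | (uint64_t)(length & 0x7Fu);
    return m;
}

/* The label is returned exactly as the sender set it. */
uint64_t msginfo_label(msginfo_t m)
{
    return m.word >> LABEL_SHIFT;
}

unsigned msginfo_extra_caps(msginfo_t m)
{
    return (unsigned)((m.word >> EXTRA_CAPS_SHIFT) & 0x3u);
}

unsigned msginfo_length(msginfo_t m)
{
    return (unsigned)(m.word & 0x7Fu);
}
```

Because the kernel does not interpret `label`, a server is free to treat it as a method selector on requests and as a status code on replies.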

### Error Propagation

Fault handler mechanism: each TCB has a fault endpoint capability. On fault
(capability fault, VM fault, etc.):

1. Kernel blocks the faulting thread.
2. Kernel sends an IPC to the fault endpoint with fault-type-specific fields.
3. Fault handler (typically a separate thread or process) receives the
   message, handles the fault, and replies.
4. Kernel resumes the faulting thread.
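
The four steps can be modelled as a small state machine. This is a sketch under stated assumptions: `tcb_t`, `fault_msg_t`, `raise_fault`, and `vm_only_handler` are hypothetical names, and the handler is an ordinary function call standing in for IPC to the fault endpoint.

```c
#include <stdbool.h>

typedef enum { THREAD_RUNNING, THREAD_BLOCKED_ON_FAULT } thread_state_t;
typedef enum { FAULT_CAP, FAULT_VM } fault_kind_t;

/* Hypothetical fault message: the kind plus one fault-specific field. */
typedef struct {
    fault_kind_t  kind;
    unsigned long addr;
} fault_msg_t;

/* Hypothetical TCB: a state plus a handler standing in for the fault
 * endpoint. The handler returns true if it fixed the fault and replied. */
typedef struct {
    thread_state_t state;
    bool (*handler)(const fault_msg_t *msg);
} tcb_t;

/* Model of the flow: block, deliver, and resume only after a reply. */
thread_state_t raise_fault(tcb_t *t, fault_kind_t kind, unsigned long addr)
{
    t->state = THREAD_BLOCKED_ON_FAULT;              /* step 1 */
    fault_msg_t msg = { kind, addr };                /* step 2 */
    bool replied = t->handler && t->handler(&msg);   /* step 3 */
    if (replied) {
        t->state = THREAD_RUNNING;                   /* step 4 */
    }
    return t->state;
}

/* Example handler: resolves VM faults only, leaves other faults pending. */
bool vm_only_handler(const fault_msg_t *msg)
{
    return msg->kind == FAULT_VM;
}
```

In this model a thread whose handler never replies simply stays blocked, so the handler decides between recovery and suspension.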

### Design Choices

- `seL4_NBSend` on invalid capability: **silently fails** (prevents covert
  channels).
- `seL4_Send`/`seL4_Call` on invalid capability: returns
  `seL4_FailedLookup`.
- No application-level error convention -- user servers choose their own
  protocol.
- Partial capability transfer: if some caps in a multi-cap transfer fail,
  the caps already transferred stay transferred; `extraCaps` reflects the
  number that succeeded.
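
A sketch of the partial-transfer rule. The `cap_t` type and `transfer_caps` function are hypothetical, the limit of three extra caps is an assumption, and so is the choice to stop at the first failing cap instead of skipping it.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_EXTRA_CAPS 3   /* assumed per-message limit, for illustration */

/* Hypothetical capability: a validity flag plus an identifier. */
typedef struct {
    bool valid;
    int  id;
} cap_t;

/* Copy caps in order and stop at the first one that cannot be
 * transferred. Caps copied before the failure stay in dst. The return
 * value is what the receiver would observe as extraCaps. */
size_t transfer_caps(const cap_t *src, size_t n, cap_t *dst)
{
    size_t done = 0;
    if (n > MAX_EXTRA_CAPS) {
        n = MAX_EXTRA_CAPS;
    }
    while (done < n && src[done].valid) {
        dst[done] = src[done];
        done++;
    }
    return done;
}
```

The receiver has to compare the returned count with what the protocol expected; no separate error code signals the shortfall in this model.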

### Sources

- seL4 errors.h: https://github.com/seL4/seL4/blob/master/libsel4/include/sel4/errors.h
- seL4 IPC tutorial: https://docs.sel4.systems/Tutorials/ipc.html
- seL4 fault handlers: https://docs.sel4.systems/Tutorials/fault-handlers.html
- seL4 API reference: https://docs.sel4.systems/projects/sel4/api-doc.html

---

## 2. Fuchsia / Zircon

### zx_status_t

Signed 32-bit integer. Negative = error, `ZX_OK` (0) = success.

**Categories:**

| Category | Examples |
|---|---|
| General | `ZX_ERR_INTERNAL`, `ZX_ERR_NOT_SUPPORTED`, `ZX_ERR_NO_RESOURCES`, `ZX_ERR_NO_MEMORY` |
| Parameter | `ZX_ERR_INVALID_ARGS`, `ZX_ERR_WRONG_TYPE`, `ZX_ERR_BAD_HANDLE`, `ZX_ERR_BUFFER_TOO_SMALL` |
| State | `ZX_ERR_BAD_STATE`, `ZX_ERR_NOT_FOUND`, `ZX_ERR_TIMED_OUT`, `ZX_ERR_ALREADY_EXISTS`, `ZX_ERR_PEER_CLOSED` |
| Permission | `ZX_ERR_ACCESS_DENIED` |
| I/O | `ZX_ERR_IO`, `ZX_ERR_IO_REFUSED`, `ZX_ERR_IO_DATA_INTEGRITY`, `ZX_ERR_IO_DATA_LOSS` |
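
A sketch of the sign convention in use. `status_t` and the `ST_*` constants are stand-ins, not the real Zircon definitions, and their numeric values are placeholders chosen only to be negative; `first_error` is a hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int32_t status_t;                      /* stand-in for zx_status_t */

/* Placeholder values: only the sign convention matters here. */
#define ST_OK               ((status_t)0)
#define ST_ERR_INVALID_ARGS ((status_t)-10)
#define ST_ERR_PEER_CLOSED  ((status_t)-24)

/* Negative means error; zero means success. */
bool status_is_error(status_t s)
{
    return s < 0;
}

/* Hypothetical helper: return the first non-OK status in a sequence of
 * results, or ST_OK if every step succeeded. */
status_t first_error(const status_t *results, int n)
{
    for (int i = 0; i < n; i++) {
        if (results[i] != ST_OK) {
            return results[i];
        }
    }
    return ST_OK;
}
```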

### FIDL Error Handling (Three Layers)

**Layer 1: Transport errors.** The channel itself failed. Currently every
transport-level FIDL error closes the channel, and the client observes
`ZX_ERR_PEER_CLOSED`.

**Layer 2: Epitaphs (RFC-0053).** Server sends a special final message
before closing a channel, explaining why. Wire format: ordinal `0xFFFFFFFF`,
error status in the reserved `uint32` of the FIDL message header. After
sending, server closes the channel.

**Layer 3: Application errors (RFC-0060).** Methods declare error types:

```fidl
Method() -> (string result) error int32;
```

Serialized as:

```
union MethodReturn {
    MethodResult result;
    int32 err;
};
```

Error types are constrained to `int32`, `uint32`, or an enum thereof. There is
deliberately **no standard error enum** -- each service defines its own error
domain. Rationale: standard error enums "try to capture more detail than we
think is appropriate."
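
The result-or-error union above might be modelled in C as a tagged union. This is a sketch only: the tag values, the `method_*` helper names, and the in-memory layout are hypothetical and say nothing about the actual FIDL wire format.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical discriminant: which arm of the union is live. */
typedef enum { RETURN_TAG_RESULT = 1, RETURN_TAG_ERR = 2 } return_tag_t;

typedef struct {
    const char *result;            /* the "string result" of Method() */
} method_result_t;

typedef struct {
    return_tag_t tag;
    union {
        method_result_t result;
        int32_t         err;       /* service-defined error domain */
    } u;
} method_return_t;

method_return_t method_ok(const char *s)
{
    method_return_t r;
    r.tag = RETURN_TAG_RESULT;
    r.u.result.result = s;
    return r;
}

method_return_t method_fail(int32_t err)
{
    method_return_t r;
    r.tag = RETURN_TAG_ERR;
    r.u.err = err;
    return r;
}

/* Returns true and stores the result on success; otherwise returns
 * false and stores the service-defined error code. */
bool method_get(const method_return_t *r, const char **out, int32_t *err)
{
    if (r->tag == RETURN_TAG_RESULT) {
        *out = r->u.result.result;
        return true;
    }
    *err = r->u.err;
    return false;
}
```

The caller cannot read the result without first checking which arm is live, which is the property the per-method error type is meant to provide.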

C++ binding: `zx::result<T>` (specialization of `fit::result<zx_status_t, T>`).

### Sources

- Zircon errors: https://fuchsia.dev/fuchsia-src/concepts/kernel/errors
- RFC-0060 error handling: https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs/0060_error_handling
- RFC-0053 epitaphs: https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs/0053_epitaphs

---

## 3. EROS / KeyKOS / Coyotos

### KeyKOS Invocation Message Format

```
KC (Key, Order_code)
   STRUCTFROM(arg_structure)
   KEYSFROM(arg_key_slots)
   STRUCTTO(reply_structure)
   KEYSTO(reply_key_slots)
   RCTO(return_code_variable)
```

- **Order code**: small integer selecting the operation (method selector).
- **Return code**: integer returned by the invoked object via `RCTO`.
- **Data string**: bulk data parameter (up to ~4KB).
- **Keys**: up to 4 capability parameters in each direction.
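
A sketch of the invocation shape as C structs, with a toy object that dispatches on the order code. Everything here is hypothetical (`kc_request_t`, `kc_reply_t`, the counter object, its order codes and return codes); only the field roles follow the list above.

```c
#include <stddef.h>
#include <stdint.h>

#define KC_MAX_KEYS 4        /* "up to 4" keys per direction, per the notes */

typedef uint32_t kc_key_t;   /* hypothetical opaque key-slot index */

/* Request side: KC(key, order_code) plus STRUCTFROM and KEYSFROM. */
typedef struct {
    kc_key_t       target;
    uint32_t       order_code;         /* method selector */
    const uint8_t *data;               /* data string */
    size_t         data_len;
    kc_key_t       keys[KC_MAX_KEYS];
} kc_request_t;

/* Reply side: STRUCTTO, KEYSTO, and RCTO. */
typedef struct {
    uint32_t return_code;
    uint32_t value;
    kc_key_t keys[KC_MAX_KEYS];
} kc_reply_t;

enum { ORDER_READ = 0, ORDER_INCREMENT = 1 };   /* hypothetical order codes */
enum { RC_OK = 0, RC_UNKNOWN_ORDER = 1 };       /* hypothetical return codes */

/* Hypothetical counter object: selects the operation by order code and
 * reports failure through the return code only. */
void counter_invoke(uint32_t *counter, const kc_request_t *req,
                    kc_reply_t *reply)
{
    reply->return_code = RC_OK;
    switch (req->order_code) {
    case ORDER_INCREMENT:
        reply->value = ++(*counter);
        break;
    case ORDER_READ:
        reply->value = *counter;
        break;
    default:
        reply->return_code = RC_UNKNOWN_ORDER;
        reply->value = 0;
        break;
    }
}
```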

### Invocation Primitives

- **CALL**: send + block for reply. Kernel synthesizes a *resume key*
  (capability to resume caller) as 4th key parameter to callee.
- **RETURN**: reply using a resume key + go back to waiting.
- **FORK**: send and continue (fire-and-forget).

### Keeper Error Handling

Every domain has a *domain keeper* slot. On hardware trap (illegal
instruction, divide-by-zero, protection fault):

1. Kernel invokes the keeper *as if* the domain had issued a CALL.
2. Keeper receives fault information in the message.
3. Keeper can fix and resume (via resume key) or terminate.
4. A non-zero return code from a key invocation triggers the keeper mechanism.

### Coyotos (EROS Successor) -- Formalized Error Model

Coyotos cleanly separates invocation-level from application-level exceptions:

**Invocation-level** (before the target processes the message):
`MalformedSyscall`, `InvalidAddress`, `AccessViolation`,
`DataAccessTypeError`, `CapAccessTypeError`, `MalformedSpace`,
`MisalignedReference`

**Application-level**: signaled via `OPR0.ex` flag bit in the reply control
word. If set, remaining parameter words contain a 64-bit exception code
plus optional info.
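
A sketch of how a caller might test the exception flag before trusting a reply. The bit position of `ex`, the number and width of the parameter words, and the `reply_t` layout are all hypothetical; only the flag-then-code shape follows the description above.

```c
#include <stdbool.h>
#include <stdint.h>

#define OPR0_EX_BIT (1u << 0)   /* hypothetical bit position of OPR0.ex */

/* Hypothetical reply: a control word plus a few parameter words. */
typedef struct {
    uint32_t opr0;
    uint64_t params[3];
} reply_t;

/* Returns true and stores the exception code when the ex flag is set.
 * Returns false, leaving *code untouched, for a normal reply. */
bool reply_exception(const reply_t *r, uint64_t *code)
{
    if ((r->opr0 & OPR0_EX_BIT) == 0) {
        return false;
    }
    *code = r->params[0];
    return true;
}
```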

### Sources

- KeyKOS architecture: https://dl.acm.org/doi/pdf/10.1145/858336.858337
- Coyotos spec: https://hydra-www.ietfng.org/capbib/cache/shapiro:coyotosspec.html
- EROS (SOSP 1999): https://sites.cs.ucsb.edu/~chris/teaching/cs290/doc/eros-sosp99.pdf

---

## 4. Plan 9 / 9P

### 9P2000 Rerror Format

```
size[4] Rerror tag[2] ename[s]
```

- `ename[s]`: variable-length UTF-8 string describing the error.
- No `Terror` message -- only servers send errors.
- String-based, not numeric. Conventional strings ("permission denied",
  "file not found") but no fixed taxonomy.
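
A minimal encoder for the `Rerror` layout above, assuming the usual 9P2000 conventions: little-endian integers, a `size` field that counts itself, a one-byte message type (107 for `Rerror`), and strings sent as a two-byte length followed by the bytes with no terminator. The function name `p9_encode_rerror` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define P9_RERROR 107   /* message type byte for Rerror */

/* Encode size[4] Rerror tag[2] ename[s] into buf. Returns the number of
 * bytes written, or 0 if the string is too long or buf is too small. */
size_t p9_encode_rerror(uint8_t *buf, size_t cap, uint16_t tag,
                        const char *ename)
{
    size_t n = strlen(ename);
    size_t total = 4 + 1 + 2 + 2 + n;   /* size, type, tag, strlen, bytes */

    if (n > 0xFFFF || total > cap) {
        return 0;
    }
    buf[0] = (uint8_t)(total & 0xFF);   /* size[4], little-endian */
    buf[1] = (uint8_t)((total >> 8) & 0xFF);
    buf[2] = (uint8_t)((total >> 16) & 0xFF);
    buf[3] = (uint8_t)((total >> 24) & 0xFF);
    buf[4] = P9_RERROR;                 /* type[1] */
    buf[5] = (uint8_t)(tag & 0xFF);     /* tag[2] */
    buf[6] = (uint8_t)(tag >> 8);
    buf[7] = (uint8_t)(n & 0xFF);       /* ename length prefix */
    buf[8] = (uint8_t)((n >> 8) & 0xFF);
    memcpy(buf + 9, ename, n);          /* ename bytes, no NUL */
    return total;
}
```

For `"file not found"` with tag 1 this yields a 23-byte message: 9 bytes of framing plus the 14-byte string.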

### 9P2000.u Extension (Unix compatibility)

```
size[4] Rerror tag[2] ename[s] errno[4]
```

Adds a 4-byte Unix errno as a *hint*. Clients should prefer the string.
The `ERRUNDEF` sentinel is sent when no Unix errno applies.
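
A sketch of a client that treats the string as authoritative and the number as a fallback. `resolve_rerror` and its lookup table are hypothetical, `ERRUNDEF_PLACEHOLDER` is an assumed sentinel value (not taken from the 9P2000.u text), and falling back to `EIO` is an arbitrary choice for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ERRUNDEF_PLACEHOLDER 0xFFFFFFFFu   /* assumed sentinel value */

/* Map an Rerror to a local errno: try the string first, then the
 * numeric hint, then a generic fallback. */
int resolve_rerror(const char *ename, uint32_t errno_hint)
{
    /* Hypothetical table of conventional strings this client knows. */
    static const struct { const char *text; int err; } known[] = {
        { "permission denied", EACCES },
        { "file not found",    ENOENT },
    };

    for (size_t i = 0; i < sizeof known / sizeof known[0]; i++) {
        if (strcmp(ename, known[i].text) == 0) {
            return known[i].err;           /* string is authoritative */
        }
    }
    if (errno_hint != ERRUNDEF_PLACEHOLDER) {
        return (int)errno_hint;            /* number is only a hint */
    }
    return EIO;                            /* nothing usable was sent */
}
```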

### Design Rationale

String errors avoid "errno fragmentation", where different Unix variants
assign different numbers to the same condition. The string is authoritative;
the number is an optimization for Unix-compatibility clients.

### Sources

- 9P2000 RFC: https://ericvh.github.io/9p-rfc/rfc9p2000.html
- 9P2000.u RFC: https://ericvh.github.io/9p-rfc/rfc9p2000.u.html

---

## 5. Genode

### RPC Exception Propagation

```cpp
GENODE_RPC_THROW(func_type, ret_type, func_name,
                 GENODE_TYPE_LIST(Exception1, Exception2, ...),
                 arg_type...)
```

Only the exception **type** crosses the boundary -- exception objects (fields,
messages) are not transferred. Server encodes a numeric `Rpc_exception_code`,
client reconstructs a default-constructed exception of the matching type.

**Undeclared exceptions**: undefined behavior (server crash or hung RPC).

### Infrastructure-Level Errors

- `RPC_INVALID_OPCODE`: the dispatched opcode does not match any function of
  the RPC interface.
- `Rpc_exception_code`: integral type, computed as
  `RPC_EXCEPTION_BASE - index_in_exception_list`.
- `Ipc_error`: kernel IPC failure (server unreachable).
- Server death: capabilities become invalid, subsequent invocations
  produce `Ipc_error`.
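
The `RPC_EXCEPTION_BASE - index_in_exception_list` encoding can be sketched as a pair of functions. Genode itself is C++; this C version only illustrates the arithmetic. The numeric values of the three constants are assumptions, and the function names are hypothetical.

```c
/* Assumed values, for illustration only. */
#define RPC_SUCCESS          0
#define RPC_INVALID_OPCODE  (-1)
#define RPC_EXCEPTION_BASE  (-1000)

/* Server side: encode the position of the thrown exception type within
 * the declared exception list. */
int rpc_exception_code(int index_in_exception_list)
{
    return RPC_EXCEPTION_BASE - index_in_exception_list;
}

/* Client side: recover the list index from a code, or return -1 if the
 * code does not name one of the num_declared exception types. */
int rpc_exception_index(int code, int num_declared)
{
    int index = RPC_EXCEPTION_BASE - code;
    return (index >= 0 && index < num_declared) ? index : -1;
}
```

Only the index crosses the boundary, which is why the client can rebuild nothing more than a default-constructed exception of the matching type.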

### Sources

- Genode RPC: https://genode.org/documentation/genode-foundations/20.05/functional_specification/Remote_procedure_calls.html
- Genode IPC: https://genode.org/documentation/genode-foundations/23.05/architecture/Inter-component_communication.html

---

## 6. Cross-System Comparison: Transport vs Application Errors

Every system surveyed here separates two failure modes:

1. **Transport errors** -- the invocation mechanism failed before the target
   processed the request (bad handle, insufficient rights, target dead,
   malformed message, timeout).

2. **Application errors** -- the service processed the request and returned
   a meaningful error (not found, resource exhausted, invalid operation).

| System | Transport errors | Application errors |
|--------|------------------|--------------------|
| seL4 | `seL4_Error` (11 values) from syscall | IPC message payload (user-defined) |
| Zircon | `zx_status_t` (~30 values) from syscall | FIDL per-method error type |
| EROS/Coyotos | Invocation exceptions (kernel) | `OPR0.ex` flag + code in reply |
| Plan 9 | Connection loss | `Rerror` with string |
| Genode | `Ipc_error` + `RPC_INVALID_OPCODE` | C++ exceptions via `GENODE_RPC_THROW` |
| Cap'n Proto RPC | `disconnected`/`unimplemented` | `failed`/`overloaded` or schema types |

Common pattern: **small kernel error code set** for transport + **typed
service-specific errors** for application.
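
One way to express the common pattern in C is a two-field outcome that is checked in a fixed order. This is a sketch under stated assumptions: the type names, the transport codes, and the convention that an application error of 0 means success are all hypothetical and not taken from any of the surveyed systems.

```c
#include <stdint.h>

/* Small, fixed transport code set (hypothetical values). */
typedef enum {
    TRANSPORT_OK = 0,
    TRANSPORT_BAD_CAP,
    TRANSPORT_PEER_GONE,
    TRANSPORT_MALFORMED
} transport_status_t;

/* Outcome of one invocation. app_error is meaningful only when
 * transport == TRANSPORT_OK, and its values belong to the interface
 * that was invoked. */
typedef struct {
    transport_status_t transport;
    int32_t            app_error;   /* 0 = success in this sketch */
} call_outcome_t;

typedef enum {
    OUTCOME_SUCCESS,
    OUTCOME_APP_ERROR,
    OUTCOME_TRANSPORT_ERROR
} outcome_kind_t;

/* Transport failures take precedence: if the request never reached the
 * service, the application field carries no information. */
outcome_kind_t classify(call_outcome_t o)
{
    if (o.transport != TRANSPORT_OK) {
        return OUTCOME_TRANSPORT_ERROR;
    }
    return o.app_error == 0 ? OUTCOME_SUCCESS : OUTCOME_APP_ERROR;
}
```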

---

## 7. POSIX errno: Strengths and Weaknesses for Capability Systems

### Strengths

- Simple (single integer, zero overhead on success).
- Universal (every Unix developer knows it).
- Low overhead (no allocation on error path).

### Weaknesses for Capability Systems

- **Ambient authority assumption**: `EACCES`/`EPERM` assume ACL-style access
  control. In capability systems, having the capability IS the permission.
- **Global flat namespace**: all errors share one integer space. Capability
  systems have typed interfaces; errors should be scoped per-interface.
- **No structured information**: just an integer, no "which argument" or
  "how much memory needed."
- **Thread-local state**: clobbered by intermediate calls, breaks down with
  async IPC or promise pipelining.
- **No transport/application distinction**: `EBADF` (transport) and `ENOENT`
  (application) in the same space.
- **Not composable across trust boundaries**: callee's errno meaningless in
  caller's address space without explicit serialization.
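
The thread-local clobbering problem is easy to reproduce. In this sketch `lookup_entry` and `log_failure` are hypothetical functions that set `errno` directly, standing in for a failing call and for a helper that makes its own failing call internally.

```c
#include <errno.h>

/* Hypothetical operation that fails with ENOENT. */
int lookup_entry(void)
{
    errno = ENOENT;
    return -1;
}

/* Hypothetical logging helper. Setting errno here stands in for an
 * internal library call that fails for an unrelated reason. */
void log_failure(void)
{
    errno = ERANGE;
}

/* Buggy: reads errno after an intermediate call has overwritten it. */
int lookup_then_log_buggy(void)
{
    if (lookup_entry() < 0) {
        log_failure();
        return errno;        /* reports ERANGE, not the original ENOENT */
    }
    return 0;
}

/* Fixed: capture the error by value before calling anything else. */
int lookup_then_log_fixed(void)
{
    if (lookup_entry() < 0) {
        int saved = errno;
        log_failure();
        return saved;        /* still ENOENT */
    }
    return 0;
}
```

The fix amounts to carrying the error as a value, which is what a reply message does by construction.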

None of the capability systems surveyed here uses a POSIX-style global errno namespace.
