Research: EROS, CapROS, and Coyotos
Deep analysis of persistent capability operating systems and their relevance to capOS.
1. EROS (Extremely Reliable Operating System)
1.1 Overview
EROS was designed and implemented by Jonathan Shapiro and collaborators at the University of Pennsylvania, starting in the late 1990s. It is a pure capability system descended from KeyKOS (developed at Key Logic in the 1980s). EROS’s defining feature is orthogonal persistence: the entire system state – processes, memory, capabilities – is transparently persistent. There is no distinction between “in memory” and “on disk.”
Key papers:
- Shapiro, J. S., Smith, J. M., & Farber, D. J. “EROS: A Fast Capability System” (SOSP 1999)
- Shapiro, J. S. “EROS: A Capability System” (PhD dissertation, 1999)
- Shapiro, J. S. & Weber, S. “Verifying the EROS Confinement Mechanism” (IEEE S&P 2000)
1.2 The Single-Level Store
In a conventional OS, memory and storage are separate address spaces with different APIs (read/write vs mmap/file I/O). The programmer is responsible for explicitly loading data from disk into memory, modifying it, and writing it back. This creates an impedance mismatch that is the source of enormous complexity (serialization, caching, crash consistency, etc.).
EROS eliminates this distinction with a single-level store:
- All objects (processes, memory pages, capability nodes) exist in a unified persistent object space.
- There is no “file system” and no “load/save.” Objects simply exist.
- The system periodically checkpoints the entire state to disk. Between checkpoints, modified pages are held in memory. After a crash, the system restores to the last consistent checkpoint.
- From the application’s perspective, memory IS storage. There is no API for persistence – it happens automatically.
The single-level store in EROS operates on two primitive object types:
- Pages – 4KB data pages (the equivalent of both memory pages and file blocks).
- Nodes – 32-slot capability containers (the equivalent of both process state and directory entries).
Every page and node has a persistent identity (an Object ID, or OID). The kernel maintains an in-memory object cache and demand-pages objects from disk as needed. Modified objects are written back during checkpoints.
1.3 Checkpoint/Restart
EROS uses a consistent checkpoint mechanism inspired by KeyKOS:
How it works:
- The kernel periodically initiates a checkpoint (KeyKOS used a 5-minute interval; EROS used a configurable interval, typically seconds to minutes).
- All processes are momentarily frozen.
- The kernel snapshots the current state:
- All dirty pages are marked for write-back.
- All node state (capability tables, process descriptors) is serialized.
- A consistent snapshot of the entire system is captured.
- Processes resume immediately – they continue modifying their own copies of pages (copy-on-write semantics ensure the checkpoint image is stable while new modifications accumulate).
- The snapshot is written to disk asynchronously while processes continue running.
- Once the write completes, the checkpoint is atomically committed (a checkpoint header on disk is updated).
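The checkpoint cycle above can be sketched as a toy model (Python, with hypothetical names – this is not actual EROS code): the freeze phase only captures references to the dirty set, mutations after the freeze accumulate for the next checkpoint, the commit atomically publishes the snapshot, and recovery falls back to the last committed image.

```python
# Conceptual sketch of EROS-style copy-on-write checkpointing.
# `committed` stands in for the on-disk checkpoint log + object store.

class CheckpointedStore:
    def __init__(self):
        self.pages = {}        # live page contents, keyed by OID
        self.dirty = set()     # OIDs modified since the last checkpoint
        self.snapshot = None   # stable image being written out
        self.committed = {}    # last atomically committed checkpoint

    def write(self, oid, data):
        # Overwriting the live page does not disturb the snapshot:
        # the snapshot dict still holds the frozen version
        # (copy-on-write at page granularity).
        self.pages[oid] = data
        self.dirty.add(oid)

    def begin_checkpoint(self):
        # "Freeze" phase: cost is O(dirty set), not O(system size).
        self.snapshot = {oid: self.pages[oid] for oid in self.dirty}
        self.dirty = set()   # new mutations accumulate for the next cycle

    def commit_checkpoint(self):
        # Asynchronous write-back finished: atomically publish the header.
        self.committed.update(self.snapshot)
        self.snapshot = None

    def recover(self):
        # Crash recovery: discard uncommitted state, resume from checkpoint.
        self.pages = dict(self.committed)
        self.dirty = set()
```

A mutation made between `begin_checkpoint` and `commit_checkpoint` is not part of the committed image, so recovery rolls it back – exactly the "lose a few seconds of work" behavior described below.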
What state is captured:
- All memory pages (dirty pages since last checkpoint).
- All nodes (capability slots, process registers, scheduling state).
- The kernel’s object table (mapping OIDs to disk locations).
- The capability graph (which process holds which capabilities).
Recovery after crash:
- On boot, the kernel reads the last committed checkpoint header.
- The system resumes from that exact state. All processes continue as if nothing happened (they may have lost a few seconds of work since the last checkpoint).
- No fsck, no journal replay, no application-level recovery logic.
Performance characteristics:
- Checkpoint cost is proportional to the number of dirty pages since the last checkpoint, not total system size.
- Copy-on-write minimizes pause time – processes are frozen only long enough to mark pages, not to write them.
- EROS achieved checkpoint times of a few milliseconds for the freeze phase, with asynchronous write-back taking longer depending on dirty set size.
- The 1999 SOSP paper reported IPC performance within 2x of L4 (the fastest microkernel at the time) despite the persistence overhead.
1.4 Capabilities: Keys, Nodes, and Domains
EROS (following KeyKOS) uses a specific capability model with three fundamental concepts:
Keys (capabilities):
A key is an unforgeable reference to an object. Keys are the ONLY way to access anything in the system. There are several types:
- Page keys – reference a persistent page. Can be read-only or read-write.
- Node keys – reference a node (a 32-slot capability container). Can be read-only or read-write.
- Process keys (called “domain keys” in KeyKOS) – reference a process, allowing control operations (start, stop, set registers).
- Number keys – encode a 96-bit value directly in the key (no indirection). Used for passing constants through the capability mechanism.
- Device keys – reference hardware device registers.
- Forwarder keys – indirection keys used for revocation (see below).
- Void keys – null/invalid keys, used as placeholders.
Nodes:
A node is a persistent container of exactly 32 key slots (the exact slot count differed slightly between KeyKOS and EROS). Nodes serve multiple purposes:
- Address space description: A tree of nodes with page keys at the leaves defines a process’s virtual address space. The kernel walks this tree to resolve virtual addresses to physical pages (analogous to page tables, but persistent and capability-based).
- Capability storage: A process’s “capability table” is a node tree.
- General-purpose data structure: Any capability-based data structure (directories, lists, etc.) is built from nodes.
Domains (processes):
A domain is EROS’s equivalent of a process. It consists of:
- A domain root node with specific slots for:
- Slots 0-15: general-purpose key registers (the process’s capability table)
- Address space key (points to the root of the address space node tree)
- Schedule key (determines CPU time allocation)
- Brand key (identity for authentication)
- Other control keys
- The domain’s register state (general-purpose registers, IP, SP, flags)
- A state (running, waiting, available)
The entire domain state is captured during checkpoint because it’s all stored in persistent nodes and pages.
1.5 The Keeper Mechanism
Each domain has a keeper key – a capability to another domain that acts as its fault handler. When a domain faults (page fault, capability fault, exception), the kernel invokes the keeper:
- The faulting domain is suspended.
- The kernel sends a message to the keeper describing the fault.
- The keeper can inspect and modify the faulting domain’s state (via the domain key), fix the fault (e.g., map a page, supply a capability), and restart it.
This is EROS’s equivalent of signal handlers or exception ports, but capability-mediated and fully general. Keepers enable:
- Demand paging (the space bank keeper maps pages on fault)
- Capability interposition (a keeper can wrap/restrict capabilities)
- Process supervision (restart on crash)
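The keeper upcall can be sketched in a few lines (Python, hypothetical names – `Domain`, `DemandPagingKeeper`, and `access` are illustrative, not EROS APIs): on a missing-page access, the kernel suspends the domain and lets the keeper repair its state before restarting it.

```python
# Sketch of the keeper protocol: the kernel suspends the faulting
# domain, messages its keeper with a fault descriptor, and the keeper
# fixes the domain's state and restarts it.

class Domain:
    def __init__(self, keeper):
        self.keeper = keeper
        self.address_space = {}   # virtual page number -> page data
        self.state = "running"

def access(domain, vpage):
    """Kernel-side access path: fault and upcall to the keeper if needed."""
    if vpage not in domain.address_space:
        domain.state = "waiting"                 # suspend the domain
        domain.keeper.handle_fault(domain, vpage)  # upcall to the keeper
    return domain.address_space[vpage]

class DemandPagingKeeper:
    """A keeper that implements demand paging: supply a zero page on fault."""
    def handle_fault(self, domain, vpage):
        domain.address_space[vpage] = b"\x00" * 4096  # map the page
        domain.state = "running"                      # restart the domain
```

The same structure supports the other uses listed above: an interposing keeper would inspect or rewrite the fault instead of simply mapping a page.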
1.6 Capability Revocation
Capability revocation – the ability to invalidate all copies of a capability – is one of the hardest problems in capability systems. EROS solves it with forwarder keys:
How forwarders work:
- Instead of giving a client a direct key to a resource, the server creates a forwarder node.
- The forwarder contains a key to the real resource in one of its slots.
- The client receives a key to the forwarder, not the resource.
- When the client invokes the forwarder key, the kernel transparently redirects to the real resource.
- To revoke: the server rescinds the forwarder (sets a bit on the forwarder node). All outstanding forwarder keys become void keys. Invocations fail immediately.
Properties:
- Revocation is O(1) – flip a bit on the forwarder node. No need to scan all processes for copies.
- Revocation is transitive – if the revoked key was used to derive other keys (via further forwarders), those are also invalidated.
- The client cannot distinguish a forwarder key from a direct key (the kernel handles the indirection transparently).
- Revocation is immediate and permanent – a rescinded forwarder cannot be reinstated.
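These properties can be illustrated with a small sketch (Python, hypothetical names – not the kernel's actual representation): every client key aliases the same forwarder object, so rescinding it voids all copies at once without scanning holders.

```python
# Sketch of O(1) revocation via a forwarder. Clients hold keys to the
# forwarder, not the resource; rescinding flips one flag and every
# outstanding key behaves like a void key.

class RevokedError(Exception):
    pass

class Forwarder:
    def __init__(self, target):
        self.target = target
        self.rescinded = False

    def invoke(self, method, *args):
        if self.rescinded:
            raise RevokedError("key is void")       # behaves like a void key
        return getattr(self.target, method)(*args)  # transparent forwarding

class Counter:
    """Stand-in for the real resource behind the forwarder."""
    def __init__(self):
        self.n = 0
    def bump(self):
        self.n += 1
        return self.n

fwd = Forwarder(Counter())
key_a = fwd   # two clients hold copies of the same forwarder key
key_b = fwd
```

Before the rescind, both keys work and the client cannot tell it is invoking through an indirection; after `fwd.rescinded = True` (the server's rescind), every copy fails immediately.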
Space banks and revocation:
EROS uses space banks (inspired by KeyKOS) to manage resource allocation. A space bank is a capability that allocates pages and nodes. When a space bank is destroyed, ALL objects allocated from it are reclaimed. This provides bulk revocation of an entire subsystem.
1.7 Confinement
EROS provides a formally verified confinement mechanism. A confined subsystem cannot leak information to the outside world except through channels explicitly provided to it. Shapiro and Weber (IEEE S&P 2000) proved that EROS can construct a confined subsystem. The construction works as follows:
- A constructor creates the confined process.
- The confined process receives ONLY the capabilities explicitly granted to it. It has no ambient authority, no access to timers (to prevent timing channels), and no access to storage (to prevent storage channels).
- The constructor verifies that no covert channels exist in the granted capability set.
This is relevant to capOS’s capability model: the same structural properties that make EROS confinement possible (no ambient authority, capabilities as the only access mechanism) are present in capOS’s design.
2. CapROS
2.1 Relationship to EROS
CapROS (Capability-based Reliable Operating System) is the direct successor to EROS. It was started by Charles Landau (who also worked on KeyKOS) and continues development based on the EROS codebase. CapROS is essentially “EROS in production” – the same architecture with engineering improvements.
2.2 Improvements Over EROS
Practical engineering focus:
- EROS was a research system; CapROS aims to be deployable.
- CapROS added support for modern hardware (PCI, USB, networking).
- Improved build system and development toolchain.
Persistence improvements:
- CapROS refined the checkpoint mechanism for better performance with modern disk characteristics (SSDs change the cost model significantly – random writes are cheap, so the checkpoint layout can be optimized differently than for spinning disks).
- Added support for larger persistent object spaces.
- Improved crash recovery speed.
Device driver model:
- CapROS runs device drivers as user-space processes (like EROS), each receiving only the device capabilities they need.
- A device driver receives: device register keys (MMIO access), interrupt keys (to receive interrupts), and DMA buffer keys.
- The driver CANNOT access other devices, other processes’ memory, or arbitrary I/O ports. It is confined to its specific device.
- This is directly analogous to capOS’s planned device capability model (see the networking and cloud deployment proposals).
Linux compatibility layer:
- CapROS includes a partial Linux kernel compatibility layer that allows some Linux device drivers to be compiled and run as CapROS user-space drivers. This pragmatically addresses the “driver availability” problem without compromising the capability model.
2.3 Current Status
CapROS development continued into the 2010s but has been relatively quiet. The codebase exists and runs on real x86 hardware. It is not widely deployed and remains primarily a research/demonstration system. The key contribution is demonstrating that the EROS/KeyKOS persistent capability model is viable on modern hardware and can support real device drivers and applications.
2.4 Device Drivers and Hardware Access
CapROS’s device driver isolation is worth examining in detail because capOS faces the same design decisions:
Device capability model:
```
Kernel
 │
 ├── DeviceManager capability
 │      │
 │      ├── grants DeviceMMIO(base, size) to driver
 │      ├── grants InterruptCap(irq_number) to driver
 │      └── grants DMAPool(phys_range) to driver
 │
 └── Driver process
        │
        ├── uses DeviceMMIO to read/write registers
        ├── uses InterruptCap to wait for interrupts
        ├── uses DMAPool to allocate DMA-safe buffers
        └── exports higher-level capability (e.g., NIC, Block)
```
The driver has no way to access memory outside its granted ranges. A buggy NIC driver cannot corrupt disk I/O or access other processes’ pages.
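The confinement property can be sketched from the driver's side (Python, hypothetical names following the diagram above – `DeviceMMIO` here is an illustration, not the CapROS API): the MMIO capability only reaches the granted register window, so every out-of-range access is refused by construction.

```python
# Sketch of an MMIO capability: the driver can only touch offsets
# inside the window it was granted by the DeviceManager.

class CapabilityError(Exception):
    pass

class DeviceMMIO:
    def __init__(self, phys_mem, base, size):
        self._mem, self._base, self._size = phys_mem, base, size

    def _check(self, offset):
        if not 0 <= offset < self._size:
            raise CapabilityError("access outside granted MMIO window")

    def read32(self, offset):
        self._check(offset)
        return self._mem[self._base + offset]

    def write32(self, offset, value):
        self._check(offset)
        self._mem[self._base + offset] = value

phys = {}                                 # stand-in for physical address space
nic_regs = DeviceMMIO(phys, base=0x1000, size=0x100)
nic_regs.write32(0x00, 0xDEADBEEF)        # fine: inside the NIC's window
```

An access at offset `0x200` (outside the 0x100-byte grant) raises rather than reaching another device's registers – the buggy-NIC-driver scenario described above simply cannot occur.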
3. Coyotos
3.1 Design Philosophy
Coyotos was Jonathan Shapiro’s next-generation project after EROS, started around 2004. Where EROS was an implementation of the KeyKOS model in C, Coyotos aimed to be a formally verifiable capability OS from the ground up.
Key differences from EROS:
- Verification-oriented design: Every kernel mechanism was designed to be amenable to formal verification. If a feature couldn’t be verified, it was redesigned or removed.
- BitC language: A new programming language (BitC) was designed specifically for writing verified systems software.
- Simplified object model: Coyotos reduced the number of primitive object types compared to EROS, making the verification target smaller.
- No inline assembly in the verified core: The verified kernel core was to be written entirely in BitC, with a thin hardware abstraction layer underneath.
3.2 BitC Language
BitC was an ambitious attempt to create a language suitable for both systems programming and formal verification:
Design goals:
- Type safety: Sound type system that prevents memory errors at compile time.
- Low-level control: Direct memory layout control, no garbage collector, suitable for kernel code.
- Formal reasoning: Type system designed so that proofs about programs could be mechanically checked.
- Mutability control: Explicit distinction between mutable and immutable references (predating Rust’s borrow checker by several years).
Relationship to capability verification:
The key insight was that if the kernel is written in a language with a sound type system, and capabilities are represented as typed references in that language, then many capability safety properties (no forgery, no amplification) follow from type safety rather than requiring separate proofs.
Specifically:
- Capabilities are opaque typed references – the type system prevents construction of capabilities from raw integers.
- The lack of arbitrary pointer arithmetic prevents capability forgery.
- Type-based access control means a read-only capability reference cannot be cast to a read-write one.
Outcome:
BitC was never completed. The language design proved extremely difficult – combining low-level systems programming with formal verification requirements created unsolvable tensions in the type system. Shapiro eventually acknowledged that the BitC approach was overambitious and shelved the project. (Rust, which appeared later, solved many of the same problems with a different approach – borrowing and lifetimes rather than full dependent types.)
3.3 Formal Verification Approach
Coyotos aimed to verify several key properties:
- Capability safety: No process can forge, modify, or amplify a capability. This was to be proved as a consequence of BitC’s type safety.
- Confinement: A confined subsystem cannot leak information except through authorized channels. EROS proved this informally; Coyotos aimed for machine-checked proofs.
- Authority propagation: Formal model of how authority flows through the capability graph, allowing static analysis of security policies.
- Memory safety: The kernel never accesses memory it shouldn’t, never double-frees, never uses after free. Type safety + linear types in BitC were intended to guarantee this.
The verification approach influenced later work on seL4, which successfully achieved formal verification of a capability microkernel (though in C with Isabelle/HOL proofs, not in a verification-oriented language).
3.4 Coyotos Memory Model
Coyotos simplified the EROS memory model while retaining persistence:
Objects:
- Pages: 4KB data pages (same as EROS).
- CapPages: Pages that hold capabilities instead of data. This replaced EROS’s fixed-size nodes with variable-size capability containers.
- GPTs (Guarded Page Tables): A unified abstraction for address space construction. Instead of EROS’s separate node trees for address spaces, Coyotos uses GPTs that combine guard bits (for sparse address space construction, similar to Patricia trees) with page table semantics.
- Processes: Similar to EROS domains but with a cleaner structure.
- Endpoints: IPC communication endpoints (similar to L4 endpoints, replacing EROS’s direct domain-to-domain calls).
GPTs (Guarded Page Tables):
This was Coyotos’s most innovative memory model contribution. A GPT node has:
- A guard value and guard length (for address space compression).
- Multiple capability slots pointing to sub-GPTs or pages.
- Hardware-independent address space description that the kernel translates to actual page tables on TLB miss.
The guard mechanism allows sparse address spaces without allocating intermediate page table levels. For example, a process that uses only two memory regions at addresses 0x1000 and 0x7FFF_F000 needs only a few GPT nodes, not a full 4-level page table tree.
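A toy translation walk shows the idea (Python; this is a simplification of the Coyotos GPT semantics, and all names are illustrative). A single GPT with a 19-bit guard maps an 8 KB region near the top of a 32-bit address space with two pages – no intermediate table levels.

```python
# Toy model of guarded page table translation: the guard matches and
# strips a run of high address bits, then slot bits index a child.

class Page:
    def __init__(self, name):
        self.name = name

class GPT:
    def __init__(self, guard, guard_bits, slot_bits, slots):
        self.guard, self.guard_bits = guard, guard_bits
        self.slot_bits, self.slots = slot_bits, slots

def translate(node, addr, addr_bits):
    while isinstance(node, GPT):
        # 1. The top guard_bits of the address must equal the guard.
        top = addr >> (addr_bits - node.guard_bits)
        if top != node.guard:
            raise KeyError("address not mapped (guard mismatch)")
        addr_bits -= node.guard_bits
        addr &= (1 << addr_bits) - 1
        # 2. The next slot_bits select a slot.
        index = addr >> (addr_bits - node.slot_bits)
        addr_bits -= node.slot_bits
        addr &= (1 << addr_bits) - 1
        node = node.slots[index]
    return node   # a Page; remaining bits in `addr` are the page offset

# One GPT maps 0x7FFFE000..0x7FFFFFFF (guard = top 19 bits, 1 slot bit
# selects between two 4 KB pages):
root = GPT(guard=0x3FFFF, guard_bits=19, slot_bits=1,
           slots=[Page("p0"), Page("p1")])
```

Any address whose top 19 bits differ from the guard faults immediately, so the sparse region costs one node instead of a multi-level page table tree.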
Persistence:
Coyotos retained EROS’s checkpoint-based persistence but with a cleaner separation between the persistent object store and the in-memory cache. The simpler object model (fewer object types) made the checkpoint logic easier to verify.
3.5 Current Status
Coyotos was never completed. The BitC language proved too difficult, and Shapiro moved on to other work. However, Coyotos’s design documents and specifications remain valuable as a carefully reasoned evolution of the EROS model. The key ideas (GPTs, endpoint-based IPC, verification-oriented design) influenced other systems work.
4. Single-Level Store: Deep Dive
4.1 The Core Concept
The single-level store unifies two traditionally separate abstractions:
| Traditional OS | Single-Level Store |
|---|---|
| Virtual memory (RAM, volatile) | Unified persistent object space |
| File system (disk, persistent) | Same unified space |
| mmap (bridge between the two) | No bridge needed |
| Serialization (convert objects to bytes for storage) | Objects are always in storable form |
| Crash recovery (fsck, journal replay) | Checkpoint restore |
In a single-level store, the programmer never thinks about persistence. Objects are created, modified, and eventually garbage collected. The system ensures they survive power failure without any explicit save operation.
4.2 Implementation in EROS
EROS’s single-level store works as follows:
Object storage on disk:
- The disk is divided into two regions: the object store and the checkpoint log.
- The object store holds the canonical copy of all objects (pages and nodes), indexed by OID.
- The checkpoint log holds the most recently checkpointed versions of modified objects.
Object lifecycle:
- An object is created (allocated from a space bank). It receives a unique OID.
- The object exists in the in-memory object cache. It may be modified arbitrarily.
- During checkpoint, if the object is dirty, its current state is written to the checkpoint log.
- After the checkpoint commits, the logged version may be migrated to the object store (or left in the log until the next checkpoint).
- If the object is evicted from memory (memory pressure), it can be demand-paged back from disk.
Demand paging:
When a process accesses a virtual address that isn’t currently in physical memory:
- Page fault occurs.
- The kernel looks up the OID for that virtual page (by walking the address space capability tree).
- If the object is on disk, the kernel reads it into the object cache.
- The page is mapped into the process’s address space.
- The process continues, unaware that I/O occurred.
This is similar to demand paging in a conventional OS, but with a critical difference: the “backing store” is the persistent object store, not a swap partition. There is no separate swap space.
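The fault path above can be condensed into a sketch of the object cache (Python, hypothetical names – a deliberately tiny model, not EROS's cache): objects are fetched from the persistent store by OID on first touch, and eviction writes back to the same store rather than to a swap partition.

```python
# Sketch of the kernel's object cache: demand-paged by OID from the
# persistent object store. There is no separate swap space.

class ObjectCache:
    def __init__(self, disk_store, capacity=2):
        self.disk = disk_store      # OID -> bytes, the persistent store
        self.cache = {}             # in-memory copies
        self.capacity = capacity
        self.faults = 0             # count of "page faults" (disk reads)

    def get(self, oid):
        if oid not in self.cache:
            self.faults += 1        # fault: fetch the object from disk
            if len(self.cache) >= self.capacity:
                evicted, data = self.cache.popitem()
                # Write back before eviction (a real kernel would skip
                # clean objects and track dirtiness per page).
                self.disk[evicted] = data
            self.cache[oid] = self.disk[oid]
        return self.cache[oid]
```

A repeated access hits the cache without touching disk, which is the "process continues, unaware that I/O occurred" behavior of the fault path.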
4.3 Performance Implications
Advantages:
- No serialization overhead for persistence. Objects are stored in their in-memory format.
- No double-buffering. A conventional OS may have a page in both the page cache and a file buffer; EROS has one copy.
- Checkpoint cost is proportional to mutation rate, not data size.
- Recovery is instantaneous – resume from last checkpoint, no log replay.
Disadvantages:
- Checkpoint pause: Even with copy-on-write, there is a brief pause to snapshot the system state. KeyKOS/EROS measured this at milliseconds, but it can grow with the number of dirty pages.
- Write amplification: Every modified page must be written to the checkpoint log, even if only one byte changed. This is worse than a log-structured filesystem that can coalesce small writes.
- Memory pressure: The object cache competes with application working sets. Under heavy memory pressure, the system may thrash between paging objects in and checkpointing them out.
- Large object stores: The OID-to-disk-location mapping must be kept in memory (or itself paged, adding complexity). For very large stores, this overhead grows.
- No partial persistence: You can’t choose to make some objects transient and others persistent. Everything is persistent. This wastes disk bandwidth on objects that don’t need persistence (temporary buffers, caches, etc.).
4.4 Relationship to Persistent Memory (PMEM/Optane)
Intel Optane (3D XPoint, now discontinued but conceptually important) and other persistent memory technologies provide byte-addressable storage that survives power loss. This is remarkably close to what EROS simulates in software:
| EROS Single-Level Store | PMEM Hardware |
|---|---|
| Software checkpoint to disk | Hardware persistence on every write |
| Object cache in DRAM | Data in persistent memory |
| Demand paging from disk | Direct load/store to persistent media |
| Crash = lose since last checkpoint | Crash = lose in-flight stores (cache lines) |
PMEM makes the single-level store cheaper:
- No checkpoint writes needed for objects stored in PMEM – they’re already persistent.
- No demand paging from disk – PMEM is directly addressable.
- Consistency requires cache line flush + fence (much cheaper than disk I/O).
But PMEM doesn’t eliminate the need for the store abstraction:
- PMEM capacity is limited (compared to SSDs/HDDs). The object store may still need to tier between PMEM and block storage.
- PMEM has higher latency than DRAM. The object cache still has value as a fast-path.
- Crash consistency with PMEM requires careful ordering of writes (cache line flushes). The checkpoint model actually simplifies this – you don’t need per-object crash consistency, just per-checkpoint consistency.
Relevance to capOS:
Even without PMEM hardware, understanding the single-level store model informs how capOS can design its persistence layer. The key insight is that separating “in-memory format” from “on-disk format” creates unnecessary complexity. Cap’n Proto’s zero-copy serialization already blurs this line – a capnp message in memory has the same byte layout as on disk.
5. Persistent Capabilities
5.1 How Persistent Capabilities Survive Restarts
In EROS/KeyKOS, capabilities survive restarts because they are part of the checkpointed state:
- A capability is stored as a key in a node slot.
- The key contains: (object type, OID, permissions, other metadata).
- During checkpoint, all nodes (including their key slots) are written to disk.
- On restart, nodes are restored. Keys reference objects by OID. Since objects are also restored, the key resolves to the same object.
The critical property: capabilities are named by the persistent identity of their target, not by a volatile address. A key says “page #47293” not “memory address 0x12345.” Since page #47293 is persistent, the key is meaningful across restarts.
5.2 Consistency Model
EROS guarantees checkpoint consistency: the entire system is restored to the state at the last committed checkpoint. This means:
- If process A sent a message to process B, and both the send and receive completed before the checkpoint, both see the message after restart.
- If the send happened after the last checkpoint, the restart rolls both processes back to before the send. The message is lost, but so is the sender’s record of having sent it – the system remains consistent.
- There is no scenario where A thinks it sent a message but B never received it (or vice versa). The checkpoint captures a consistent global snapshot.
This is analogous to database transaction atomicity but applied to the entire system state.
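The consistent-cut property can be made concrete with a toy model (Python, hypothetical names – a deliberately minimal simulation, not EROS's actual mechanism): the snapshot captures both endpoints and the in-flight message queue together, so recovery never produces a half-delivered message.

```python
# Sketch of checkpoint consistency for IPC: the checkpoint is a global
# cut over sender state, receiver state, and in-flight messages.

class System:
    def __init__(self):
        self.state = {"A_sent": 0, "B_received": 0, "in_flight": []}
        self.checkpoint = {"A_sent": 0, "B_received": 0, "in_flight": []}

    def send(self, msg):
        self.state["A_sent"] += 1
        self.state["in_flight"].append(msg)

    def receive(self):
        msg = self.state["in_flight"].pop(0)
        self.state["B_received"] += 1
        return msg

    def take_checkpoint(self):
        # Global snapshot: all processes and in-flight state together.
        self.checkpoint = {k: (list(v) if isinstance(v, list) else v)
                           for k, v in self.state.items()}

    def crash_and_recover(self):
        self.state = {k: (list(v) if isinstance(v, list) else v)
                      for k, v in self.checkpoint.items()}
```

If the checkpoint lands between send and receive, recovery restores the message to the in-flight queue: the receive is rolled back, but the sender never believes something the receiver contradicts.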
5.3 Volatile State and Capabilities
Some capabilities reference inherently volatile state. EROS handles this through the object re-creation pattern:
Hardware devices:
- Device keys reference hardware registers that don’t survive reboot.
- On restart, the kernel re-initializes device state and re-creates device keys.
- Processes that held device keys get valid keys again (pointing to the re-initialized device), but the device state itself is reset.
- The process’s device driver is responsible for re-initializing the device to the desired state (this is application logic, not kernel logic).
Network connections:
- EROS doesn’t have a native networking stack in the kernel, so this is handled at the application level.
- A network service process re-establishes connections on restart.
- Clients that held capabilities to network endpoints would invoke them, and the network service would transparently reconnect.
- The capability abstraction hides the reconnection – the client’s code doesn’t change.
General pattern:
When a capability references state that can’t survive restart:
- The capability itself persists (it’s in a node slot, checkpointed).
- On restart, invoking the capability may trigger re-initialization.
- The keeper mechanism handles this: the target object’s keeper detects the stale state and re-initializes before completing the call.
- The client is unaware of the restart (or sees a transient error if re-initialization fails).
5.4 The Space Bank Model
Persistent capabilities create a garbage collection problem: when is it safe to reclaim a persistent object? EROS solves this with space banks:
- A space bank is a capability that allocates objects (pages and nodes).
- Every object is allocated from exactly one space bank.
- Space banks can be hierarchical (a bank allocates from a parent bank).
- Destroying a space bank reclaims ALL objects allocated from it.
This provides:
- Bulk deallocation: Terminate a subsystem by destroying its bank.
- Resource accounting: Each bank tracks how much space it has consumed.
- Revocation: Destroying a bank revokes all capabilities to objects allocated from it (the objects cease to exist).
The space bank model avoids the need for a global garbage collector scanning the capability graph. Instead, resource lifetimes are explicitly managed through the bank hierarchy.
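A hierarchical bank is simple to sketch (Python, hypothetical names – illustrative, not the EROS space bank protocol): every object belongs to exactly one bank, accounting rolls up the subtree, and destroying a bank reclaims the subtree in bulk.

```python
# Sketch of hierarchical space banks: bulk deallocation, accounting,
# and revocation-by-destruction without a global garbage collector.

import itertools

_oids = itertools.count(1)   # global OID allocator

class SpaceBank:
    def __init__(self, parent=None):
        self.parent = parent
        self.objects = set()     # OIDs allocated directly from this bank
        self.children = []
        self.alive = True
        if parent:
            parent.children.append(self)

    def alloc(self):
        assert self.alive, "bank destroyed"
        oid = next(_oids)
        self.objects.add(oid)
        return oid

    def destroy(self):
        # Reclaim all objects from this bank and every sub-bank.
        for child in self.children:
            child.destroy()
        self.objects.clear()
        self.alive = False

    def usage(self):
        # Resource accounting: space consumed by this bank's subtree.
        return len(self.objects) + sum(c.usage() for c in self.children)
```

Terminating a subsystem is then one call on its bank: everything it allocated ceases to exist, and every capability to those objects is implicitly revoked.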
6. Relevance to capOS
6.1 Cap’n Proto as Persistent Capability Format
EROS stores capabilities as (type, OID, permissions) tuples in fixed-size node slots. capOS can do something analogous but more naturally, because Cap’n Proto already provides a serialization format for structured data:
A persistent capability in capOS could be a capnp struct:
```capnp
struct PersistentCapRef {
  interfaceId  @0 :UInt64;  # which capability interface
  objectId     @1 :UInt64;  # persistent object identity
  permissions  @2 :UInt32;  # bitmask of allowed methods
  epoch        @3 :UInt64;  # revocation epoch (see below)
}
```
Why this works well with Cap’n Proto:
- Zero-copy persistence: A capnp message in memory has the same byte layout as on disk. No serialization/deserialization step for persistence. This is the closest a modern system can get to EROS’s single-level store without hardware support.
- Schema evolution: Cap’n Proto’s backwards-compatible schema evolution means persistent capability formats can evolve without breaking existing stored capabilities.
- Cross-machine references: The same `PersistentCapRef` can reference a local or remote object. The `objectId` can include a machine/node identifier for distributed capabilities.
- Type safety: The `interfaceId` field provides runtime type checking that EROS’s keys lacked (EROS keys are untyped references; the type is determined by the target object, not the key).
Difference from EROS:
EROS capabilities are kernel objects – the kernel knows about every key and
mediates every invocation. In capOS, PersistentCapRef could be a
user-space construct – a serialized reference that is resolved by the
kernel (or a userspace capability manager) when invoked. This is a
deliberate trade-off: less kernel complexity, more flexibility, but the
kernel must validate references on use rather than at creation time.
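The validate-on-use model, including the `epoch` field from the struct above, can be sketched as follows (Python, with a hypothetical `CapManager` API – a design sketch, not a capOS implementation): the resolver checks interface, then epoch, so bumping an object's epoch revokes every outstanding serialized reference at once.

```python
# Sketch of resolving a serialized PersistentCapRef on use: interface
# check (type safety) plus epoch check (revocation).

from dataclasses import dataclass

@dataclass(frozen=True)
class PersistentCapRef:
    interface_id: int
    object_id: int
    permissions: int
    epoch: int

class CapManager:
    def __init__(self):
        self.objects = {}   # object_id -> [interface_id, current_epoch, obj]

    def register(self, object_id, interface_id, obj):
        self.objects[object_id] = [interface_id, 0, obj]

    def revoke(self, object_id):
        # Bump the epoch: every previously serialized ref becomes stale.
        self.objects[object_id][1] += 1

    def resolve(self, ref):
        interface_id, epoch, obj = self.objects[ref.object_id]
        if ref.interface_id != interface_id:
            raise TypeError("interface mismatch")
        if ref.epoch != epoch:
            raise PermissionError("capability revoked (stale epoch)")
        return obj
```

This is the trade-off stated above in miniature: the manager pays a lookup and two comparisons on every invocation, in exchange for references that can live outside the kernel.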
6.2 Checkpoint/Restart Patterns for capOS
EROS’s checkpoint model provides several patterns capOS could adopt:
Pattern 1: Application-Level Checkpointing (Recommended as Phase 1)
This is what capOS’s storage proposal already describes: services serialize their own state to the Store capability. This is simpler than EROS’s transparent persistence but requires application cooperation.
Service state → capnp serialize → Store.put(data) → persistent hash
On restart: Store.get(hash) → capnp deserialize → restore state
Advantages over EROS transparent persistence:
- No kernel complexity for checkpointing.
- Services control what is persistent and what is transient.
- No “checkpoint pause” – services choose when to persist.
- Natural fit with Cap’n Proto (state is already capnp).
Disadvantages:
- Every service must implement save/restore logic.
- No automatic consistency across services (each saves independently).
- Programmer error can lead to inconsistent state after restart.
Pattern 2: Kernel-Assisted Checkpointing (Phase 2)
Add a Checkpoint capability that captures process state:
```capnp
interface Checkpoint {
  # Save the calling process's state (registers, memory, cap table)
  save @0 () -> (handle :Data);

  # Restore a previously saved state
  restore @1 (handle :Data) -> ();
}
```
This is analogous to CRIU (Checkpoint/Restore in Userspace) on Linux but capability-mediated:
- The kernel captures the process’s address space, register state, and capability table.
- State is serialized as capnp messages and stored via the Store capability.
- Restore creates a new process from the saved state.
Advantages:
- Transparent to the application (no save/restore logic needed).
- Can capture the full capability graph of a process.
- Enables process migration between machines.
Disadvantages:
- Kernel complexity for state capture.
- Must handle capabilities that reference volatile state (open network connections, device handles).
- Memory overhead for copy-on-write snapshots.
Pattern 3: Consistent Multi-Process Checkpointing (Phase 3)
EROS’s global checkpoint extended to capOS:
- A `CheckpointCoordinator` service initiates a distributed snapshot.
- All participating services freeze, checkpoint their state, then resume.
- The coordinator records a consistent cut across all services.
- Recovery restores all services to the same consistent point.
This requires:
- A coordination protocol (similar to distributed database commit).
- Services must participate in the protocol (register with the coordinator, respond to freeze/checkpoint/resume signals).
- The coordinator must handle failures during the checkpoint itself.
This is the most complex option but provides the strongest consistency guarantees. It’s appropriate for capOS’s later stages when multi-service reliability matters.
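The freeze/checkpoint/resume cycle can be sketched as follows (Python, hypothetical names – a single-machine toy of the coordination protocol, ignoring network failures and timeouts): phase one freezes all participants, phase two records the cut, and services always resume even if a participant fails mid-checkpoint.

```python
# Sketch of the CheckpointCoordinator protocol: a two-phase
# freeze/snapshot cycle that records a consistent cut across services.

class Service:
    def __init__(self, name):
        self.name, self.state, self.frozen = name, {}, False
    def freeze(self):
        self.frozen = True
    def checkpoint(self):
        assert self.frozen, "must freeze before checkpointing"
        return dict(self.state)
    def resume(self):
        self.frozen = False

class CheckpointCoordinator:
    def __init__(self):
        self.services = []
        self.last_cut = None   # last successfully recorded consistent cut

    def register(self, service):
        self.services.append(service)

    def run_checkpoint(self):
        cut = {}
        try:
            for s in self.services:      # phase 1: freeze everyone
                s.freeze()
            for s in self.services:      # phase 2: snapshot each state
                cut[s.name] = s.checkpoint()
            self.last_cut = cut          # commit the cut atomically
        finally:
            for s in self.services:      # always unfreeze
                s.resume()
```

Mutations after the cycle do not leak into the recorded cut, which is what makes restoring all services from `last_cut` consistent.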
6.3 Capability-Native Filesystem Design
EROS’s model and capOS’s Store proposal can be synthesized into a capability-native filesystem design:
Hybrid approach: Content-Addressed Store + Capability Metadata
capOS’s current Store proposal uses content-addressed storage (hash-based). This is good for immutable data but awkward for capability references (a capability’s target may change without the capability itself changing).
A better model, informed by EROS:
Persistent Object = (ObjectId, Version, CapnpData, CapSlots[])
Where:
- `ObjectId` is a persistent identity (like EROS’s OID).
- `Version` is a monotonic counter (for optimistic concurrency).
- `CapnpData` is the object’s data payload as a capnp message.
- `CapSlots[]` is a list of capability references embedded in the object (like EROS’s node slots).
This separates data from capability references, which is important because:
- Data can be content-addressed (deduplicated by hash).
- Capability references must be identity-addressed (two identical-looking references to different objects are different).
- Revocation operates on capability references, not data.
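A minimal sketch of this record shape, with hashing standing in for the content-addressed store. The field and class names mirror the text; the `CapSlot` layout is an assumption for illustration:

```python
# Sketch of the (ObjectId, Version, CapnpData, CapSlots[]) record described
# above. Note the content hash covers only the data payload: capability
# slots are identity references and deliberately excluded, so identical
# data deduplicates even when the capability graph differs.

import hashlib
from dataclasses import dataclass, field

@dataclass
class CapSlot:
    target_oid: int        # identity-addressed: which object it points at
    permissions: int       # attenuation mask

@dataclass
class PersistentObject:
    object_id: int                          # stable identity (like EROS's OID)
    version: int                            # monotonic, for optimistic concurrency
    data: bytes                             # capnp-encoded payload
    cap_slots: list = field(default_factory=list)

    def content_hash(self):
        return hashlib.sha256(self.data).hexdigest()
```

Two objects with the same payload but different capability slots share a content hash (one stored blob) while remaining distinct identities.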
The Namespace as Directory
capOS’s Namespace capability is the capability-native equivalent of a directory:
| Unix | EROS | capOS |
|---|---|---|
| Directory (inode + dentries) | Node with keys in slots | Namespace capability |
| Path traversal | Node tree walk | Namespace.resolve() chain |
| Permission bits | Key type + slot permissions | Capability attenuation |
| Hard links | Multiple keys to same object | Multiple refs to same hash |
| Symbolic links | Forwarder keys | Redirect capabilities |
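The "path traversal as a `Namespace.resolve()` chain" row can be made concrete with a short sketch. The `Namespace` interface here is an assumption based on the table, reduced to a dictionary for illustration:

```python
# Sketch of path traversal as a chain of Namespace.resolve() calls, the
# capOS analogue of walking directory inodes in Unix or node trees in EROS.

class Namespace:
    def __init__(self):
        self.bindings = {}              # name -> capability (object or Namespace)

    def bind(self, name, cap):
        self.bindings[name] = cap

    def resolve(self, name):
        return self.bindings[name]      # raises KeyError if unbound

def resolve_path(root, path):
    # "a/b/c" becomes root.resolve("a").resolve("b").resolve("c"):
    # each step can only reach what the previous capability grants.
    cap = root
    for component in path.split("/"):
        cap = cap.resolve(component)
    return cap
```

Unlike a Unix path walk, there is no ambient root: resolution starts from whatever Namespace capability the caller holds, so attenuation falls out of the traversal itself.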
Journaling and Crash Consistency
EROS avoids journaling by using checkpoint-based consistency. capOS’s Store service needs its own consistency story:
Option A: Checkpoint-based (EROS-style)
- Store service maintains an in-memory cache of recent modifications.
- Periodically flushes a consistent snapshot to disk.
- On crash, recovers to last flush point.
- Simple but may lose recent writes.
Option B: Log-structured (modern)
- All writes go to an append-only log.
- A background compaction process builds indexed snapshots from the log.
- On crash, replay the log from the last snapshot.
- More complex but no data loss window.
Option C: Hybrid
- Capability metadata (the namespace bindings) uses a write-ahead log for crash consistency.
- Object data (capnp blobs in the content-addressed store) uses checkpoint-based consistency (losing a few blobs is tolerable; losing a namespace binding is not).
Option C is recommended for capOS: it provides strong consistency for the critical metadata while keeping the data path simple.
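Option C's metadata path can be sketched as a tiny write-ahead log for namespace bindings. The on-disk format below (JSON lines in a list) is purely illustrative; a real Store service would append capnp-encoded records to a file and fsync before applying:

```python
# Minimal sketch of Option C's write-ahead log for namespace bindings:
# append to the log first, then apply to the live table, so a crash at
# any point can be repaired by replaying the log in order.

import json

class NamespaceWAL:
    def __init__(self):
        self.log = []            # stands in for an append-only file
        self.bindings = {}       # in-memory namespace table

    def bind(self, name, object_id):
        # 1. The "write-ahead" step: durably record the intent...
        self.log.append(json.dumps({"op": "bind", "name": name, "oid": object_id}))
        # 2. ...then apply it to the live table.
        self.bindings[name] = object_id

    def recover(self):
        # After a crash, rebuild the table by replaying the log.
        rebuilt = {}
        for entry in self.log:
            rec = json.loads(entry)
            if rec["op"] == "bind":
                rebuilt[rec["name"]] = rec["oid"]
        self.bindings = rebuilt
```

Object blobs need no such log: a blob that misses a checkpoint is simply re-uploaded, whereas a lost binding would orphan a reachable object.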
6.4 Transparent vs Explicit Persistence: Tradeoffs
| Aspect | EROS Transparent | capOS Explicit | Hybrid |
|---|---|---|---|
| Application complexity | None (automatic) | High (must implement save/restore) | Medium (opt-in transparency) |
| Kernel complexity | Very high (checkpoint, COW, object store) | Low (just IPC and memory) | Medium (checkpoint capability) |
| Consistency | Strong (global checkpoint) | Weak (per-service) | Medium (coordinator) |
| Control | None (everything persists) | Full (choose what to save) | Selective |
| Performance | Checkpoint pauses | No pauses, explicit I/O cost | Configurable |
| Volatile state | Keeper mechanism handles | Service handles reconnection | Annotated capabilities |
| Debuggability | Hard (system is a black box) | Easy (state is explicit capnp) | Medium |
| Cap’n Proto fit | Neutral | Excellent (state = capnp) | Good |
Recommendation for capOS:
Start with explicit persistence (Phase 1 in the storage proposal) because:
- It’s dramatically simpler to implement.
- Cap’n Proto makes serialization nearly free anyway.
- It gives services control over what is persistent.
- It aligns with capOS’s existing Store/Namespace design.
- The kernel stays simple.
Then add opt-in kernel-assisted checkpointing (like the Checkpoint capability described above) for services that want transparent persistence. This gives the benefits of EROS’s model without forcing it on everything.
Never implement EROS’s fully transparent global persistence – the kernel complexity is enormous, the debugging experience is poor, and modern systems (with fast SSDs and capnp zero-copy serialization) don’t need it. The explicit model with good tooling is strictly better for a research OS.
6.5 Capability Revocation in capOS
EROS’s forwarder key model translates directly to capOS:
Epoch-based revocation:
Each capability includes a revocation epoch. The kernel (or capability manager) maintains a per-object epoch counter. On each invocation:
- Check that the capability’s epoch matches the object’s current epoch.
- If it doesn’t match, the capability has been revoked – return an error.
To revoke all capabilities to an object, increment the object’s epoch.
This is O(1) revocation (increment a counter) with O(1) check per invocation (compare two integers). It’s simpler than EROS’s forwarder mechanism and fits naturally into a capnp-serialized capability reference:
```capnp
struct CapRef {
  objectId    @0 :UInt64;
  epoch       @1 :UInt64;  # revocation epoch
  permissions @2 :UInt32;  # method bitmask
  interfaceId @3 :UInt64;  # type of the capability
}
```
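The epoch check itself is two dictionary operations and an integer compare. A minimal sketch, with the kernel's per-object epoch table reduced to a dict (`CapabilityManager` is an illustrative name, not a capOS API):

```python
# Sketch of epoch-based revocation: O(1) revoke (bump a counter) and
# O(1) validation (compare two integers), as described above. CapRef
# mirrors the capnp struct; interfaceId is omitted for brevity.

from dataclasses import dataclass

@dataclass(frozen=True)
class CapRef:
    object_id: int
    epoch: int
    permissions: int

class CapabilityManager:
    def __init__(self):
        self.epochs = {}                       # object_id -> current epoch

    def mint(self, object_id, permissions):
        epoch = self.epochs.setdefault(object_id, 0)
        return CapRef(object_id, epoch, permissions)

    def revoke_all(self, object_id):
        # One increment invalidates every outstanding CapRef to the object.
        self.epochs[object_id] = self.epochs.get(object_id, 0) + 1

    def check(self, cap):
        return self.epochs.get(cap.object_id) == cap.epoch
```

Note the granularity tradeoff: a single increment revokes every outstanding capability to the object. Selective revocation of one holder requires either per-holder objects or an EROS-style forwarder in front.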
Space bank analog:
capOS can implement EROS’s space bank pattern using the Store:
- Each “bank” is a Namespace prefix in the Store.
- Objects allocated by a service are stored under its namespace.
- Destroying the service’s namespace revokes access to all its objects.
- Resource accounting is done by the Store service (track bytes per namespace).
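The space-bank analogue can be sketched as a Store that keys objects by namespace prefix and accounts bytes per prefix. The `Store` API below is an assumption for illustration, not capOS's actual interface:

```python
# Sketch of the space-bank pattern via Store namespaces: every object a
# service allocates lives under its prefix, so destroying the prefix both
# revokes access to all its objects and reclaims the accounted bytes.

class Store:
    def __init__(self):
        self.objects = {}        # full path -> data bytes
        self.usage = {}          # namespace prefix -> byte count

    def put(self, namespace, name, data):
        self.objects[f"{namespace}/{name}"] = data
        self.usage[namespace] = self.usage.get(namespace, 0) + len(data)

    def destroy_namespace(self, namespace):
        # Dropping the prefix revokes everything under it in one step.
        prefix = namespace + "/"
        for path in [p for p in self.objects if p.startswith(prefix)]:
            del self.objects[path]
        return self.usage.pop(namespace, 0)    # bytes reclaimed
```

As with EROS space banks, destruction is hierarchical by construction: a service cannot hide allocations from its bank, because everything it stores is under its prefix.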
6.6 Summary of Recommendations
| EROS/CapROS/Coyotos Concept | capOS Recommendation |
|---|---|
| Single-level store | Don’t implement (too complex for research OS). Use Cap’n Proto zero-copy as a lightweight equivalent. |
| Checkpoint/restart | Phase 1: application-level (explicit capnp save/restore). Phase 2: Checkpoint capability for opt-in transparent persistence. |
| Persistent capabilities | Use capnp PersistentCapRef struct with objectId + epoch. Store capability graph in the Store service. |
| Capability revocation | Epoch-based revocation (increment counter, check on invocation). Simpler than EROS forwarders, same O(1) cost. |
| Space banks | Map to Store namespaces. Destroying a namespace reclaims all objects. |
| Keeper/fault handler | Map to capOS’s supervisor mechanism (service-architecture proposal). Supervisor receives fault notifications and can restart/repair. |
| GPTs (Coyotos) | Not needed – capOS uses hardware page tables directly. The sparse address-space idea remains relevant for future SharedBuffer/AddressRegion work beyond the current VirtualMemory cap. |
| Confinement | capOS already has the structural prerequisites (no ambient authority). Formal confinement proofs are a future research direction. |
| Device isolation | Already planned in capOS (device capabilities with MMIO/interrupt/DMA grants). CapROS validates this approach works in practice. |
Key References
- Shapiro, J. S., Smith, J. M., Farber, D. J. “EROS: A Fast Capability System.” Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP), 1999.
- Shapiro, J. S. “EROS: A Capability System.” PhD dissertation, University of Pennsylvania, 1999.
- Shapiro, J. S. & Weber, S. “Verifying the EROS Confinement Mechanism.” IEEE Symposium on Security and Privacy, 2000.
- Hardy, N. “The Confused Deputy.” ACM SIGOPS Operating Systems Review, 1988. (Motivates capability-based access control.)
- Hardy, N. “KeyKOS Architecture.” Operating Systems Review, 1985.
- Landau, C. R. “The Checkpoint Mechanism in KeyKOS.” Proceedings of the Second International Workshop on Object Orientation in Operating Systems, 1992.
- Shapiro, J. S. et al. “Coyotos Microkernel Specification.” Technical report, Johns Hopkins University, 2004-2008.
- Shapiro, J. S. et al. “BitC Language Specification.” Technical report, Johns Hopkins University, 2004-2008.
- Dennis, J. B. & Van Horn, E. C. “Programming Semantics for Multiprogrammed Computations.” Communications of the ACM, 1966. (Original capability concept.)
- Levy, H. M. “Capability-Based Computer Systems.” Digital Press, 1984. (Comprehensive survey of capability systems including CAP, Hydra, iAPX 432, StarOS.)