Proposal: Capability-Based Service Architecture

How capOS processes receive authority, compose into services, and expose layered capabilities — without a service manager daemon.

Problem

Traditional OSes grant processes ambient authority (file system, network, IPC namespaces) and then restrict it via sandboxing (seccomp, namespaces, AppArmor). Service managers like systemd handle dependencies, lifecycle, and resource limits through a central daemon with a massive configuration surface.

capOS inverts this: processes start with zero authority and receive only the capabilities they need. The capability graph implicitly encodes service dependencies, resource limits, and access control. No central daemon required.

Process Startup Model

A process receives its entire authority as a set of named capabilities at spawn time. There is no ambient authority to fall back on — if a capability wasn’t granted, the operation is impossible.

The child process sees its granted capabilities by name. It cannot discover or request capabilities it wasn’t given.
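A minimal sketch of this model, assuming a hypothetical in-memory CapSet keyed by name. The `grant` and `get` methods and the `Console`/`Timer` stand-in types are illustrative assumptions, not the capOS ABI. The point it shows is that a lookup can only return what the spawner put in.

```rust
use std::any::Any;
use std::collections::HashMap;

// Hypothetical stand-ins for real capability handles.
#[derive(Debug, PartialEq)]
pub struct Console(pub u32);
#[derive(Debug, PartialEq)]
pub struct Timer(pub u32);

/// A process's entire authority: a name -> capability map filled in by the
/// spawner. (Illustrative model only, not the capOS CapSet ABI.)
#[derive(Default)]
pub struct CapSet {
    caps: HashMap<String, Box<dyn Any>>,
}

impl CapSet {
    /// Spawner side: add one named grant while the set is being built.
    pub fn grant<T: Any>(mut self, name: &str, cap: T) -> Self {
        self.caps.insert(name.to_string(), Box::new(cap));
        self
    }

    /// Child side: look up a grant by name and expected type.
    /// Returns None if the name was never granted or holds another type;
    /// nothing here can discover or create additional authority.
    pub fn get<T: Any>(&self, name: &str) -> Option<&T> {
        self.caps.get(name)?.downcast_ref::<T>()
    }
}
```

A child built with only a "console" grant gets `Some` for `get::<Console>("console")` and `None` for any other name or type; there is no call it could make to widen the set.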

Capability Layering

Each process consumes lower-level capabilities and exports higher-level ones. Authority narrows at every layer:

Kernel
  │
  ├─ Nic cap (raw frame send/receive for one device)
  ├─ Timer cap (monotonic clock)
  ├─ DeviceMmio cap (one device's BAR regions)
  └─ Interrupt cap (one IRQ line)
       │
       v
NIC Driver Process
  │
  └─ Nic cap ──> Network Stack Process
                   │
                   ├─ TcpSocket cap (one connection)
                   ├─ UdpSocket cap (one socket)
                   └─ NetworkManager cap (create sockets)
                        │
                        v
                   HTTP Service Process
                     │
                     ├─ Fetch cap (any URL)
                     │    │
                     │    v
                     │  Trusted Process (holds Fetch, mints scoped caps)
                     │
                     └─ HttpEndpoint cap (one origin)
                          │
                          v
                     Application Process

The application at the bottom holds an HttpEndpoint cap scoped to a single origin. It cannot make raw TCP connections, send arbitrary packets, or touch any device. The capability is the security policy.

HTTP Capabilities

Two levels of HTTP capability: Fetch (general) and HttpEndpoint (scoped). HttpEndpoint is implemented by a process that holds a Fetch cap and restricts it.

Fetch

Unrestricted HTTP access — equivalent to the browser Fetch API. The holder can make requests to any URL. This is the base capability that HTTP service processes use internally.

interface Fetch {
    # General-purpose HTTP request to any URL.
    request @0 (url :Text, method :Text, headers :List(Header), body :Data)
        -> (status :UInt16, headers :List(Header), body :Data);
}

struct Header {
    name @0 :Text;
    value @1 :Text;
}

Fetch is powerful — granting it is roughly equivalent to granting arbitrary outbound network access. It should only be held by service processes that need to make requests on behalf of others, not by application code directly.

HttpEndpoint

A restricted view of Fetch, scoped to a single origin. The holder can only make requests within the bounds encoded in the capability.

interface HttpEndpoint {
    # Request scoped to this endpoint's origin.
    # Path is relative (e.g., "/v1/users").
    request @0 (method :Text, path :Text, headers :List(Header), body :Data)
        -> (status :UInt16, headers :List(Header), body :Data);
}

Note: same request() signature as Fetch, but path instead of url. The origin is implicit — bound into the capability at mint time.

Attenuation

A process holding Fetch mints HttpEndpoint caps by narrowing authority. The core restriction is always the origin: Fetch can reach any URL, while HttpEndpoint is locked to one origin. Additional constraints (path prefixes, method restrictions, rate limits) are possible, but they are userspace policy details, not OS-level concerns.

This is the standard object-capability attenuation pattern: same interface, less authority. The application code is identical whether it holds a broad or narrow HttpEndpoint.
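The attenuation step can be sketched in plain Rust, assuming simplified trait and struct shapes. Headers are omitted, `EchoFetch` is a hypothetical test double, and the 403 returned for a refused path is an arbitrary choice for the sketch. None of this is the capOS schema or generated bindings.

```rust
/// Response shape shared by both capability levels in this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Response {
    pub status: u16,
    pub body: Vec<u8>,
}

/// Broad capability: request any URL.
pub trait Fetch {
    fn request(&self, url: &str, method: &str, body: &[u8]) -> Response;
}

/// Narrow capability minted from a Fetch; the origin is fixed at mint time.
pub struct HttpEndpoint<F: Fetch> {
    fetch: F,
    origin: String,
}

impl<F: Fetch> HttpEndpoint<F> {
    pub fn mint(fetch: F, origin: &str) -> Self {
        HttpEndpoint {
            fetch,
            origin: origin.trim_end_matches('/').to_string(),
        }
    }

    /// Callers supply only a path. A path without a leading '/' is refused,
    /// because a value such as "@evil.example/" would otherwise turn the
    /// bound origin into URL userinfo and redirect the request elsewhere.
    pub fn request(&self, method: &str, path: &str, body: &[u8]) -> Response {
        if !path.starts_with('/') {
            return Response { status: 403, body: Vec::new() };
        }
        let url = format!("{}{}", self.origin, path);
        self.fetch.request(&url, method, body)
    }
}

/// Hypothetical test double that echoes the URL it was asked to fetch.
pub struct EchoFetch;

impl Fetch for EchoFetch {
    fn request(&self, url: &str, _method: &str, _body: &[u8]) -> Response {
        Response { status: 200, body: url.as_bytes().to_vec() }
    }
}
```

The holder of the endpoint never sees the underlying Fetch, so the only URLs it can produce are the bound origin followed by a path it supplies.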

Boot and Initialization Sequence

The kernel doesn’t know about services. It boots, creates a handful of kernel-provided caps, and spawns exactly one process: init. Everything else is init’s responsibility.

Current State vs Target State

The implementation has crossed the single-init startup milestone and the 15.4 schema split. SystemManifest now carries schemaVersion, binaries, initConfig, and kernelParams. The Cap’n Proto schema no longer exposes ServiceEntry, ServiceCapSource, CapRef, exports, or restart policy as kernel-consumed fields. Those service-graph concepts remain as Rust parsing types inside capos-config because the focused init executor still interprets initConfig.services.

Each process now also carries an immutable session context produced at spawn time by kernel/src/session_context.rs; default inheritance comes from the parent’s session context, and a broker can select a child session through the AuthorityBroker/UserSession path. This invocation context is the basis for session-scoped audit attribution and identity-policy enforcement; see User Identity and Policy and make run-session-context for the one-session-per-process proof.

Current manifests put the first process description at initConfig.init. The default system.cue manifest now boots the separate init binary with BootPackage and ProcessSpawner; that init process reads initConfig.services and starts the shell, remote-session CapSet gateway, chat server, and resident demo services. Focused shell-led manifests such as system-smoke.cue and system-shell.cue still boot capos-shell as the lone init process for narrow login/shell proofs. Focused init-executor manifests such as system-spawn.cue, system-chat.cue, and system-adventure.cue boot the separate init binary with BootPackage and ProcessSpawner; that init process reads initConfig.services and resolves the remaining service graph through ProcessSpawner. Other focused single-service or harness manifests still boot a demo/service binary as the init process for narrow proofs. The kernel validates only the kernel-owned boot boundary: schema version, binaries, kernelParams, initConfig.init.binary, and kernel-sourced initConfig.init.caps.

Current Bootstrap Ownership Inventory

As of 2026-05-13, the repo is in the schema-split init-owned startup state:

schema/capos.capnp defines SystemManifest as schemaVersion, binaries, initConfig, and kernelParams. Service graph fields are not Cap’n Proto schema fields.
capos-config/src/manifest.rs still defines ServiceEntry, CapRef, CapSource::Kernel, CapSource::Service, and RestartPolicy as internal Rust types for parsing initConfig.services.
tools/mkmanifest still embeds every declared binary into the manifest and validates the full init-owned graph before writing manifest.bin.
capos-config/src/validation.rs separates kernel bootstrap validation from init graph validation. Kernel bootstrap validation covers binary names, initConfig.init.binary, init kernel cap sources, and kernelParams. Full graph validation covers initConfig.services for mkmanifest and init’s metadata-only ManifestBootstrapPlan path.
kernel/src/main.rs::run_init reads the Limine manifest module, validates the kernel-owned bootstrap contract, configures serial policy from kernelParams, and loads only initConfig.init.binary.
kernel/src/cap/mod.rs::create_boot_service_caps builds only initConfig.init.caps. Those caps are kernel-sourced by type, so the kernel has no CapSource::Service branch.
The init cap bundle is currently described by initConfig.init.caps. In the default system.cue manifest this grants the separate init binary the bootstrap caps it needs to read BootPackage and spawn the service graph. In focused shell-led manifests such as system-smoke.cue, this still grants capos-shell terminal, credential, session, audit, and broker capabilities directly. In focused single-service or harness manifests, initConfig.init.caps grants only the capabilities the harness itself needs.
BootPackage exposes the full serialized manifest bytes to init. That path is live for default and focused init-executor manifests. Focused shell-led manifests do not grant BootPackage to capos-shell.
ProcessSpawner owns the embedded binary set. It receives the boot manifest bytes so delegated ProcessSpawner grants can preserve that same boot package context; child BootPackage caps are not minted from SpawnGrantSource::Kernel. ProcessSpawner.createPipe(bufferBytes) mints a bounded SPSC kernel Pipe capability used by the POSIX adapter Phase P1.3 recording-shim fork-for-exec path; see POSIX Adapter §Phase P1.3 and Userspace Binaries Part 4.
ProcessSpawner.spawn resolves SpawnGrantSource::Kernel for the bounded manager-issued DDF authority surfaces (DeviceMmio, DMAPool, Interrupt, HardwareAuditLog) through the matching grant-source records in kernel/src/cap/devicemmio_grant_source.rs, kernel/src/cap/dmapool_grant_source.rs, and their interrupt/audit peers. Each grant attaches a fresh manager-owned record, validates owner/quiesce/scrub state for DMA-side caps, and returns a child-local handle without sharing the parent’s owner object. See device-driver-foundation.md Task 5 for the bounded-authority scope and the focused make run-devicemmio-grant, make run-dmapool-grant, make run-interrupt-grant, and make run-hardware-audit smokes.
init/src/main.rs is the focused BootPackage executor. When that binary is the init process, it reads the BootPackage manifest, builds a ManifestBootstrapPlan, validates it again, discovers its own kernel grants from initConfig.init.caps plus the CapSet, preflights the initConfig.services graph, resolves kernel and service cap sources, records exports, spawns children through ProcessSpawner, and waits on their ProcessHandles.
system.cue, system-smoke.cue, system-spawn.cue, system-chat.cue, system-adventure.cue, and the other focused manifests now express their first-process bundle under initConfig.init and any child topology under initConfig.services.

The practical cleanup boundary is therefore not “move service startup to init”; that already happened. The current cleanup target is narrower: the kernel no longer understands the service graph as a bootstrap authority structure. The remaining future cleanup is to stop letting focused harnesses choose arbitrary init binaries and direct kernel cap bundles, then move to one fixed generic-init ABI.

Narrowed Transitional Contract

The current schema is schemaVersion, binaries, initConfig, and kernelParams. The narrowed kernel contract is:

The kernel validates schemaVersion, parses kernelParams for kernel-consumed boot policy, and configures serial policy.
The kernel resolves only initConfig.init.binary against binaries and loads only that ELF.
The kernel may interpret initConfig.init.caps only as the bootstrap cap bundle for the single first process. Those caps must be kernel-sourced; a service-sourced cap in initConfig.init.caps is invalid because no non-init service exists at kernel handoff time.
initConfig.services[*], their caps, exports, restart, and any CapSource::Service references are init-owned configuration while the transitional Rust parser exists. mkmanifest and init continue validating them for smoke coverage, but kernel bootstrap does not run the multi-service graph validator or a service export resolver.
Focused harness manifests that intentionally boot a demo/service binary as init stay valid during this slice. Their harness-specific caps are still described by initConfig.init.caps until those smokes are migrated behind a generic init-owned executor config.

Kernel bootstrap implements this contract with a first-service cap-table builder. That builder covers only implemented kernel sources used by current initConfig.init.caps lists. That current first-service surface is wider than the eventual generic-init minimum: the default init-owned path needs Console, TerminalSession, CredentialStore, SessionManager, AuditLog, AuthorityBroker, BootPackage, ProcessSpawner, listener, launcher, and chat endpoint authority so it can launch the current service graph; focused shell-led paths still need TerminalSession, CredentialStore, SessionManager, AuditLog, and AuthorityBroker directly; focused harnesses need their own direct kernel caps. Cross-service export lookup, service-source attenuation, and non-init cap-resolution policy stay in init/src/main.rs for the focused BootPackage-executor manifests.
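The kernel-side half of this contract can be sketched as a small validator. The types below are simplified stand-ins invented for illustration (the real ones live in capos-config and are not reproduced here), and the sketch ignores the embedded-init case described under "Init Binary Embedding".

```rust
/// Simplified stand-in for the CapSource parsing type named above.
#[derive(Debug, Clone, PartialEq)]
pub enum CapSource {
    Kernel { cap_type: String },
    Service { service: String, export: String },
}

/// Hypothetical view of initConfig.init: one binary plus its cap bundle.
pub struct InitSpec {
    pub binary: String,
    pub caps: Vec<CapSource>,
}

#[derive(Debug, PartialEq)]
pub enum BootError {
    UnknownInitBinary(String),
    ServiceSourcedInitCap { service: String, export: String },
}

/// Kernel-side check: only the first process's binary and caps are examined.
/// initConfig.services is deliberately not an input here.
pub fn validate_kernel_boot(binaries: &[&str], init: &InitSpec) -> Result<(), BootError> {
    if !binaries.contains(&init.binary.as_str()) {
        return Err(BootError::UnknownInitBinary(init.binary.clone()));
    }
    for cap in &init.caps {
        if let CapSource::Service { service, export } = cap {
            // No non-init service exists at kernel handoff time.
            return Err(BootError::ServiceSourcedInitCap {
                service: service.clone(),
                export: export.clone(),
            });
        }
    }
    Ok(())
}
```

Because the service list never reaches this function, a malformed service graph cannot fail kernel bootstrap in this model; it can only fail mkmanifest or init.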

Target Boot Package Contract

After the harness migration, SystemManifest should keep the same outer shape but initConfig.init should stop being a per-manifest kernel bootstrap bundle. At that point:

ServiceEntry, CapRef, CapSource::Service, service exports, and restart policy remain ordinary data inside initConfig, interpreted and validated by init or a supervisor service.
Kernel validation is limited to the schema version, kernel parameters, boot-package integrity/measurement policy, and enough binary metadata to load the one init image.
The first process is the generic init/supervisor, not a demo harness or shell. Shell-led and focused single-service proofs should become init-owned configurations rather than alternate kernel bootstrap contracts.
The currently implemented generic-init bundle starts with Console, BootPackage, and ProcessSpawner. This is a transitional minimum, not the durable generic-init contract and not the full transitional initConfig.init.caps surface. The architecture-level target replaces the boot-wide binary registry with read-only bootstrap metadata plus root executable-image sealing and process-construction authority. It also includes Timer, DeviceManager, FrameAllocator, and per-process VirtualMemory once those authorities are ready to be part of init’s stable bootstrap ABI. Until then, FrameAllocator, VirtualMemory, and Endpoint grants for child processes remain minted through ProcessSpawner spawn grants.

The target model removes the kernel-side service graph entirely. The manifest stops being a kernel authority graph and becomes bootstrap input delivered to init:

Canonical init remains kernel-embedded. If hardware or recovery constraints require more code before the installed generation can be mounted, that code is a fixed, measured recovery bundle rather than an enumerable application binary namespace.
Init’s bootstrap config identifies how to discover, authenticate, and activate the installed system generation. Service-role topology, launch plans, and rollback policy come from that generation after validation.
Kernel boot parameters (memory limits, feature flags) remain kernel-consumed inputs rather than service-launch authority.

The current embedded/ISO binary list and ProcessSpawner(binaryName, grants) remain transitional compatibility mechanisms. The durable executable and launch boundary is specified by Executable Images and Service Launch Authority: exact immutable executable-image capabilities feed role-specific launch plans, while package selection and service lifecycle remain userspace policy.

The kernel spawns exactly one userspace process (init) with a fixed cap bundle:

Console — kernel serial wrapper (may be replaced later by a userspace log service, with init retaining a direct console cap for emergency use).
root executable-image sealing and process-construction authority — currently represented by the broad ProcessSpawner compatibility constructor; init retains the root mechanism while delegating only role-specific launch plans;
FrameAllocator — physical frame authority for init’s own allocations.
VirtualMemory — per-process address-space authority for init.
DeviceManager — enumerate/claim devices; init delegates device-specific slices to drivers.
Timer — monotonic clock.
read-only bootstrap metadata — currently represented by BootPackage; the durable form identifies installed-generation discovery and recovery inputs, not an enumerable application binary namespace.

Everything else — drivers, net-stack, filesystems, supervisors, apps — is launched from the authenticated installed generation through role-specific launch authority. No manifest ServiceEntry, cross-service CapRef, manifest export, or boot-wide application binary selector remains in the durable model.

Pre-Init Boundary After Stage 6

Rule of thumb: no userspace service runs before init. The kernel’s job is primitive cap synthesis and a single-process handoff; init’s job is the whole service graph. Concretely, after Stage 6:

Stays in kernel pre-init: memory map ingest, frame allocator, heap, paging, GDT/IDT/TSS, serial for kernel diagnostics, scheduler, ring dispatch, kernel-cap CapObject impls, ELF loading for init, boot package measurement (if attested boot is added).
Stays in bootstrap input: init/recovery measurement metadata, init’s installed-generation discovery config, and kernel boot parameters. The current binaries list remains only until ordinary services and focused proofs migrate to executable-image launch authority.
Moves to init: service topology, cross-service cap wiring, attenuation, restart policies, dynamic spawn, cap export/import, supervision trees. Anything a service manager would do.
Moves to init or later services: logging policy, config store, secrets, filesystem mounts, network configuration, device binding.

Edge cases that might look like they want a pre-init service but don’t:

Early crash / panic handling. Kernel-side panic handler, no service needed.
Recovery shell. If init fails to reach a healthy state within a timeout, recovery comes from the fixed measured recovery floor, not the normal system package namespace. Still just one userspace process at a time before the supervisor loop.
Attested/measured boot. The kernel measures canonical init and any fixed recovery bundle before handing sealed measurements to init. A userspace package authority authenticates the installed generation; a measurement agent, if any, runs as a normal service with access to the sealed boot measurements.
Early-boot console. Kernel owns serial and exposes Console to init. A userspace log service can layer on top later; it is not pre-init.

Legacy Manifest Fields After Stage 6

ServiceEntry.caps, CapSource::Service, and ServiceEntry.exports are transitional init configuration, not kernel schema. The 15.4 schema split deleted them from schema/capos.capnp, collapsed the service graph into initConfig: CueValue, and kept kernel bootstrap on the first-service cap-table builder. The remaining cleanup is to make that first-service bundle fixed rather than manifest-selected:

1. Move shell-led and focused harness proofs behind an init-owned executor config instead of booting their binaries directly as init.
2. Embed or otherwise pin the generic init image as the only kernel-loaded userspace image. Partially landed (2026-05-25 23:26 UTC): the init image is embedded and loaded from kernel::boot::INIT_ELF whenever init.binary == "init" (see “Init Binary Embedding”). It is not yet the only kernel-loaded image: until step 1 moves the focused/shell proofs behind an init-owned executor, non-"init" PID-1 selectors are still kernel-loaded from binaries.
3. Replace per-manifest initConfig.init.caps with the fixed bootstrap cap bundle described above. Keep BootPackage only as the compatibility carrier until its general binary list is replaced by bootstrap metadata and the recovery bundle.
4. Keep initConfig.services as ordinary init/supervisor configuration until a later libcapos or supervisor API gives it a more concrete format.

The re-export restriction added in capos-config::validate_manifest_graph (service A exports cap sourced from B.ep) becomes moot at that point because there are no kernel-owned manifest exports at all. It stays as defensive validation for initConfig.services while the transitional init-owned executor exists.

Init Binary Embedding

Status: landed 2026-05-25 23:26 UTC as a hybrid keyed on the reserved init selector (see below). Init is part of the kernel’s bootstrap contract, not a configuration choice: the cap bundle handed to init is a kernel ABI, the _start(ring, pid, …) entry shape is a kernel ABI, and a version-mismatched init is a footgun with no payoff in a single-init research OS. So the init ELF ships inside the kernel binary via include_bytes!, not as a separate manifest entry or Limine module.

Shape (as landed):

init/ stays a standalone crate with its own linker script and code model (user-space base 0x200000, static relocation model, 4 KiB alignment). Not a workspace member; different build flags than the kernel.
kernel/build.rs reads the prebuilt init/ artifact (the Makefile passes CAPOS_INIT_ELF and orders init before the kernel; a conventional-path fallback covers a bare cargo build after init is built) and emits an include_bytes!("…") into a kernel::boot::INIT_ELF: &[u8] static. Driving init’s build from build.rs was rejected to avoid duplicating its custom target/code-model flags; failing closed on a missing artifact is the chosen behavior.
initConfig.init.binary is a generic “which binary is PID 1” selector, so embedding is keyed on the reserved name capos_config::RESERVED_INIT_BINARY_NAME ("init"). When init.binary == "init", kernel bootstrap parses INIT_ELF through the same capos_lib::elf path used for service binaries, creates the init address space via AddressSpace::new_user(), loads segments, populates the cap bundle (including BootPackage), and jumps — no Limine module lookup and no binaries resolution for that identity. When init.binary names any other binary (the shell on run-smoke, the ~70 focused test-as-PID-1 manifests), PID 1 still resolves from SystemManifest.binaries exactly as before.
The reserved name "init" must not appear in SystemManifest.binaries: manifest validation (capos-config and mkmanifest) rejects it, since the kernel owns the init image. Real-init manifests drop their init entry; their binaries list is services-only.
The embedded image is the canonical init binary, so init’s own child spawns that reference init by name (e.g. system-spawn.cue’s spawn-hardening fixtures) still resolve: when init is embedded, run_init injects the embedded bytes into the ProcessSpawner binary set under the reserved name (the BootPackage cap serves only the serialized manifest bytes, which never carry the reserved entry). This keeps the spawnable set identical to the pre-embedding state without init re-entering the serialized manifest. Service binaries remain distinct BootPackage blobs.
Measured-boot attestation (if added) covers the kernel ELF, which transitively covers init’s bytes. Service binaries are hashed separately by the kernel before handing BootPackage to init.
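The selector rule above can be sketched as a pure function. The constant mirrors the reserved name from the text; `Pid1Image`, `ResolveError`, and `resolve_pid1` are hypothetical names for illustration, and folding the reserved-name validation into the resolver is a simplification (the text places it in capos-config and mkmanifest).

```rust
/// Mirrors capos_config::RESERVED_INIT_BINARY_NAME from the text above.
pub const RESERVED_INIT_BINARY_NAME: &str = "init";

/// Where the PID 1 image bytes came from (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Pid1Image<'a> {
    Embedded(&'a [u8]),
    FromBinaries(&'a [u8]),
}

#[derive(Debug, PartialEq)]
pub enum ResolveError {
    ReservedNameInBinaries,
    UnknownBinary(String),
}

pub fn resolve_pid1<'a>(
    selector: &str,
    embedded_init: &'a [u8],
    binaries: &[(&str, &'a [u8])],
) -> Result<Pid1Image<'a>, ResolveError> {
    // The kernel owns the init image, so the manifest may not also carry one.
    if binaries.iter().any(|(name, _)| *name == RESERVED_INIT_BINARY_NAME) {
        return Err(ResolveError::ReservedNameInBinaries);
    }
    // Reserved selector: use the embedded bytes, no binaries lookup.
    if selector == RESERVED_INIT_BINARY_NAME {
        return Ok(Pid1Image::Embedded(embedded_init));
    }
    // Any other selector still resolves from the manifest binary list.
    binaries
        .iter()
        .find(|(name, _)| *name == selector)
        .map(|(_, bytes)| Pid1Image::FromBinaries(*bytes))
        .ok_or_else(|| ResolveError::UnknownBinary(selector.to_string()))
}
```

Either branch hands the same kind of byte slice to the ELF loader, which matches the point that embedding changes where the bytes come from and nothing else.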

What this does not change:

Init still runs in Ring 3 with its own page tables; embedding is byte packaging, not privilege merging.
Init is still ELF-parsed at boot — the same loader and W^X enforcement apply. The only thing different is where the bytes came from.
Service binaries (everything spawned after init) stay in the boot package as distinct blobs, exposed to init via BootPackage. They are not linked into the kernel; their lifecycle is independent of the kernel’s.

Rejected alternative: fully linking init into the kernel crate (shared compilation unit, shared text). That collapses the kernel/user build boundary, couples linker scripts and code models, and puts init’s panics/UB inside the kernel’s compilation context. The process-isolation boundary survives that arrangement, but the build-time separation that makes the boundary trustworthy does not. include_bytes! preserves the separation; static linking destroys it.

flowchart TD
    Kernel[Kernel boot] --> Init[Spawn canonical init with bootstrap caps]
    Init --> Generation[Authenticate one SystemGeneration root]
    Generation --> Plans[Derive exact service revisions and launch plans]
    Plans --> Drivers[Launch hardware drivers through typed binding slots]
    Drivers --> Network[Launch network services after driver readiness]
    Network --> Platform[Launch higher-level platform services]
    Platform --> Apps[Delegate role-specific application launch plans]

The Init Process in Detail

Init is a regular userspace process with privileged caps. It retains root executable-image sealing and process-construction authority, plus DeviceManager, inside its trusted launch runtime. It does not delegate that constructor. After authenticating one complete SystemGeneration, it derives role-specific launch plans and delegates those plans or their controller facets to child supervisors.

The following code is conceptual target API, not implemented Rust or frozen schema. Its typed binding structures stand for launch-plan-declared dynamic slots; callers do not supply binary names or arbitrary grant lists.

fn main(caps: CapSet) {
    let launch_runtime = LaunchRuntime::new(
        caps.get::<ExecutableImageFactory>("image-factory"),
        caps.get::<ProcessConstructor>("process-constructor"),
    );
    let devices = caps.get::<DeviceManager>("devices");
    let generation = authenticate_active_generation(&caps)?;
    let services = launch_runtime.activate(generation)?;

    let nic_device = devices.find("virtio-net")
        .expect("no network device found");
    let mut nic_driver = services.nic_driver().launch(NicDriverBindings {
        device_mmio: nic_device.mmio(),
        interrupt: nic_device.interrupt(),
    })?;
    let nic: Cap<Nic> = nic_driver.ready_export()?;

    let mut net_stack = services.net_stack().launch(NetStackBindings { nic })?;
    let net_mgr: Cap<NetworkManager> = net_stack.ready_export()?;
    let tcp = net_mgr.create_tcp_pool();
    let mut http = services.http().launch(HttpBindings { tcp })?;
    let fetch: Cap<Fetch> = http.ready_export()?;

    let telemetry = services.telemetry().launch(TelemetryBindings {
        fetch: fetch.clone(),
    })?;

    let api_cap = fetch.attenuate(EndpointPolicy {
        origin: "https://api.example.com",
        paths: Some("/v1/users/*"),
        methods: Some(&["GET", "POST"]),
    });
    let app = services.my_service().launch(AppBindings { api: api_cap })?;

    supervisor_loop(&mut [nic_driver, net_stack, http, telemetry, app]);
}

Key Mechanisms

Cap export. The trusted launch-plan/controller implementation retains the raw process-management facets and exposes only revision-declared service exports. This is how the NIC driver makes its Nic cap available to the network stack without giving the init graph an arbitrary child cap-table manager.

Restart policy. Encoded in the authenticated ServiceRevision and enforced by its controller. When a child exits unexpectedly:

Old caps held by the child are automatically revoked (kernel invalidates the process’s cap table on exit)
The controller relaunches the same admitted revision under the same fixed plan and charged restart budget
New instance gets fresh caps — same authority, new identity

Dependency ordering. Sequential in code: ready_export() on a launched service blocks until the dependency is ready. No declarative dependency graph is needed; Rust’s control flow is the dependency graph.
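Blocking on readiness can be modeled with threads and channels from the standard library. This is only an analogy under stated assumptions: threads stand in for processes, a channel message stands in for a published export, and `Launched`, `launch`, and `boot_order` are hypothetical names for this sketch.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;
use std::time::Duration;

/// Handle to a launched "service" whose export arrives once it is ready.
pub struct Launched<T> {
    ready: Receiver<T>,
}

impl<T> Launched<T> {
    /// Blocks until the service publishes its export.
    /// Returns None if the service exited without publishing one.
    pub fn ready_export(&self) -> Option<T> {
        self.ready.recv().ok()
    }
}

/// Run `start` on its own thread; its return value is the export.
pub fn launch<T, F>(start: F) -> Launched<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = channel();
    thread::spawn(move || {
        let _ = tx.send(start());
    });
    Launched { ready: rx }
}

/// Launch two services where the second needs the first one's export.
pub fn boot_order() -> Vec<&'static str> {
    let mut order = Vec::new();

    // The "driver" takes a while to come up.
    let nic = launch(|| {
        thread::sleep(Duration::from_millis(20));
        "nic"
    });
    // Nothing below runs until the driver has published its export.
    let nic_cap = nic.ready_export().expect("nic driver exited early");
    order.push(nic_cap);

    // The "net stack" can only be launched with the export in hand.
    let net = launch(move || {
        let _nic = nic_cap;
        "net-stack"
    });
    order.push(net.ready_export().expect("net stack exited early"));
    order
}
```

The ordering falls out of data flow: the second launch takes the first export as an argument, so it cannot be written before the blocking call that produces it.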

Service Taxonomy

Concrete categories of userspace services capOS expects to run. All spawned by init (or a supervisor init delegates to) after Stage 6. None are pre-init.

Hardware Drivers

One process per managed device. Each holds exactly the caps for its own hardware: a DeviceMmio slice, the corresponding Interrupt cap, and optionally a DmaRegion cap carved out of the frame allocator. Exports a typed device cap (Nic, BlockDevice, Framebuffer, Gpu, …). Examples: virtio-net, virtio-blk, NVMe, AHCI, framebuffer/GPU.

Platform Services

Logger / journal — accepts Log cap writes, forwards to console and/or durable storage. Init and kernel bootstrap use a direct Console cap until the logger is up; afterwards new services get Log caps only.
Filesystem — one per mounted volume. Consumes a BlockDevice cap, exports Directory / File caps. FAT, ext4, overlay, tmpfs.
Store — capability-native content-addressed storage backing persistent capability state (storage-and-naming-proposal.md).
Network stack — userspace TCP/IP (networking-proposal.md). Consumes Nic + Timer, exports NetworkManager, TcpSocket, UdpSocket, TcpListener.
DNS resolver — consumes a UdpSocket, exports Resolver.
Config / secrets store — reads the configuration root selected by the authenticated SystemGeneration, then exposes runtime Config and Secret caps with per-key attenuation.
Cloud metadata agent — detects IMDS / ConfigDrive / SMBIOS on cloud boot and delivers a ManifestDelta (cloud-metadata-proposal.md).
Upgrade manager — orchestrates CapRetarget for live service replacement (live-upgrade-proposal.md).
Capability proxy — makes selected local caps reachable over the network. The near-term shape is typed Cap’n Proto RPC or a schema-framed proxy, following Cloudflare’s production pattern of schema-bundled Workers bindings to internal services; later remote-capability sessions can borrow Spritely/OCapN CapTP’s session, handoff, and reference-lifetime model without treating current OCapN drafts as capOS ABI commitments. The proxy must never serialize local CapId values, endpoint generations, receiver selectors, or kernel/session ids as portable authority, and it must own explicit resource ledgers for remote refs, queued calls, streams, and retries. See Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web and Spritely, OCapN, and CapTP.
Measurement / attestation agent — consumes sealed boot measurements and authenticated generation identity, then exposes Quote caps for remote attestation.

Supervisors

Per-subsystem restart managers hold the ServiceController facets and role-specific LaunchPlan capabilities for the subtree they own. They do not hold the root constructor. If any child crashes, the supervisor applies the revision’s restart policy and charged restart budget. Example: net-supervisor owns the controllers for the NIC driver, net-stack, and DHCP client roles.

Application Services

User-facing or user-spawned processes: HTTP servers, API gateways, worker pools, shells, interactive tools. Hold only the narrow caps the supervisor grants (HttpEndpoint for one origin, Directory for one mount, etc.). Human users, service accounts, guests, and anonymous callers are represented by session/profile services that grant scoped cap bundles; they are not kernel subjects or ambient process credentials. See User Identity and Policy.

What Does Not Become a Service

Console / serial — stays in the kernel as a CapObject wrapper. Small enough, needed for kernel diagnostics, no benefit from userspace isolation. A userspace log service can layer on top.
Frame allocator, virtual memory, scheduler, ring dispatch — kernel primitives, exposed as caps but not as services.
Interrupt delivery, DMA mapping — kernel mechanisms, exposed to drivers as caps.
Boot measurement — if added, happens in the kernel before userspace receives bootstrap metadata; the measurement agent only reports it.

Supervision

Supervision Tree

Init doesn’t have to supervise everything directly. It can delegate:

flowchart TD
    Init[init: root launch runtime] --> Net[net-supervisor: network role controllers]
    Init --> Apps[app-supervisor: application launch plans]
    Net --> Nic[virtio-net driver]
    Net --> Stack[net-stack]
    Net --> Http[http-service]
    Apps --> App1[my-service]
    Apps --> App2[another-app]

Each supervisor is a process that holds only the controllers, launch plans, and dynamic binding capabilities for its subtree. The root launch runtime serves those plans and retains the underlying constructor and child managers. If net-supervisor crashes, init replaces it and re-delegates the same bounded controller set; the admitted networking revisions remain fixed.

Supervisor Loop

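// Sketch against the proposal's ServiceController API. `RestartError` is a
// placeholder for whatever error type `restart()` returns.
fn supervisor_loop(children: &mut [ServiceController]) -> Result<(), RestartError> {
    loop {
        // Block until any supervised child exits.
        let (index, exit_code) = wait_any(children);
        let child = &mut children[index];

        match child.revision().restart_policy() {
            RestartPolicy::Always => {
                child.restart()?;
            }
            RestartPolicy::OnFailure if exit_code != 0 => {
                child.restart()?;
            }
            _ => {
                // Clean exit or a no-restart policy: leave the child stopped.
            }
        }
    }
}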
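The restart decision, including the charged restart budget mentioned under the per-subsystem restart managers, can be isolated as a pure function. The following is a minimal sketch under stated assumptions: `RestartBudget`, `should_restart`, the `Never` variant, and the count-based accounting are all hypothetical, since the proposal does not specify how the budget is represented or charged.

```rust
// Hypothetical names throughout; the proposal does not define these types.

/// Restart policy attached to an admitted service revision.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RestartPolicy {
    Always,
    OnFailure,
    Never, // assumed variant standing in for the loop's catch-all arm
}

/// A simple count-based budget (an assumption, not the capOS accounting model).
#[derive(Debug)]
pub struct RestartBudget {
    remaining: u32,
}

impl RestartBudget {
    pub fn new(restarts: u32) -> Self {
        Self { remaining: restarts }
    }

    /// Charge one restart; returns false once the budget is exhausted.
    pub fn charge(&mut self) -> bool {
        if self.remaining == 0 {
            return false;
        }
        self.remaining -= 1;
        true
    }
}

/// Decide whether to restart a child that exited with `exit_code`.
/// The budget is charged only when the policy asks for a restart.
pub fn should_restart(policy: RestartPolicy, exit_code: i64, budget: &mut RestartBudget) -> bool {
    let wanted = match policy {
        RestartPolicy::Always => true,
        RestartPolicy::OnFailure => exit_code != 0,
        RestartPolicy::Never => false,
    };
    wanted && budget.charge()
}
```

Keeping the decision separate from `wait_any` and `restart()` makes the policy testable without spawning processes. A supervisor that sees `false` from an exhausted budget would presumably escalate to its parent rather than loop.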

Socket Activation

systemd pre-creates a socket and passes the fd to the service on first connection. In capOS, the supervisor does the same with caps:

Eager (default): supervisor spawns the child immediately with a TcpListener cap. Child calls accept() and blocks.

Lazy: supervisor holds the TcpListener cap itself. On first incoming connection (or on first accept() from a proxy cap), it spawns the child and transfers the cap. The child code is identical in both cases.

// Lazy activation, as run by the supervisor.
let listener = net_mgr.create_tcp_listener();
listener.bind([0, 0, 0, 0], 8080);

// Block until the first incoming connection before spawning the child.
let _conn = listener.accept();

// `listener` is a declared dynamic slot; the plan fixes image, role,
// remaining grants, resource budget, and lifecycle policy.
web_server_plan.launch(WebServerBindings { listener })?;
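The difference between the eager and lazy modes is only who holds the listener cap before the first connection. A minimal sketch of that state machine follows, using Rust move semantics to stand in for cap transfer. `Listener`, `Child`, `Slot`, and `launch` are illustrative names and not capOS APIs.

```rust
// Illustrative model only: `Listener` stands in for a TcpListener cap.

#[derive(Debug, PartialEq, Eq)]
pub struct Listener {
    pub port: u16,
}

/// A launched child owns the listener it was given.
#[derive(Debug)]
pub struct Child {
    pub listener: Listener,
}

/// Child launch is the same in both modes: it just receives a listener.
pub fn launch(listener: Listener) -> Child {
    Child { listener }
}

pub enum Slot {
    /// Lazy mode before the first connection: the supervisor holds the cap.
    Held(Listener),
    /// The cap has been moved into a running child.
    Running(Child),
}

impl Slot {
    /// Eager: spawn immediately and hand over the listener.
    pub fn eager(listener: Listener) -> Self {
        Slot::Running(launch(listener))
    }

    /// Lazy: keep the listener until a connection arrives.
    pub fn lazy(listener: Listener) -> Self {
        Slot::Held(listener)
    }

    /// Called when a connection arrives; spawns the child on first use only.
    pub fn on_connection(self) -> Self {
        match self {
            Slot::Held(listener) => Slot::Running(launch(listener)),
            running => running,
        }
    }

    pub fn is_running(&self) -> bool {
        matches!(self, Slot::Running(_))
    }
}
```

Because `launch` takes the listener by value in both paths, the supervisor cannot keep using the cap after the transfer, which mirrors the move-style grant described above.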

Configuration

See Storage and Naming for the full storage, naming, and configuration model.

Summary: the system topology is currently defined in a capnp-encoded system manifest baked into the boot image. tools/mkmanifest compiles the human-authored system.cue, system-smoke.cue, or focused manifest sources such as system-spawn.cue, system-devicemmio-grant.cue, and system-wasi-random.cue into the binary manifest.

Default boot uses standalone init and init-owned service-graph execution; focused shell-led manifests still grant login/session/broker caps directly to capos-shell for narrow smokes. Focused init-executor manifests let the separate init binary validate and execute the manifest through ProcessSpawner; the old generic kernel resolver has been replaced by first-service cap construction.

Manifest-declared SpawnGrantSource::Kernel entries cover the bounded DDF authority surface (DeviceMmio, DMAPool, Interrupt, HardwareAuditLog) and the wasm-host’s optional EntropySource grant; the WASI host adapter (see WASI Host Adapter) and the POSIX adapter (see POSIX Adapter) both run as ordinary userspace processes spawned through this same path.

Remaining cleanup is to move runtime configuration into a capability-based store service once that service exists. See also the layered CUE configuration model in System Configuration and Operator Extensibility.

Comparison with Traditional Approaches

Concern	systemd/Linux	capOS
Service dependencies	`Wants=`, `After=`, `Requires=`	Implicit in cap graph
Sandboxing	seccomp, namespaces, AppArmor	Default: zero ambient authority
Socket activation	`ListenStream=`, fd passing protocol	Pass `TcpListener` cap
Restart policy	`Restart=on-failure`	Supervisor process loop
Logging	journald, `StandardOutput=journal`	`Log` cap in granted set
Resource limits	cgroups, `MemoryMax=`, `CPUQuota=`	Bounded allocator caps
Network access control	firewall rules (iptables/nftables)	Scoped `HttpEndpoint` / `TcpSocket` caps
Config format	INI-like unit files (~1500 directives)	Rust code or minimal manifest
Trusted computing base	systemd PID 1 (~1.4M lines)	Init process (hundreds of lines)

Current Compatibility Spawn Mechanism

The implemented bootstrap path is capability-gated through ProcessSpawner. Possession is explicit broad authority to select a boot-package binary, compose supported grants, create supported child-local kernel objects, and receive management facets. This section records current behavior; it is not the target delegation surface. The target keeps the broad constructor private to init’s launch runtime and delegates role-specific plans as defined in Executable Images and Service Launch Authority.

Implemented Kernel Slice

The kernel now provides:

ProcessSpawner capability — a CapObject impl in kernel/src/cap/process_spawner.rs. Methods:
- spawn(name, binaryName, grants) -> handleIndex — resolve a boot-package binary, load ELF, create address space (builds on existing elf.rs loader and AddressSpace::new_user() in mem/paging.rs), populate the initial cap table, schedule the process, and return the ProcessHandle through the ring result-cap list
- the returned ProcessHandle cap lets the parent wait for child exit in the first slice; exported caps and kill semantics are later lifecycle work
Initial cap passing — at spawn time, the kernel copies permitted parent cap references into the child’s cap table or mints authorized child-local kernel caps. Raw grants preserve the source legacy badge. Endpoint-client grants may mint a requested legacy badge only from an endpoint owner or trusted parent endpoint result source; delegated client facets must preserve their existing service identity. Child-local Endpoint, FrameAllocator, and VirtualMemory grants are created for the child’s process. Child-local endpoint grants return parent-side client facets as result caps instead of sharing the endpoint owner object. The parent’s references are unaffected. Legacy endpoint badges are transitional; new multi-client service identity should use session-bound invocation context plus broker-granted service roots/facets.
Cap export — future lifecycle work will let a child register a cap by name in its ProcessHandle, making it available to the parent (or anyone holding the handle). This is the mechanism behind nic_driver.exported("nic").wait() once exported-cap lookup is added.
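The raw-grant part of initial cap passing can be modeled in a few lines: each grant names a slot in the parent's table, the referenced entry is copied into the child's named set with its badge preserved, and the parent's table is left untouched. This is a sketch under stated assumptions, not the kernel's implementation. `CapEntry`, `Grant`, `GrantError`, and `build_child_caps` are hypothetical, and rejecting duplicate names is an assumed policy that the text above does not state.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for a cap table entry.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct CapEntry {
    pub interface_id: u64,
    pub badge: u64,
}

/// A raw grant: expose the parent's slot `cap_id` to the child as `name`.
pub struct Grant<'a> {
    pub name: &'a str,
    pub cap_id: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum GrantError {
    UnknownCap(u32),
    DuplicateName(String), // assumed policy, not stated in the proposal
}

/// Build the child's named cap set from raw grants.
/// The parent table is only borrowed, so its references are unaffected,
/// and an invalid grant yields an error instead of a partial table.
pub fn build_child_caps(
    parent: &BTreeMap<u32, CapEntry>,
    grants: &[Grant<'_>],
) -> Result<BTreeMap<String, CapEntry>, GrantError> {
    let mut child = BTreeMap::new();
    for g in grants {
        let entry = parent
            .get(&g.cap_id)
            .ok_or(GrantError::UnknownCap(g.cap_id))?;
        // Copy the entry, badge included, under the child-visible name.
        if child.insert(g.name.to_string(), entry.clone()).is_some() {
            return Err(GrantError::DuplicateName(g.name.to_string()));
        }
    }
    Ok(child)
}
```

Endpoint-client badge minting, child-local kernel objects, and move grants are deliberately left out; they would add further arms to the grant type rather than change this copy path.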

Schema

interface ProcessSpawner {
    spawn @0 (name :Text, binaryName :Text, grants :List(CapGrant)) -> (
        handleIndex :UInt16,
        capabilityManagerIndex :UInt16,
    );
    createPipe @1 (bufferBytes :UInt32) -> (readIndex :UInt16, writeIndex :UInt16);
}

struct CapGrant {
    name @0 :Text;
    capId @1 :UInt32;
    interfaceId @2 :UInt64;
    mode @3 :CapGrantMode;
    badge @4 :UInt64;
    source @5 :CapGrantSource;
}

struct CapGrantSource {
    union {
        capability @0 :Void;
        kernel @1 :KernelCapSource;
    }
}

enum CapGrantMode {
    raw @0;
    clientEndpoint @1;
    move @2;
    serviceObject @3;
}

interface ProcessHandle {
    wait @0 () -> (exitCode :Int64);
    terminate @1 () -> ();
}

This is the current compatibility schema. The target image-based constructor does not retain binaryName, ordinary callers do not author arbitrary CapGrantSource values, and pipe allocation moves to an independent PipeFactory capability.

Note on capability passing: Capabilities are referenced by cap table slot IDs (UInt32), not by Cap’n Proto’s native capability table mechanism. spawn() returns the ProcessHandle and a CapabilityManager cap through the ring result-cap list; handleIndex and capabilityManagerIndex identify those transferred caps in the completion. The first slice passes a boot-package binaryName instead of raw ELF bytes so the request stays within the bounded ring parameter buffer.

terminate (deferred kill) is implemented on ProcessHandle; CapabilityManager.grant now copies a live copy-transferable caller hold into its generation-bound child under session-scope and child resource-limit checks. Runtime exported-cap lookup remains future lifecycle work. capOS uses manual capnp dispatch (CapObject trait with raw message bytes, not capnp-rpc), so cap references are plain integers and typed result caps use the ring transfer-result metadata.

See Userspace Binaries Part 7 for the surrounding userspace bootstrap schema context. Part 4 covers the POSIX adapter surface that consumes ProcessSpawner.createPipe, along with the recording-shim fork-for-exec successor posix_spawn over the same Move-grant path. Part 5 covers the WASI host adapter, which runs as a userspace process spawned through this same ProcessSpawner with manifest-supplied capability grants (see WASI Host Adapter).

Relationship to Existing Code

The current kernel has these pieces in place:

ELF loading (kernel/src/elf.rs) — parses PT_LOAD segments, validates alignment, and feeds the reusable spawn primitive behind ProcessSpawner.
Address space creation (kernel/src/mem/paging.rs) — AddressSpace::new_user() creates isolated page tables with the kernel mapped in the upper half.
Cap table (kernel/src/cap/table.rs) — CapTable with insert(), get(), remove(), transfer preflight, provisional insert, commit, and rollback helpers. Each Process owns one local table.
Process struct and scheduler (kernel/src/process.rs, kernel/src/sched.rs) — a process table plus round-robin run queue are in place for both legacy manifest-spawned services and init-spawned children.

Generic capability transfer/release and the reusable ProcessSpawner lifecycle path are complete enough for the focused init-owned spawn executor. Default startup now uses standalone init for service-graph execution, while focused shell-led startup remains for narrow smokes.

ProcessSpawner.createPipe extends the lifecycle surface with a bounded SPSC kernel Pipe capability consumed by the POSIX adapter’s recording-shim fork-for-exec path (P1.3) and exposed as the posix_spawn successor on the same Move-grant path. The DDF Task 5 grant-source families (devicemmio_grant_source.rs, dmapool_grant_source.rs, and their interrupt/audit peers) extend SpawnGrantSource::Kernel with the bounded manager-issued DDF authority surface; production handle lifecycle, hardware-backed driver wait/ack dispatch beyond bounded route proofs, and the S.11.2 hostile-smoke gates remain open.

Each spawned process also receives one immutable session context (default-inherited from the parent or broker-selected), used as the invocation subject for audit attribution and the identity-policy boundary. Manager-mediated post-spawn copy grants are now implemented; remaining lifecycle gaps are runtime exported-cap lookup, restart supervision, and shrinking the transitional manifest schema. ProcessHandle.terminate (deferred kill) is implemented.

Prerequisites

Prerequisite	Status	Why
ELF loading + address spaces	Done (Stage 2-3)	`elf.rs`, `AddressSpace::new_user()`
Capability ring + cap_enter	Done (Stage 4/6 foundation)	Ring-based cap invocation with blocking waits
Scheduling + preemption (core)	Done (Stage 5)	Round-robin, PIT 100 Hz, context switch
Cross-process Endpoint IPC	Done (Stage 6 foundation)	CALL/RECV/RETURN routing through Endpoint objects
Generic cap transfer/release	Done (Stage 6, 2026-04-22/24)	Copy/move transfer, result-cap insertion, `CAP_OP_RELEASE`, epoch revocation, and revoked endpoint `Disconnected` error surface
ProcessSpawner + ProcessHandle + CapabilityManager	Done (Stage 6, 2026-07-18 13:39 UTC)	Init-driven spawn with grants, `wait`/`terminate`, generation-bound child `list`/`revoke`/post-spawn copy grant, hostile-input coverage, and audited grant outcomes
ProcessSpawner.createPipe + recording-shim fork-for-exec	Done (POSIX adapter P1.3, 2026-05-07 09:55 UTC)	Bounded SPSC `Pipe` capability and Move-grant fork-for-exec successor; see POSIX Adapter §Phase P1.3 and Userspace Binaries Part 4
DDF bootstrap-grant sources (`DeviceMmio`, `DMAPool`, `Interrupt`, `HardwareAuditLog`)	In progress (DDF Task 5)	Bounded manager-issued authority over `SpawnGrantSource::Kernel`; production handle lifecycle and S.11.2 hostile smokes remain open. See device-driver-foundation.md Task 5
Immutable per-process session context	Done (`kernel/src/session_context.rs`)	One session context per process, default-inherited or broker-selected; `make run-session-context` proof
Authority graph + quota design (Security Verification Track S.9)	Done (2026-04-21)	Defines transfer/spawn invariants, per-process quotas, and rollback rules; see `docs/authority-accounting-transfer-design.md`

This proposal describes the target architecture. Individual pieces (like Fetch/HttpEndpoint) are additive — they’re userspace processes that compose existing caps into higher-level ones. No kernel changes needed beyond Stages 4-6.

First Step After Transfer and ProcessSpawner — done 2026-04-23

The minimal demonstration of this architecture landed together with capability transfer and ProcessSpawner:

ProcessSpawner cap in kernel/src/cap/process_spawner.rs wraps ELF loading and address-space creation behind a typed capability.
Init spawns children — focused make run-spawn boots a single-init manifest; the kernel boots only the separate init binary from initConfig.init, then init spawns the focused demo graph from initConfig.services through ProcessSpawner, grants child-local endpoint owners and client facets, then releases parent endpoint facets before waiting on each ProcessHandle.
Cross-process cap invocation — spawned client invokes the server’s Endpoint cap, server replies, both print to console.

This exercises: spawn cap, initial cap passing, manifest-declared export recording, cross-process cap invocation, hostile-input rejection, and per-process resource exhaustion paths. Deleting the unused legacy kernel resolver is post-milestone cleanup tracked in loopyard.

Open Questions

Restart supervision. Epoch-based cap revocation and generation-tagged stale reference detection are implemented for current grant/revoke flows. Restart policy still needs a supervisor contract that epoch-bumps caps served by the failed process, restarts from the manifest, and reconnects clients through explicit authority rather than ambient service lookup.
Cap discovery. How does a process learn what caps it was given? Resolved: name→(cap_id, interface_id) mapping passed at spawn via a well-known page (CapSet). See Userspace Binaries Part 2. cap_id is the authority-bearing table handle. interface_id is the transported capnp TYPE_ID used by typed clients to check that the handle speaks the expected interface.
Lazy spawning. Should the init process start everything eagerly, or should caps be backed by lazy proxies that spawn the backing service on first invocation?
Cap persistence. If the system reboots, should the cap graph be reconstructable from saved state? Or is it always rebuilt from init code?
Delegation depth. Can an application further delegate its HttpEndpoint cap to a subprocess? If so, the HTTP gateway needs to support fan-out. If not, how is this restriction enforced?
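For the cap discovery item above, the name→(cap_id, interface_id) lookup with an interface check might look like the following sketch. The in-memory layout is an assumption: `CapSetEntry`, `LookupError`, and `lookup` are hypothetical names, and the real CapSet page format is the one defined in Userspace Binaries Part 2.

```rust
/// One CapSet entry as a typed client might see it (assumed shape).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CapSetEntry<'a> {
    pub name: &'a str,
    pub cap_id: u32,       // authority-bearing table handle
    pub interface_id: u64, // transported capnp TYPE_ID
}

#[derive(Debug, PartialEq, Eq)]
pub enum LookupError {
    /// No cap with this name was granted at spawn time.
    NotGranted,
    /// The handle exists but speaks a different interface.
    WrongInterface { expected: u64, found: u64 },
}

/// Resolve a granted cap by name and check it speaks the expected interface.
pub fn lookup(
    capset: &[CapSetEntry<'_>],
    name: &str,
    expected_interface: u64,
) -> Result<u32, LookupError> {
    let entry = capset
        .iter()
        .find(|e| e.name == name)
        .ok_or(LookupError::NotGranted)?;
    if entry.interface_id != expected_interface {
        return Err(LookupError::WrongInterface {
            expected: expected_interface,
            found: entry.interface_id,
        });
    }
    Ok(entry.cap_id)
}
```

The interface check is a type guard for the client, not an authority check: authority comes solely from holding `cap_id`, so a missing name means the operation is impossible rather than merely denied.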
