Proposal: OOM Handling and Swap

How capOS should behave under memory pressure, what “out of memory” means at different boundaries, and how optional swap support fits the capability model.

Problem

capOS already has several local out-of-memory paths:

  • boot-time allocation failures that are still fatal,
  • service-facing operations that return a controlled error,
  • rollback paths that free partially allocated state, and
  • hostile-path tests that exercise some frame-exhaustion cases.

What the tree does not have yet is a coherent memory-pressure policy. There is no system-wide answer to these questions:

  • When should an allocation fail immediately vs. trigger reclaim?
  • Which memory is reclaimable, swappable, or permanently pinned?
  • What outcome should a process observe when a page fault cannot be satisfied?
  • Who is allowed to decide that another process should die under memory pressure?

Without that policy, the codebase will drift into a mix of local conventions: some paths return Overloaded, some return interface-specific failure text, some remain boot-fatal, and future swap support would have no clear ownership or threat model.

Design Goals

  1. No ambient OOM killer. The kernel must not scan the system for an arbitrary victim and kill it Linux-style.
  2. Explicit accounting. Memory exhaustion must be understood in terms of budgets, commitments, and reclaimability, not just “the allocator returned None.”
  3. Typed failure semantics. Callers must be able to distinguish invalid requests, local budget exhaustion, transient pressure, and fatal page-fault failure.
  4. Fail closed. Memory-pressure code must not corrupt capability state, silently drop dirty data, or leave half-constructed kernel objects behind.
  5. Swap is optional. capOS must work without swap. Swap is a policy and deployment choice, not a baseline requirement.
  6. Security first. Swap must not become a secret-leak side channel or an integrity hole.

Non-Goals

  • Transparent global persistence in the EROS sense.
  • General-purpose overcommit as the default memory model.
  • Swapping kernel metadata, capability rings, CapSet pages, or DMA-pinned memory.
  • A userspace pager dependency in the first swap implementation.

Design Grounding

This proposal deliberately borrows from three existing design directions in the research set:

  • Genode: strict memory accounting and quota donation are the right default because they avoid an ambient OOM killer and make responsibility obvious.
  • seL4: explicit memory authority is preferable to a kernel that can create new backing objects out of thin air when under pressure.
  • EROS / CapROS / Coyotos: do not make implicit persistent backing store the baseline. capOS already chose explicit persistence and should not back into a single-level-store design through swap.

The result is not a copy of any of those systems. capOS keeps explicit capability-granted memory objects and ordinary page tables, but adopts the accounting discipline that makes OOM behavior reviewable.

Core Policy

1. No Overcommit by Default

The default rule is simple: a process may only create anonymous memory if the system can charge that commitment to a real budget.

That means:

  • anonymous VirtualMemory.commit and compatibility VirtualMemory.map consume committed-page budget,
  • anonymous VirtualMemory.reserve consumes virtual address-space quota only and does not promise physical backing,
  • resident pages consume real frame availability when they are instantiated,
  • swap, when enabled, extends commitment capacity only for memory classes that explicitly allow it,
  • and no interface may assume that a later background OOM killer will clean up a bad admission decision.

This follows the same principle as capability authority in general: if a child needs more memory, some parent or broker must have chosen to give it that room.
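The admission rule above can be sketched as a small accounting object. This is a minimal illustration, not capOS code: `Account`, `ChargeError`, and the method names are hypothetical, and the sketch assumes page-granular counters.

```rust
/// Hypothetical per-process accounting; names are illustrative, not capOS API.
#[derive(Debug, PartialEq, Eq)]
pub enum ChargeError {
    /// The request would exceed the committed-page budget.
    CommitBudgetExhausted,
    /// The request would exceed the virtual address-space quota.
    ReservationQuotaExhausted,
}

#[derive(Debug)]
pub struct Account {
    committed_max: u64,
    committed: u64,
    reserved_max: u64,
    reserved: u64,
}

impl Account {
    pub fn new(committed_max: u64, reserved_max: u64) -> Self {
        Account { committed_max, committed: 0, reserved_max, reserved: 0 }
    }

    /// Reserve address space only; no physical backing is promised.
    pub fn reserve(&mut self, pages: u64) -> Result<(), ChargeError> {
        let next = self
            .reserved
            .checked_add(pages)
            .filter(|n| *n <= self.reserved_max)
            .ok_or(ChargeError::ReservationQuotaExhausted)?;
        self.reserved = next;
        Ok(())
    }

    /// Commit pages. Admission is decided here, at call time, and a
    /// rejected request leaves the counters unchanged.
    pub fn commit(&mut self, pages: u64) -> Result<(), ChargeError> {
        let next = self
            .committed
            .checked_add(pages)
            .filter(|n| *n <= self.committed_max)
            .ok_or(ChargeError::CommitBudgetExhausted)?;
        self.committed = next;
        Ok(())
    }

    pub fn committed(&self) -> u64 {
        self.committed
    }
}
```

A large reservation succeeds without touching the commit budget, while a commit that would exceed the budget is rejected deterministically instead of being admitted and cleaned up later.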

2. The Kernel Never Picks a Random Victim

When memory is tight, the kernel may:

  • reclaim kernel-known clean caches,
  • free resources from already-dead processes,
  • swap out eligible anonymous pages,
  • reject a new allocation,
  • or terminate the faulting process when its own page cannot be restored.

What it must not do is kill an unrelated process just because it happens to be large. Cross-process eviction is a supervisor policy decision, not a kernel allocator side effect.

Supervisors remain free to implement their own policy. A shell/session broker or future service manager can decide to stop a child, reduce its budget, or restart it. That decision is explicit and auditable rather than hidden inside the low-level frame allocator.

3. Distinguish Four Memory Outcomes

capOS should treat these as different cases, not variants of one string:

| Situation | Required behavior |
| --- | --- |
| Invalid request (size=0, misaligned range, quota metadata malformed) | Deterministic failed / request validation error |
| Caller exhausted its allowed budget | Deterministic overloaded or typed outOfMemory result |
| Global pressure, but reclaim/swap may succeed | Reclaim first, then retry locally |
| Faulting page cannot be restored or committed | Terminate the faulting process with an explicit OOM exit reason |

The important distinction is between synchronous API failure and asynchronous execution failure. If a capability call asks for more memory, it should get an error back. If a process touches a swapped-out page and the system cannot bring it back, there is no capability return value to encode. That must be a process-lifecycle event.
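One way to keep the four cases from collapsing into a single string is to give each its own variant and make the delivery channel a function of that variant. The types below are hypothetical names chosen for illustration; they are not an existing capOS schema.

```rust
/// Hypothetical classification of the four memory outcomes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemoryOutcome {
    /// Malformed size, alignment, or quota metadata.
    InvalidRequest,
    /// The caller's own budget is exhausted.
    BudgetExhausted,
    /// The system is under pressure, but reclaim or swap may still help.
    GlobalPressure,
    /// A faulting page cannot be restored or committed.
    UnsatisfiedFault,
}

/// How the outcome reaches the affected party.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Delivery {
    /// Synchronous: returned to the caller of a capability call.
    ErrorReturn,
    /// Handled inside the kernel: reclaim, then retry locally.
    InternalRetry,
    /// Asynchronous: reported as a process-lifecycle event.
    ProcessExit,
}

pub fn delivery(outcome: MemoryOutcome) -> Delivery {
    match outcome {
        MemoryOutcome::InvalidRequest | MemoryOutcome::BudgetExhausted => Delivery::ErrorReturn,
        MemoryOutcome::GlobalPressure => Delivery::InternalRetry,
        MemoryOutcome::UnsatisfiedFault => Delivery::ProcessExit,
    }
}
```

Because the match is exhaustive, adding a fifth outcome later forces an explicit decision about how it is delivered.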

Memory Classes

The reclaim policy depends on what kind of memory is being discussed.

| Class | Examples | Reclaim policy |
| --- | --- | --- |
| Kernel-reserved, unswappable | kernel heap, page tables, scheduler/process metadata, cap-table backing, ring scratch | Never swap; pressure here is a kernel-capacity problem |
| User pinned, unswappable | capability ring page, CapSet page, DMA buffers, wired mappings, key material, future mlock-style regions | Never swap; allocation fails if unavailable |
| Reclaimable clean cache | boot-package cache, future filesystem cache, executable pages that can be reloaded, clean read-only object pages | Drop and refetch rather than swap |
| Anonymous private swappable | ordinary heap/stack/anonymous VM pages that opt into swap | Swap-eligible if policy allows it |
| Shared/persistent object pages | MemoryObject, mapped content-addressed store pages, future file-backed shared memory | Not part of phase-1 swap; treat as reclaim/drop or keep resident based on object semantics |

Two rules matter here:

  1. Clean cache is not swap. If a page can be reconstructed from a trusted backing object without preserving dirty state, reclaim it by dropping it.
  2. Pinned means pinned. If a page participates in DMA, capability transport, bootstrap identity, or secret handling, treat it as unswappable unless a later design proves otherwise.
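The class table and the two rules can be expressed as a single total function from class to reclaim action. This is a sketch under assumptions: `MemoryClass`, `ReclaimAction`, and `reclaim_action` are hypothetical names, and the `swap_allowed` flag stands in for whatever budget and policy check the real system would perform.

```rust
/// Hypothetical mirror of the memory classes in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemoryClass {
    KernelReserved,
    UserPinned,
    CleanCache,
    AnonymousSwappable,
    SharedObject,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReclaimAction {
    /// The page must stay resident; pressure surfaces as allocation failure.
    Never,
    /// Clean cache is not swap: drop the page and refetch it on demand.
    DropAndRefetch,
    /// Write the page to the swap area, then free the frame.
    SwapOut,
    /// Out of phase-1 swap scope; the owning object decides.
    DeferToObjectSemantics,
}

pub fn reclaim_action(class: MemoryClass, swap_allowed: bool) -> ReclaimAction {
    match class {
        // Pinned means pinned, for both kernel and user memory.
        MemoryClass::KernelReserved | MemoryClass::UserPinned => ReclaimAction::Never,
        MemoryClass::CleanCache => ReclaimAction::DropAndRefetch,
        MemoryClass::AnonymousSwappable if swap_allowed => ReclaimAction::SwapOut,
        // Without swap, anonymous pages have nowhere to go.
        MemoryClass::AnonymousSwappable => ReclaimAction::Never,
        MemoryClass::SharedObject => ReclaimAction::DeferToObjectSemantics,
    }
}
```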

DMA pages are a pinned residency class with additional lifecycle constraints: they must be committed before exposure to the device, resident for the entire device-visible lifetime, unswappable while mapped by a DMAPool or IOMMU domain, and scrubbed before release to another owner. Reclaim is not allowed to make progress on a DMA page; pressure must surface as admission failure or device-manager teardown.

Device-written DMA pages are untrusted input until validated by the owning driver or network/storage stack. Pinning and residency prevent reclaim races; they do not make device bytes trustworthy, nor do they grant ordinary MemoryObject authority over the backing frames.

Failure Semantics by Boundary

Capability Calls

For explicit allocation requests, return a structured failure rather than panicking:

  • VirtualMemory.commit and compatibility VirtualMemory.map should return overloaded or a typed OOM result when the request cannot be satisfied.
  • ProcessSpawner.spawn should continue the current direction: bounded parsing, fallible allocation, Overloaded on resource exhaustion.
  • Future interfaces where OOM is a normal domain outcome should prefer a typed union result rather than an exception string.

This is consistent with the existing error-handling proposal: temporary resource exhaustion is not the same thing as malformed input.

Page Faults

Page faults are different. A faulting instruction does not have a natural request/response channel. The policy should therefore be:

  1. attempt reclaim,
  2. attempt swap-out of another eligible page if that creates room,
  3. attempt swap-in or zero-fill for the requested page,
  4. if that still fails, terminate the faulting process with a typed exit reason such as outOfMemory.

That is not an ambient OOM killer. It is the equivalent of delivering an unrecoverable execution fault to the process whose own memory access could not be satisfied.
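The fault policy can be sketched as a bounded sequence of attempts. Everything here is illustrative: `PressureOps`, `FaultResolution`, and `FakeMachine` are hypothetical, and the sketch assumes one extra detail the text does not state, namely a fast path that tries to satisfy the fault before any reclaim work.

```rust
/// Hypothetical hooks into the memory subsystem.
pub trait PressureOps {
    /// Drop one clean cache page; returns true if a frame was freed.
    fn reclaim_clean(&mut self) -> bool;
    /// Swap out one eligible page; returns true if a frame was freed.
    fn swap_out_one(&mut self) -> bool;
    /// Swap in or zero-fill the faulting page; returns true on success.
    fn try_restore(&mut self) -> bool;
}

#[derive(Debug, PartialEq, Eq)]
pub enum ExitReason {
    OutOfMemory,
}

#[derive(Debug, PartialEq, Eq)]
pub enum FaultResolution {
    Resumed,
    Terminated(ExitReason),
}

/// Each step runs at most once, so the handler is bounded.
pub fn handle_fault<M: PressureOps>(m: &mut M) -> FaultResolution {
    // Assumed fast path: a frame may already be available.
    if m.try_restore() {
        return FaultResolution::Resumed;
    }
    // Step 1: reclaim, then retry.
    if m.reclaim_clean() && m.try_restore() {
        return FaultResolution::Resumed;
    }
    // Step 2: swap out another eligible page, then retry.
    if m.swap_out_one() && m.try_restore() {
        return FaultResolution::Resumed;
    }
    // Step 4: only the faulting process is terminated.
    FaultResolution::Terminated(ExitReason::OutOfMemory)
}

/// Toy model used to exercise the handler.
pub struct FakeMachine {
    pub free_frames: u32,
    pub clean_cache: u32,
    pub swappable: u32,
}

impl PressureOps for FakeMachine {
    fn reclaim_clean(&mut self) -> bool {
        if self.clean_cache == 0 {
            return false;
        }
        self.clean_cache -= 1;
        self.free_frames += 1;
        true
    }
    fn swap_out_one(&mut self) -> bool {
        if self.swappable == 0 {
            return false;
        }
        self.swappable -= 1;
        self.free_frames += 1;
        true
    }
    fn try_restore(&mut self) -> bool {
        if self.free_frames == 0 {
            return false;
        }
        self.free_frames -= 1;
        true
    }
}
```

With one swappable page and no free frames, the first fault resumes after a swap-out; a second fault on the same machine has nothing left to reclaim and terminates the faulting process.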

Boot

Boot remains a special case. If the kernel cannot allocate its own core heap, page tables, or init process, the system cannot proceed. Those failures remain boot-fatal until the architecture moves more kernel object memory under explicit authority.

This proposal does not pretend otherwise. It narrows runtime behavior first and only then pushes on the deeper architectural question of who funds kernel objects.

Budget Model

The long-term model should separate commitment from residency.

  • Reserved virtual pages: address-space ranges the process owns but that do not yet promise physical backing. The Go allocator contract charges these to a separate virtual-reservation quota.
  • Committed pages: memory the system has promised can exist for a process. This is what VirtualMemory.commit, compatibility VirtualMemory.map, and future runtime heap growth should charge.
  • Resident pages: memory currently backed by a physical frame.
  • Pinned pages: resident pages that reclaim and swap may not touch.
  • Swapped pages: committed but non-resident anonymous pages with an encrypted slot on a swap area.

The detailed Go/runtime ABI for splitting virtual reservation from physical commitment is described in the Go VirtualMemory Contract proposal. This proposal’s no-overcommit rule applies at commit time, not at pure reservation time.

At spawn time, a parent or broker should be able to set a memory budget for the child. A minimal future shape is:

struct MemoryBudget {
    committedPages @0 :UInt32;
    pinnedPagesMax @1 :UInt32;
    allowSwap @2 :Bool;
    swapPagesMax @3 :UInt32;
    virtualReservationPagesMax @4 :UInt64;
}

This budget does not require capOS to expose Linux-style cgroups. It is a capability-native admission contract between parent and child.
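As an illustration of the parent/child contract, the sketch below mirrors the schema in Rust and carves a child budget out of a parent's. The donation semantics are an assumption borrowed from the Genode-style accounting cited under Design Grounding; `grant` and `GrantError` are hypothetical names.

```rust
/// Rust mirror of the MemoryBudget schema shape (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MemoryBudget {
    pub committed_pages: u32,
    pub pinned_pages_max: u32,
    pub allow_swap: bool,
    pub swap_pages_max: u32,
    pub virtual_reservation_pages_max: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum GrantError {
    /// The child asks for more than the parent holds.
    ExceedsParent,
    /// The child asks for swap but the parent may not swap.
    SwapNotPermitted,
}

impl MemoryBudget {
    /// Assumed donation rule: whatever the child receives is subtracted
    /// from the parent. Returns the parent's remaining budget.
    pub fn grant(&self, child: &MemoryBudget) -> Result<MemoryBudget, GrantError> {
        if child.allow_swap && !self.allow_swap {
            return Err(GrantError::SwapNotPermitted);
        }
        Ok(MemoryBudget {
            committed_pages: self
                .committed_pages
                .checked_sub(child.committed_pages)
                .ok_or(GrantError::ExceedsParent)?,
            pinned_pages_max: self
                .pinned_pages_max
                .checked_sub(child.pinned_pages_max)
                .ok_or(GrantError::ExceedsParent)?,
            allow_swap: self.allow_swap,
            swap_pages_max: self
                .swap_pages_max
                .checked_sub(child.swap_pages_max)
                .ok_or(GrantError::ExceedsParent)?,
            virtual_reservation_pages_max: self
                .virtual_reservation_pages_max
                .checked_sub(child.virtual_reservation_pages_max)
                .ok_or(GrantError::ExceedsParent)?,
        })
    }
}
```

Under this rule a child can never hold more than its parent chose to give up, which keeps responsibility for every committed page traceable to an explicit grant.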

Swap Support

Position

Swap is useful, but only as a constrained extension of the non-overcommit model.

Swap must not mean:

  • “pretend RAM is infinite,”
  • “the kernel can now kill random processes later,”
  • or “all memory classes are equivalent.”

Instead, swap means: some anonymous pages may be evicted to an encrypted backing area, subject to explicit budgets and page-class rules.

Phase-1 Swap Scope

The first swap implementation should be intentionally narrow:

  • only anonymous private pages created through VirtualMemory,
  • only for mappings that are explicitly swappable,
  • no swapfiles,
  • no filesystem dependency,
  • no userspace pager in the fault path,
  • no swapping of MemoryObject result caps, shared IPC pages, or device/DMA memory.

That scope is small on purpose. Once the first swap implementation exists, expanding eligibility is easy; debugging a too-clever pager in the page-fault path is not.

Backing Store

Phase 1 should use a dedicated swap extent, not a regular file.

Reasons:

  • a file-backed swap path drags in namespace, filesystem, metadata writeback, and deadlock questions too early,
  • a dedicated extent is easier to bound and reason about,
  • and encryption/integrity policy is cleaner when the medium is dedicated to swap slots.

Provisioning should happen through init or a future storage broker that discovers a block extent and passes it into a kernel configuration path.

Compression

Compressed swap caches are a reasonable later optimization, but not the first one to build.

Linux’s zswap design is a useful warning here: it keeps a dynamically sized compressed pool in RAM and evicts from that pool to a backing swap device when the pool reaches its limit. That can improve I/O behavior, but it also creates another reclaim tier with its own sizing, hysteresis, and writeback policy.

capOS should not start there. Phase 1 should write eligible pages directly to the encrypted swap extent. A compressed in-RAM layer can be added later only after the basic swap accounting, eviction, integrity, and observability rules are stable.

Encryption and Integrity

Swap must be encrypted by default.

The crypto policy should match the existing key-management and volume-encryption direction:

  • use a fresh per-boot ephemeral symmetric key that lives only in RAM,
  • never persist that key,
  • invalidate all prior swap contents on boot,
  • authenticate every swapped page so stale-slot replay and random corruption do not silently produce attacker-controlled plaintext.

This has one deliberate consequence: hibernation is out of scope for the first design. Per-boot keys make resume-across-reboot impossible, which is the correct tradeoff for an early capability OS that does not yet have a full trusted suspend/resume story.
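One common way to defeat stale-slot replay with an authenticated cipher is to bind each page's ciphertext to its slot index and a per-slot write generation through the cipher's associated data. The sketch below shows only that binding; the cipher itself is deliberately omitted, and `SlotMeta`, `associated_data`, and `next_write` are hypothetical names. It assumes the generation counter lives in kernel memory, not on the swap medium.

```rust
/// Hypothetical per-slot metadata, assumed to be kept in kernel RAM.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SlotMeta {
    pub slot: u64,
    pub generation: u64,
}

/// Associated data that would be fed to an authenticated cipher together
/// with the page. A page written for a different slot, or for an earlier
/// generation of the same slot, then fails authentication on read.
pub fn associated_data(meta: SlotMeta) -> [u8; 16] {
    let mut aad = [0u8; 16];
    aad[..8].copy_from_slice(&meta.slot.to_le_bytes());
    aad[8..].copy_from_slice(&meta.generation.to_le_bytes());
    aad
}

/// Advance the generation on every rewrite of a slot. Returns None on
/// counter overflow so the caller can retire the slot instead of reusing
/// a generation value.
pub fn next_write(meta: SlotMeta) -> Option<SlotMeta> {
    meta.generation
        .checked_add(1)
        .map(|generation| SlotMeta { slot: meta.slot, generation })
}
```

A real design would also need a nonce discipline for the chosen cipher; that is outside what this sketch covers.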

Page Eligibility

A mapping should carry an explicit policy bit or enum rather than forcing all anonymous pages into one bucket.

A future VirtualMemory.map shape should move from bare protection flags to options that express residency policy:

enum MemoryResidency {
    normal @0;     # reclaimable, swap if allowed by budget
    pinned @1;     # must stay resident
    secret @2;     # resident only; zero aggressively; never swap
}

This is a better fit than inventing ad hoc “don’t swap this one page” special cases later for crypto heaps, broker secrets, or device buffers.
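A kernel-side check combining the per-mapping residency value with the process's swap budget might look like the following. It is a sketch only: `SwapQuota` and `may_swap_out` are hypothetical, and the quota fields are assumed to be derived from the spawn-time budget.

```rust
/// Rust mirror of the MemoryResidency enum (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemoryResidency {
    Normal,
    Pinned,
    Secret,
}

/// Hypothetical view of a process's swap allowance.
#[derive(Debug, Clone, Copy)]
pub struct SwapQuota {
    pub allow_swap: bool,
    pub swap_pages_max: u32,
    pub swap_pages_used: u32,
}

/// A page may be swapped out only if its mapping is `Normal` and the
/// process still has swap budget left.
pub fn may_swap_out(residency: MemoryResidency, quota: &SwapQuota) -> bool {
    match residency {
        MemoryResidency::Normal => {
            quota.allow_swap && quota.swap_pages_used < quota.swap_pages_max
        }
        // Pinned and secret mappings never spill, whatever the budget says.
        MemoryResidency::Pinned | MemoryResidency::Secret => false,
    }
}
```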

Fault Path Semantics

On a page fault to a swapped-out page:

  1. the kernel locates the slot metadata,
  2. allocates a free frame, or frees one through reclaim,
  3. reads and authenticates the page,
  4. remaps the page,
  5. resumes the process.

If the slot cannot be restored because no frame can be made available, or the page fails integrity validation, the kernel terminates the faulting process with a distinct exit reason. It must not inject zeros, fabricate stale data, or retry indefinitely.
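The swap-in steps and their fail-closed behavior can be sketched as follows. All names (`SwapOps`, `SwapInFailure`, `FakeSwap`) are hypothetical. The test double uses `Vec` for brevity, which a real pressure path would avoid.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ReadError {
    Io,
    BadTag,
}

/// Distinct failure causes, so the exit reason can say what happened.
#[derive(Debug, PartialEq, Eq)]
pub enum SwapInFailure {
    NoFrame,
    Io,
    IntegrityCheckFailed,
}

/// Hypothetical hooks into frame allocation, swap I/O, and mapping.
pub trait SwapOps {
    fn obtain_frame(&mut self) -> Option<u64>;
    fn read_and_authenticate(&mut self, slot: u64, frame: u64) -> Result<(), ReadError>;
    fn release_frame(&mut self, frame: u64);
    fn map_page(&mut self, vpage: u64, frame: u64);
}

/// One attempt, no retry loop. On failure nothing is mapped, so the
/// process never observes zeros or stale data.
pub fn swap_in<S: SwapOps>(s: &mut S, vpage: u64, slot: u64) -> Result<(), SwapInFailure> {
    let frame = s.obtain_frame().ok_or(SwapInFailure::NoFrame)?;
    if let Err(e) = s.read_and_authenticate(slot, frame) {
        // Fail closed: return the frame before reporting the error.
        s.release_frame(frame);
        return Err(match e {
            ReadError::Io => SwapInFailure::Io,
            ReadError::BadTag => SwapInFailure::IntegrityCheckFailed,
        });
    }
    s.map_page(vpage, frame);
    Ok(())
}

/// Toy backend for exercising the path.
pub struct FakeSwap {
    pub free_frames: Vec<u64>,
    pub corrupt_slot: Option<u64>,
    pub mapped: Vec<(u64, u64)>,
}

impl SwapOps for FakeSwap {
    fn obtain_frame(&mut self) -> Option<u64> {
        self.free_frames.pop()
    }
    fn read_and_authenticate(&mut self, slot: u64, _frame: u64) -> Result<(), ReadError> {
        if self.corrupt_slot == Some(slot) {
            Err(ReadError::BadTag)
        } else {
            Ok(())
        }
    }
    fn release_frame(&mut self, frame: u64) {
        self.free_frames.push(frame);
    }
    fn map_page(&mut self, vpage: u64, frame: u64) {
        self.mapped.push((vpage, frame));
    }
}
```

The caller would translate any `SwapInFailure` into termination of the faulting process with a matching exit reason.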

Why Not a Userspace Pager First

A pure userspace pager is attractive in theory but wrong as the initial step. The current kernel does not have the scheduler, storage, and fault-notification machinery needed to make page-fault RPC safe and bounded under memory pressure.

The first swap design should therefore keep the fault mechanism and slot metadata in kernel while keeping the provisioning and high-level policy outside the kernel where possible.

An external pager can remain a later phase once capOS has:

  • notifications,
  • richer process/thread lifecycle control,
  • deadlock-resistant fault upcalls,
  • and a storage stack that can be driven safely during memory pressure.

Interface and Lifecycle Changes

This proposal implies a few interface changes, even if the exact schema names change later.

Process Exit Reporting

Supervisors need to know whether a child:

  • exited normally,
  • hit a capability exception,
  • faulted on memory corruption,
  • or died because memory pressure could not be satisfied.

That argues for a typed exit record rather than flattening everything into one numeric code.
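A typed exit record could take roughly the following shape. The variant names and payloads are assumptions for illustration, and `example_policy` is just one policy a supervisor might choose, not a recommendation.

```rust
/// Hypothetical typed exit record covering the four cases above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExitRecord {
    Normal { code: i32 },
    CapabilityException,
    MemoryFault { address: u64 },
    OutOfMemory,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SupervisorAction {
    NoAction,
    Restart,
    /// Restarting with the same budget would likely fail again.
    ReviewBudget,
}

/// One possible supervisor policy. The point is that the decision is
/// explicit and lives outside the kernel.
pub fn example_policy(record: ExitRecord) -> SupervisorAction {
    match record {
        ExitRecord::Normal { .. } => SupervisorAction::NoAction,
        ExitRecord::CapabilityException | ExitRecord::MemoryFault { .. } => {
            SupervisorAction::Restart
        }
        ExitRecord::OutOfMemory => SupervisorAction::ReviewBudget,
    }
}
```

A single numeric exit code could not support this match without out-of-band conventions about which values mean what.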

Spawn-Time Memory Budgets

ProcessSpawner should eventually accept resource limits, including a memory budget, rather than assuming every child competes in one shared frame pool.

Monitoring

A future monitoring/status surface should expose at least:

  • committed pages,
  • resident pages,
  • pinned pages,
  • swapped pages,
  • swap I/O failures,
  • reclaim counts,
  • and per-process OOM termination counts.

Without that, operators will not be able to distinguish “the child leaked heap” from “the kernel pinned too much unswappable state.”

Security Requirements

Memory-pressure code is security-sensitive, not just performance-sensitive.

Required properties:

  • reclaim and swap metadata operations are bounded and fail closed,
  • swap ciphertext is authenticated, not just encrypted,
  • freed swap slots cannot be read by another process,
  • secret/pinned mappings never spill to swap,
  • swap enable/disable transitions do not expose stale plaintext,
  • and pressure paths avoid allocation where possible.

The last point matters because allocating heap memory while handling OOM is how systems spiral into recursive failure and panic surfaces.
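One way to keep pressure paths allocation-free is to hold a small, fixed-capacity reserve that is filled while memory is plentiful and drained only by the pressure handler. The sketch below uses an array-backed stack so that neither operation touches the heap. `FrameReserve` is a hypothetical name, and frames are represented as plain `u64` addresses for illustration.

```rust
/// Fixed-capacity stack of frame addresses with no heap allocation.
pub struct FrameReserve<const N: usize> {
    frames: [u64; N],
    len: usize,
}

impl<const N: usize> FrameReserve<N> {
    pub const fn new() -> Self {
        Self { frames: [0; N], len: 0 }
    }

    /// Add a frame while memory is available. Hands the frame back to
    /// the caller if the reserve is already full.
    pub fn refill(&mut self, frame: u64) -> Result<(), u64> {
        if self.len == N {
            return Err(frame);
        }
        self.frames[self.len] = frame;
        self.len += 1;
        Ok(())
    }

    /// Take a frame during pressure handling; never allocates.
    pub fn take(&mut self) -> Option<u64> {
        if self.len == 0 {
            return None;
        }
        self.len -= 1;
        Some(self.frames[self.len])
    }

    pub fn len(&self) -> usize {
        self.len
    }
}
```

Because the capacity is a compile-time constant, the worst-case memory cost of the pressure path is fixed and reviewable.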

Relationship to Existing Proposals

  • Error Handling: resource exhaustion should map to overloaded or typed OOM results at explicit call boundaries, not generic panic text.
  • Service Architecture: parents and supervisors should own memory budgets just as they own capability grants.
  • Storage and Naming: swap should use explicit backing extents, not ambient filesystem paths.
  • Volume Encryption / Key Management: swap encryption uses a per-boot ephemeral symmetric key; persistent encryption keys are unnecessary for the first design.

Phases

Phase 0: Normalize Runtime OOM Semantics

  • Remove remaining runtime panic surfaces on untrusted allocation paths.
  • Distinguish boot-fatal OOM from service-facing overloaded.
  • Add typed process-exit reporting for OOM and failed swap-in.

Phase 1: Budgeted Anonymous Memory

  • Add spawn-time memory budgets.
  • Charge anonymous VirtualMemory.commit and compatibility VirtualMemory.map against committed-page budget.
  • Charge anonymous VirtualMemory.reserve against virtual address-space quota.
  • Mark pinned vs. swappable vs. secret mappings explicitly.

Phase 2: Reclaim Without Swap

  • Add clean-cache reclaim and dead-process cleanup accounting.
  • Expose pressure metrics and events.
  • Keep allocation failure deterministic when reclaim cannot help.

Phase 3: Encrypted Kernel-Managed Swap

  • Add dedicated swap extent provisioning.
  • Add encrypted/authenticated page slots with per-boot ephemeral keying.
  • Support swap for anonymous private pages only.
  • Terminate the faulting process cleanly when swap-in cannot succeed.

Phase 4: Optional External Pager

  • Revisit pager upcalls only after notifications, richer lifecycle control, and storage-stack maturity exist.
  • Keep the kernel fault path bounded even if policy moves outward.

Open Questions

  1. Should capOS ever add demand commit on first access after the explicit reserve/commit contract, or should runtime allocators keep making commitment visible through capability calls?
  2. Should executable anonymous pages be swappable in phase 1, or should swap be limited to writable anonymous pages until code-loading semantics mature?
  3. When MemoryObject grows richer sharing semantics, should some subclasses be reclaimable-from-backing rather than unswappable?
  4. Does a future secret mapping need stronger guarantees than “never swap,” such as forced zero-on-fork, no-core-dump, and cache-flush hooks?
  5. How much kernel memory should remain permanently reserved before the system starts admitting user commitments?

Bottom Line

capOS should treat OOM as an authority and lifecycle problem, not as a last-gasp allocator surprise. The default system should use explicit budgets and no overcommit, return typed exhaustion at API boundaries, reserve process death only for unsatisfied execution faults, and add encrypted swap later as a narrow extension for anonymous private pages.