Proposal: OOM Handling and Swap
How capOS should behave under memory pressure, what “out of memory” means at different boundaries, and how optional swap support fits the capability model.
Related
- Memory Management documents the current implemented frame, page-table, VirtualMemory, and MemoryObject behavior.
- Go VirtualMemory Contract defines the near-term distinction between virtual reservation and physical commit that this proposal builds on.
- Resource Accounting and Quotas defines the ledger vocabulary used for memory-pressure policy.
Problem
capOS already has several local out-of-memory paths:
- boot-time allocation failures that are still fatal,
- service-facing operations that return a controlled error,
- rollback paths that free partially allocated state, and
- hostile-path tests that prove some frame-exhaustion cases.
What the tree does not have yet is a coherent memory-pressure policy. There is no system-wide answer to these questions:
- When should an allocation fail immediately vs. trigger reclaim?
- Which memory is reclaimable, swappable, or permanently pinned?
- What outcome should a process observe when a page fault cannot be satisfied?
- Who is allowed to decide that another process should die under memory pressure?
Without that policy, the codebase will drift into a mix of local conventions:
some paths return Overloaded, some return interface-specific failure text,
some remain boot-fatal, and future swap support would have no clear ownership
or threat model.
Design Goals
- No ambient OOM killer. The kernel must not scan the system for an arbitrary victim and kill it Linux-style.
- Explicit accounting. Memory exhaustion must be understood in terms of budgets, commitments, and reclaimability, not just “the allocator returned None.”
- Typed failure semantics. Callers must be able to distinguish invalid requests, local budget exhaustion, transient pressure, and fatal page-fault failure.
- Fail closed. Memory-pressure code must not corrupt capability state, silently drop dirty data, or leave half-constructed kernel objects behind.
- Swap is optional. capOS must work without swap. Swap is a policy and deployment choice, not a baseline requirement.
- Security first. Swap must not become a secret-leak side channel or an integrity hole.
Non-Goals
- Transparent global persistence in the EROS sense.
- General-purpose overcommit as the default memory model.
- Swapping kernel metadata, capability rings, CapSet pages, or DMA-pinned memory.
- A userspace pager dependency in the first swap implementation.
Design Grounding
This proposal deliberately borrows from three existing design directions in the research set:
- Genode: strict memory accounting and quota donation are the right default because they avoid an ambient OOM killer and make responsibility obvious.
- seL4: explicit memory authority is preferable to a kernel that can create new backing objects out of thin air when under pressure.
- EROS / CapROS / Coyotos: do not make implicit persistent backing store the baseline. capOS already chose explicit persistence and should not back into a single-level-store design through swap.
The result is not a copy of any of those systems. capOS keeps explicit capability-granted memory objects and ordinary page tables, but adopts the accounting discipline that makes OOM behavior reviewable.
Core Policy
1. No Overcommit by Default
The default rule is simple: a process may only create anonymous memory if the system can charge that commitment to a real budget.
That means:
- anonymous VirtualMemory.commit and compatibility VirtualMemory.map consume committed-page budget,
- anonymous VirtualMemory.reserve consumes virtual address-space quota only and does not promise physical backing,
- resident pages consume real frame availability when they are instantiated,
- swap, when enabled, extends commitment capacity only for memory classes that explicitly allow it,
- and no interface may assume that a later background OOM killer will clean up a bad admission decision.
This follows the same principle as capability authority in general: if a child needs more memory, some parent or broker must have chosen to give it that room.
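The admission rule above can be sketched as a small ledger. This is a minimal illustration, not the capOS implementation: the `Ledger` type, its fields, and `AdmitError` are hypothetical names introduced here. It shows reservation charging only the virtual address-space quota, and commit being rejected up front when the committed-page budget cannot cover it.

```rust
/// Hypothetical per-process memory ledger. Field and method names are
/// illustrative assumptions, not the capOS ABI.
#[derive(Debug, Default)]
pub struct Ledger {
    pub committed_limit: u64,
    pub committed_used: u64,
    pub reservation_limit: u64,
    pub reservation_used: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AdmitError {
    /// The caller's own budget cannot cover the request.
    BudgetExhausted,
}

/// Adds `pages` to `used` only if the result stays within `limit`.
fn charge(used: &mut u64, limit: u64, pages: u64) -> Result<(), AdmitError> {
    match used.checked_add(pages) {
        Some(next) if next <= limit => {
            *used = next;
            Ok(())
        }
        // Overflow or over-limit: reject and leave the ledger untouched.
        _ => Err(AdmitError::BudgetExhausted),
    }
}

impl Ledger {
    /// Reservation charges virtual address-space quota only; it makes no
    /// promise of physical backing.
    pub fn reserve(&mut self, pages: u64) -> Result<(), AdmitError> {
        charge(&mut self.reservation_used, self.reservation_limit, pages)
    }

    /// Commit charges the committed-page budget at admission time, so a
    /// bad admission never has to be cleaned up later.
    pub fn commit(&mut self, pages: u64) -> Result<(), AdmitError> {
        charge(&mut self.committed_used, self.committed_limit, pages)
    }
}
```

In this sketch a failed charge leaves both counters unchanged, which is the property the rest of the proposal relies on when it asks for deterministic failure.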
2. The Kernel Never Picks a Random Victim
When memory is tight, the kernel may:
- reclaim kernel-known clean caches,
- free resources from already-dead processes,
- swap out eligible anonymous pages,
- reject a new allocation,
- or terminate the faulting process when its own page cannot be restored.
What it must not do is kill an unrelated process just because it happens to be large. Cross-process eviction is a supervisor policy decision, not a kernel allocator side effect.
Supervisors remain free to implement their own policy. A shell/session broker or future service manager can decide to stop a child, reduce its budget, or restart it. That decision is explicit and auditable rather than hidden inside the low-level frame allocator.
3. Distinguish Four Memory Outcomes
capOS should treat these as different cases, not variants of one string:
| Situation | Required behavior |
|---|---|
| Invalid request (size=0, misaligned range, quota metadata malformed) | Deterministic failed / request validation error |
| Caller exhausted its allowed budget | Deterministic overloaded or typed outOfMemory result |
| Global pressure, but reclaim/swap may succeed | Reclaim first, then retry locally |
| Faulting page cannot be restored or committed | Terminate the faulting process with an explicit OOM exit reason |
The important distinction is between synchronous API failure and asynchronous execution failure. If a capability call asks for more memory, it should get an error back. If a process touches a swapped-out page and the system cannot bring it back, there is no capability return value to encode. That must be a process-lifecycle event.
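One way to keep the four cases from collapsing into one string is to give each its own variant. The sketch below restates the table as code; `Situation`, `Behavior`, and both functions are hypothetical names chosen for illustration, not existing capOS types.

```rust
/// The four situations from the table above (names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Situation {
    InvalidRequest,
    CallerBudgetExhausted,
    GlobalPressure,
    FaultUnsatisfiable,
}

/// What the system is required to do in each situation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Behavior {
    ReturnFailed,
    ReturnOverloaded,
    ReclaimThenRetry,
    TerminateFaultingProcess,
}

pub fn required_behavior(s: Situation) -> Behavior {
    match s {
        Situation::InvalidRequest => Behavior::ReturnFailed,
        Situation::CallerBudgetExhausted => Behavior::ReturnOverloaded,
        Situation::GlobalPressure => Behavior::ReclaimThenRetry,
        Situation::FaultUnsatisfiable => Behavior::TerminateFaultingProcess,
    }
}

/// True when the outcome cannot be expressed as a capability return value
/// and must be reported through the process lifecycle instead.
pub fn is_lifecycle_event(b: Behavior) -> bool {
    matches!(b, Behavior::TerminateFaultingProcess)
}
```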
Memory Classes
The reclaim policy depends on what kind of memory is being discussed.
| Class | Examples | Reclaim policy |
|---|---|---|
| Kernel-reserved, unswappable | kernel heap, page tables, scheduler/process metadata, cap-table backing, ring scratch | Never swap; pressure here is a kernel-capacity problem |
| User pinned, unswappable | capability ring page, CapSet page, DMA buffers, wired mappings, key material, future mlock-style regions | Never swap; allocation fails if unavailable |
| Reclaimable clean cache | boot-package cache, future filesystem cache, executable pages that can be reloaded, clean read-only object pages | Drop and refetch rather than swap |
| Anonymous private swappable | ordinary heap/stack/anonymous VM pages that opt into swap | Swap-eligible if policy allows it |
| Shared/persistent object pages | MemoryObject, mapped content-addressed store pages, future file-backed shared memory | Not part of phase-1 swap; treat as reclaim/drop or keep resident based on object semantics |
Two rules matter here:
- Clean cache is not swap. If a page can be reconstructed from a trusted backing object without preserving dirty state, reclaim it by dropping it.
- Pinned means pinned. If a page participates in DMA, capability transport, bootstrap identity, or secret handling, treat it as unswappable unless a later design proves otherwise.
DMA pages are a pinned residency class with additional lifecycle constraints:
they must be committed before exposure to the device, resident for the entire
device-visible lifetime, unswappable while mapped by a DMAPool or IOMMU
domain, and scrubbed before release to another owner. Reclaim is not allowed to
make progress on a DMA page; pressure must surface as admission failure or
device-manager teardown.
Device-written DMA pages are untrusted input until validated by the owning
driver or network/storage stack. Pinning and residency prevent reclaim races;
they do not make device bytes trustworthy, nor do they grant ordinary
MemoryObject authority over the backing frames.
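The class table can be read as a single decision function. The following is a sketch under the assumption that each page carries exactly one class tag; `MemoryClass`, `ReclaimAction`, and `reclaim_action` are hypothetical names, and DMA pages fall under the pinned class here.

```rust
/// Memory classes from the table above (names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemoryClass {
    KernelReserved,
    UserPinned,
    CleanCache,
    AnonymousSwappable,
    SharedObject,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReclaimAction {
    /// Reclaim may not make progress on this page.
    Never,
    /// Drop the page; it can be reconstructed from a trusted backing object.
    DropAndRefetch,
    /// Write the page to the encrypted swap extent.
    SwapOut,
    /// Outside phase-1 swap; decided by the object's own semantics.
    ObjectDefined,
}

pub fn reclaim_action(class: MemoryClass, swap_allowed: bool) -> ReclaimAction {
    match class {
        // Pinned means pinned, for kernel and user pages alike.
        MemoryClass::KernelReserved | MemoryClass::UserPinned => ReclaimAction::Never,
        // Clean cache is not swap.
        MemoryClass::CleanCache => ReclaimAction::DropAndRefetch,
        MemoryClass::AnonymousSwappable if swap_allowed => ReclaimAction::SwapOut,
        MemoryClass::AnonymousSwappable => ReclaimAction::Never,
        MemoryClass::SharedObject => ReclaimAction::ObjectDefined,
    }
}
```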
Failure Semantics by Boundary
Capability Calls
For explicit allocation requests, return a structured failure rather than panicking:
- VirtualMemory.map should return overloaded or a typed OOM result when the request cannot be satisfied.
- ProcessSpawner.spawn should continue the current direction: bounded parsing, fallible allocation, Overloaded on resource exhaustion.
- Future interfaces where OOM is a normal domain outcome should prefer a typed union result rather than an exception string.
This is consistent with the existing error-handling proposal: temporary resource exhaustion is not the same thing as malformed input.
Page Faults
Page faults are different. A faulting instruction does not have a natural request/response channel. The policy should therefore be:
- attempt reclaim,
- attempt swap-out of another eligible page if that creates room,
- attempt swap-in or zero-fill for the requested page,
- if that still fails, terminate the faulting process with a typed exit reason such as outOfMemory.
That is not an ambient OOM killer. It is the equivalent of delivering an unrecoverable execution fault to the process whose own memory access could not be satisfied.
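The fallback order above can be sketched as a bounded sequence with no retry loop. Everything named here is hypothetical: the `FaultBackend` trait, the `SimBackend` simulation, and the initial attempt to satisfy the fault before any reclaim are assumptions made for the sketch, not part of the proposal.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExitReason {
    OutOfMemory,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FaultResult {
    Resumed,
    Terminated(ExitReason),
}

/// Hypothetical hooks the fault handler would call. Each returns whether
/// it made progress.
pub trait FaultBackend {
    fn reclaim_clean(&mut self) -> bool;
    fn swap_out_other(&mut self) -> bool;
    /// Swap-in or zero-fill of the faulting page; needs one free frame.
    fn restore_page(&mut self) -> bool;
}

/// Bounded: every step runs at most once, then the faulting process
/// (and only that process) is terminated.
pub fn handle_fault(b: &mut impl FaultBackend) -> FaultResult {
    // Assumption: try the cheap path first in case a frame is already free.
    if b.restore_page() {
        return FaultResult::Resumed;
    }
    if b.reclaim_clean() && b.restore_page() {
        return FaultResult::Resumed;
    }
    if b.swap_out_other() && b.restore_page() {
        return FaultResult::Resumed;
    }
    FaultResult::Terminated(ExitReason::OutOfMemory)
}

/// Toy simulation used only to exercise the ordering.
pub struct SimBackend {
    pub free_frames: u32,
    pub clean_cache_pages: u32,
    pub swappable_pages: u32,
}

impl FaultBackend for SimBackend {
    fn reclaim_clean(&mut self) -> bool {
        if self.clean_cache_pages == 0 {
            return false;
        }
        self.clean_cache_pages -= 1;
        self.free_frames += 1;
        true
    }
    fn swap_out_other(&mut self) -> bool {
        if self.swappable_pages == 0 {
            return false;
        }
        self.swappable_pages -= 1;
        self.free_frames += 1;
        true
    }
    fn restore_page(&mut self) -> bool {
        if self.free_frames == 0 {
            return false;
        }
        self.free_frames -= 1;
        true
    }
}
```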
Boot
Boot remains a special case. If the kernel cannot allocate its own core heap, page tables, or init process, the system cannot proceed. Those failures remain boot-fatal until the architecture moves more kernel object memory under explicit authority.
This proposal does not pretend otherwise. It narrows runtime behavior first and only then pushes on the deeper architectural question of who funds kernel objects.
Budget Model
The long-term model should separate commitment from residency.
- Reserved virtual pages: address-space ranges the process owns but that do not yet promise physical backing. The Go allocator contract charges these to a separate virtual-reservation quota.
- Committed pages: memory the system has promised can exist for a process. This is what VirtualMemory.commit, compatibility VirtualMemory.map, and future runtime heap growth should charge.
- Resident pages: memory currently backed by a physical frame.
- Pinned pages: resident pages that reclaim and swap may not touch.
- Swapped pages: committed but non-resident anonymous pages with an encrypted slot on a swap area.
The detailed Go/runtime ABI for splitting virtual reservation from physical commitment is Go VirtualMemory Contract. This proposal’s no-overcommit rule applies at commit time, not at pure reservation time.
At spawn time, a parent or broker should be able to set a memory budget for the child. A minimal future shape is:
struct MemoryBudget {
  committedPages @0 :UInt32;
  pinnedPagesMax @1 :UInt32;
  allowSwap @2 :Bool;
  swapPagesMax @3 :UInt32;
  virtualReservationPagesMax @4 :UInt64;
}
This budget does not require capOS to expose Linux-style cgroups. It is a capability-native admission contract between parent and child.
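To make the separation of commitment from residency concrete, the sketch below mirrors the budget fields in Rust and checks a set of usage counters against them. The `Usage` type, the `BudgetViolation` variants, and the specific invariants are assumptions for illustration; in particular, `resident + swapped <= committed` assumes all counters describe the same anonymous pages.

```rust
/// Rust mirror of the MemoryBudget schema shape (illustrative only).
#[derive(Debug, Clone, Copy)]
pub struct MemoryBudget {
    pub committed_pages: u32,
    pub pinned_pages_max: u32,
    pub allow_swap: bool,
    pub swap_pages_max: u32,
    pub virtual_reservation_pages_max: u64,
}

/// Hypothetical per-process usage counters.
#[derive(Debug, Clone, Copy, Default)]
pub struct Usage {
    pub reserved: u64,
    pub committed: u32,
    pub resident: u32,
    pub pinned: u32,
    pub swapped: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum BudgetViolation {
    Reservation,
    Commitment,
    Pinned,
    Swap,
    ResidencyExceedsCommitment,
}

pub fn check(b: &MemoryBudget, u: &Usage) -> Result<(), BudgetViolation> {
    if u.reserved > b.virtual_reservation_pages_max {
        return Err(BudgetViolation::Reservation);
    }
    if u.committed > b.committed_pages {
        return Err(BudgetViolation::Commitment);
    }
    // Pinned pages are resident pages that reclaim and swap may not touch.
    if u.pinned > b.pinned_pages_max || u.pinned > u.resident {
        return Err(BudgetViolation::Pinned);
    }
    // Swap extends capacity only when the budget explicitly allows it.
    if (!b.allow_swap && u.swapped > 0) || u.swapped > b.swap_pages_max {
        return Err(BudgetViolation::Swap);
    }
    // Assumption: a page is resident or swapped only after it was committed.
    if u64::from(u.resident) + u64::from(u.swapped) > u64::from(u.committed) {
        return Err(BudgetViolation::ResidencyExceedsCommitment);
    }
    Ok(())
}
```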
Swap Support
Position
Swap is useful, but only as a constrained extension of the non-overcommit model.
Swap must not mean:
- “pretend RAM is infinite,”
- “the kernel can now kill random processes later,”
- or “all memory classes are equivalent.”
Instead, swap means: some anonymous pages may be evicted to an encrypted backing area, subject to explicit budgets and page-class rules.
Phase-1 Swap Scope
The first swap implementation should be intentionally narrow:
- only anonymous private pages created through VirtualMemory,
- only for mappings that are explicitly swappable,
- no swapfiles,
- no filesystem dependency,
- no userspace pager in the fault path,
- no swapping of MemoryObject result caps, shared IPC pages, or device/DMA memory.
That scope is small on purpose. Once the first swap implementation exists, expanding eligibility is easy; debugging a too-clever pager in the page-fault path is not.
Backing Store
Phase 1 should use a dedicated swap extent, not a regular file.
Reasons:
- a file-backed swap path drags in namespace, filesystem, metadata writeback, and deadlock questions too early,
- a dedicated extent is easier to bound and reason about,
- and encryption/integrity policy is cleaner when the medium is dedicated to swap slots.
Provisioning should happen through init or a future storage broker that discovers a block extent and passes it into a kernel configuration path.
Compression
Compressed swap caches are a reasonable later optimization, but not the first one to build.
Linux’s zswap design is a useful warning here: it keeps a dynamically sized
compressed pool in RAM and evicts from that pool to a backing swap device when
the pool reaches its limit. That can improve I/O behavior, but it also creates
another reclaim tier with its own sizing, hysteresis, and writeback policy.
capOS should not start there. Phase 1 should write eligible pages directly to the encrypted swap extent. A compressed in-RAM layer can be added later only after the basic swap accounting, eviction, integrity, and observability rules are stable.
Encryption and Integrity
Swap must be encrypted by default.
The crypto policy should match the existing key-management and volume-encryption direction:
- use a fresh per-boot ephemeral symmetric key that lives only in RAM,
- never persist that key,
- invalidate all prior swap contents on boot,
- authenticate every swapped page so stale-slot replay and random corruption do not silently produce attacker-controlled plaintext.
This has one deliberate consequence: hibernation is out of scope for the first design. Per-boot keys make resume-across-reboot impossible, which is the correct tradeoff for an early capability OS that does not yet have a full trusted suspend/resume story.
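Authenticating a page in isolation does not by itself reject an older, validly authenticated ciphertext written to the same slot. One common approach, assumed here rather than specified by this proposal, is to bind the slot index and a per-write generation counter held in RAM into the associated data of an authenticated cipher. The sketch shows only that binding; `SlotBinding` and its layout are hypothetical, and no cipher is chosen or implemented.

```rust
/// Hypothetical in-RAM metadata bound to each encrypted swap slot.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SlotBinding {
    pub slot_index: u64,
    /// Incremented on every write to this slot; never persisted.
    pub generation: u64,
}

impl SlotBinding {
    /// Bytes to pass as associated data to the authenticated cipher.
    /// A ciphertext written under an older generation, or copied from a
    /// different slot, then fails authentication on swap-in.
    pub fn associated_data(&self) -> [u8; 16] {
        let mut out = [0u8; 16];
        out[..8].copy_from_slice(&self.slot_index.to_le_bytes());
        out[8..].copy_from_slice(&self.generation.to_le_bytes());
        out
    }

    /// Binding for the next write to the same slot. Returns None if the
    /// counter would wrap, so a generation value is never reused.
    pub fn next_write(&self) -> Option<SlotBinding> {
        let generation = self.generation.checked_add(1)?;
        Some(SlotBinding { slot_index: self.slot_index, generation })
    }
}
```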
Page Eligibility
A mapping should carry an explicit policy bit or enum rather than forcing all anonymous pages into one bucket.
A future VirtualMemory.map shape should move from bare protection flags to
options that express residency policy:
enum MemoryResidency {
  normal @0;  # reclaimable, swap if allowed by budget
  pinned @1;  # must stay resident
  secret @2;  # resident only; zero aggressively; never swap
}
This is a better fit than inventing ad hoc “don’t swap this one page” special cases later for crypto heaps, broker secrets, or device buffers.
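A sketch of how the kernel side might consult that policy follows. The Rust enum mirrors the schema above; `swap_eligible` and `must_zero_eagerly` are hypothetical helper names, and treating "zero aggressively" as a per-mapping flag is an assumption.

```rust
/// Rust mirror of the MemoryResidency schema enum (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemoryResidency {
    Normal,
    Pinned,
    Secret,
}

/// A page may be swapped only if its mapping is `Normal` and the owning
/// process's budget allows swap. Both conditions are required.
pub fn swap_eligible(residency: MemoryResidency, budget_allows_swap: bool) -> bool {
    residency == MemoryResidency::Normal && budget_allows_swap
}

/// Secret mappings are scrubbed as soon as they are released.
pub fn must_zero_eagerly(residency: MemoryResidency) -> bool {
    residency == MemoryResidency::Secret
}
```

With the policy carried by the mapping, a crypto heap or broker secret is excluded from swap by construction rather than by a later special case.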
Fault Path Semantics
On a page fault to a swapped-out page:
- the kernel locates the slot metadata,
- allocates a frame, reclaiming one if none is free,
- reads and authenticates the page,
- remaps the page,
- resumes the process.
If the slot cannot be restored because no frame can be made available, or the page fails integrity validation, the kernel terminates the faulting process with a distinct exit reason. It must not inject zeros, fabricate stale data, or retry indefinitely.
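The swap-in steps and their failure modes can be sketched as one fallible function. This is an illustration only: `swap_in`, `SwapInError`, and the two closures standing in for the frame allocator and the authenticated read are hypothetical, and a `Vec<u8>` stands in for a physical frame.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum SwapInError {
    /// No slot metadata exists for the faulting page.
    SlotMissing,
    /// No frame could be made available, even after reclaim.
    NoFrame,
    /// The page failed authentication.
    IntegrityFailure,
}

/// Returns the restored page contents, or a distinct error that the
/// caller maps to a typed exit reason. There is no retry loop.
pub fn swap_in<A, R>(
    slot: Option<u32>,
    alloc_frame: A,
    read_authenticated: R,
) -> Result<Vec<u8>, SwapInError>
where
    A: FnOnce() -> Option<Vec<u8>>,
    R: FnOnce(u32, &mut [u8]) -> bool,
{
    let slot = slot.ok_or(SwapInError::SlotMissing)?;
    let mut frame = alloc_frame().ok_or(SwapInError::NoFrame)?;
    if !read_authenticated(slot, &mut frame) {
        // Scrub whatever was written before the frame is dropped; the
        // process never sees unauthenticated or substituted bytes.
        frame.fill(0);
        return Err(SwapInError::IntegrityFailure);
    }
    Ok(frame)
}
```

On any error the page is never mapped, so the faulting process is terminated rather than resumed on zeros or stale data.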
Why Not a Userspace Pager First
A pure userspace pager is attractive in theory but wrong as the initial step. The current kernel does not have the scheduler, storage, and fault-notification machinery needed to make page-fault RPC safe and bounded under memory pressure.
The first swap design should therefore keep the fault mechanism and slot metadata in kernel while keeping the provisioning and high-level policy outside the kernel where possible.
An external pager can remain a later phase once capOS has:
- notifications,
- richer process/thread lifecycle control,
- deadlock-resistant fault upcalls,
- and a storage stack that can be driven safely during memory pressure.
Interface and Lifecycle Changes
This proposal implies a few interface changes, even if the exact schema names change later.
Process Exit Reporting
Supervisors need to know whether a child:
- exited normally,
- hit a capability exception,
- faulted on memory corruption,
- or died because memory pressure could not be satisfied.
That argues for a typed exit record rather than flattening everything into one numeric code.
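A typed exit record could look like the following sketch. The `ExitRecord` variants follow the four cases listed above; the `SupervisorAction` type and the example policy are hypothetical and only show that a supervisor can branch on the reason instead of decoding a number.

```rust
/// Hypothetical typed exit record delivered to a supervisor.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExitRecord {
    Exited { code: i32 },
    CapabilityException,
    MemoryFault,
    OutOfMemory,
}

/// Example supervisor decisions (policy lives outside the kernel).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SupervisorAction {
    Nothing,
    Restart,
    /// Do not restart blindly with the same budget.
    ReviewBudget,
}

/// One possible policy, shown only to illustrate branching on the reason.
pub fn example_policy(record: ExitRecord) -> SupervisorAction {
    match record {
        ExitRecord::Exited { code: 0 } => SupervisorAction::Nothing,
        ExitRecord::Exited { .. } => SupervisorAction::Restart,
        ExitRecord::CapabilityException | ExitRecord::MemoryFault => SupervisorAction::Restart,
        ExitRecord::OutOfMemory => SupervisorAction::ReviewBudget,
    }
}
```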
Spawn-Time Memory Budgets
ProcessSpawner should eventually accept resource limits, including a memory
budget, rather than assuming every child competes in one shared frame pool.
Monitoring
A future monitoring/status surface should expose at least:
- committed pages,
- resident pages,
- pinned pages,
- swapped pages,
- swap I/O failures,
- reclaim counts,
- and per-process OOM termination counts.
Without that, operators will not be able to distinguish “the child leaked heap” from “the kernel pinned too much unswappable state.”
Security Requirements
Memory-pressure code is security-sensitive, not just performance-sensitive.
Required properties:
- reclaim and swap metadata operations are bounded and fail closed,
- swap ciphertext is authenticated, not just encrypted,
- freed swap slots cannot be read by another process,
- secret/pinned mappings never spill to swap,
- swap enable/disable transitions do not expose stale plaintext,
- and pressure paths avoid allocation where possible.
The last point matters because allocating heap memory while handling OOM is how systems spiral into recursive failure and panic surfaces.
Relationship to Existing Proposals
- Error Handling: resource exhaustion should map to overloaded or typed OOM results at explicit call boundaries, not generic panic text.
- Service Architecture: parents and supervisors should own memory budgets just as they own capability grants.
- Storage and Naming: swap should use explicit backing extents, not ambient filesystem paths.
- Volume Encryption / Key Management: swap encryption uses a per-boot ephemeral symmetric key; persistent encryption keys are unnecessary for the first design.
Phases
Phase 0: Normalize Runtime OOM Semantics
- Remove remaining runtime panic surfaces on untrusted allocation paths.
- Distinguish boot-fatal OOM from service-facing overloaded.
- Add typed process-exit reporting for OOM and faulted swap-in.
Phase 1: Budgeted Anonymous Memory
- Add spawn-time memory budgets.
- Charge anonymous VirtualMemory.commit and compatibility VirtualMemory.map against committed-page budget.
- Charge anonymous VirtualMemory.reserve against virtual address-space quota.
- Mark pinned vs. swappable vs. secret mappings explicitly.
Phase 2: Reclaim Without Swap
- Add clean-cache reclaim and dead-process cleanup accounting.
- Expose pressure metrics and events.
- Keep allocation failure deterministic when reclaim cannot help.
Phase 3: Encrypted Kernel-Managed Swap
- Add dedicated swap extent provisioning.
- Add encrypted/authenticated page slots with per-boot ephemeral keying.
- Support swap for anonymous private pages only.
- Terminate the faulting process cleanly when swap-in cannot succeed.
Phase 4: Optional External Pager
- Revisit pager upcalls only after notifications, richer lifecycle control, and storage-stack maturity exist.
- Keep the kernel fault path bounded even if policy moves outward.
Open Questions
- Should capOS ever add demand commit on first access after the explicit reserve/commit contract, or should runtime allocators keep making commitment visible through capability calls?
- Should executable anonymous pages be swappable in phase 1, or should swap be limited to writable anonymous pages until code-loading semantics mature?
- When MemoryObject grows richer sharing semantics, should some subclasses be reclaimable-from-backing rather than unswappable?
- Does a future secret mapping need stronger guarantees than “never swap,” such as forced zero-on-fork, no-core-dump, and cache-flush hooks?
- How much kernel memory should remain permanently reserved before the system starts admitting user commitments?
Bottom Line
capOS should treat OOM as an authority and lifecycle problem, not as a last-gasp allocator surprise. The default system should use explicit budgets and no overcommit, return typed exhaustion at API boundaries, reserve process death only for unsatisfied execution faults, and add encrypted swap later as a narrow extension for anonymous private pages.