Proposal: Memory Authority Model

capOS already has implemented memory management. This proposal defines the missing cross-cutting contract: which capability may create or hold memory, which memory may move or be reclaimed, when a mapping mutation is complete, and which proofs are required before shared-memory, DMA, swap, and language-runtime features build on that substrate.

This is deliberately not a CPU or language memory-ordering model. Atomic ordering, cache coherence, and Rust aliasing rules remain their own topics. This page covers OS memory authority, residency, consistency, and lifecycle.

Memory Management documents current physical frames, address spaces, user-buffer validation, VirtualMemory, and MemoryObject.
Go VirtualMemory Contract freezes the reserve/commit/decommit contract this proposal treats as the current anonymous-memory baseline.
OOM Handling and Swap owns memory-pressure, OOM, reclaim, and encrypted swap policy. That proposal consumes the authority/residency vocabulary defined here; this proposal does not re-specify reclaim or swap policy.
Capability-Based Service Architecture owns authority-at-spawn, service composition, and the per-service capability graph that selects which memory authorities (anonymous VirtualMemory, MemoryObject, future pin/DMA/swap caps) a service receives. The classes and state machines below are the contract that service-graph budgeting and spawn-time memory grants must respect.
Resource Accounting and Quotas owns the one-ledger-of-record and reserve/commit/rollback/release vocabulary used by the accounting rules below.
Design Risks Register R4 tracks the consolidated open gap in cross-service donation/fairness, log volume accounting, and the memory authority/residency proof obligations this proposal must close before downstream features may depend on them.
DMA Isolation owns device-visible memory, IOMMU, stale DMA, and scrub-before-reuse requirements.
Park Authority records why shared park words need mapping identity or object pins before they can be safe across processes.
Memory Authority Model Backlog decomposes the research, design, and proof work behind this proposal.

Problem

The current tree has strong local rules, but they are spread across several documents and implementations:

anonymous VirtualMemory ranges separate reservation from physical commit;
MemoryObject caps can be shared and mapped into multiple address spaces;
user-buffer copies validate and use pointers while holding the address-space lock;
TLB shootdown is routed through address-space residency masks;
private ParkSpace cleanup handles anonymous unmap/decommit and explicit MemoryObject.unmap;
DMA isolation requires resident unswappable pages, generations, quiesce, and scrub-before-reuse;
OOM policy rejects default overcommit and forbids an ambient OOM killer.

Those rules are individually useful, but future work needs a single answer to questions such as:

Is this page only reserved, physically committed, resident, pinned, swapped, or device-visible?
Which cap or ledger is the authority that made it that way?
Can this backing frame be reclaimed, unmapped, shared, donated, or exposed to a device?
When is it safe to reuse a frame after unmap, decommit, protect, process exit, revoke, or failed rollback?
Which proof must exist before a shared park word, NIC ring, block buffer, Store blob, GPU buffer, Go heap arena, or swap slot depends on the rule?

Without a consolidated contract, later features can accidentally treat MemoryObject, SharedBuffer, DMA pool pages, anonymous VM pages, and swapped pages as interchangeable. They are not interchangeable authority classes.

Design Grounding

The project grounding for this proposal is:

docs/architecture/memory.md
docs/backlog/go-virtual-memory-contract.md
docs/proposals/oom-and-swap-proposal.md
docs/proposals/resource-accounting-proposal.md
docs/proposals/service-architecture-proposal.md
docs/design-risks-register.md (entry R4 – Resource accounting is fragmented)
docs/dma-isolation-design.md
docs/architecture/park.md
docs/architecture/scheduling.md
docs/architecture/userspace-runtime.md
docs/proposals/storage-and-naming-proposal.md
docs/proposals/go-runtime-proposal.md
docs/security/verification-workflow.md
REVIEW.md

Relevant research grounding:

docs/research/zircon.md for VMO/VMAR separation, mapping authority, and VMO commit/decommit precedent.
docs/research/genode.md for parent-routed sessions and resource quota discipline.
docs/research/sel4.md for explicit authority, no ambient allocation authority, and the proof value of small stable object invariants.
docs/research/eros-capros-coyotos.md for single-level-store and keeper lessons; capOS should not make transparent persistence or implicit paging the baseline by accident.
docs/research/llvm-target.md for Go/runtime memory hooks that require reserve, commit, decommit, and fault behavior.

Goals

Define the authoritative state vocabulary for user memory, shared memory, pinned memory, DMA memory, and future swapped memory.
Make every memory-state transition name the capability and ledger that authorizes it.
Specify when validation, mapping mutation, TLB shootdown, cleanup, and frame reuse are complete.
Keep future SharedBuffer, file, network, GPU, and DMA paths from bypassing MemoryObject provenance or residency rules.
Tie memory-pressure and OOM behavior to explicit budgets and process lifecycle, not allocator surprises.
Convert this contract into host tests, QEMU smokes, Kani models, and review gates.

Non-Goals

No demand paging or swap implementation in this proposal.
No transparent global persistence or EROS-style single-level store.
No copy-on-write clone design.
No sub-VMAR or nested address-region capability in the first version.
No generic userspace pager in the page-fault path.
No change to CPU atomic ordering, language memory models, or Rust aliasing rules.

Memory Authority Classes

The model starts by naming memory by authority and residency, not by how it is convenient for one subsystem to access it.

Class	Authority	Backing	Reclaim/swap	Notes
Reserved anonymous VA	`VirtualMemory` bound to one address space	No frame	Not reclaimable; no physical state	Charges virtual-reservation quota only.
Committed anonymous page	`VirtualMemory` plus frame-grant budget	Private frame	Reclaim/swap only under future explicit policy	`VM_PROT_NONE` is committed but inaccessible.
Borrowed `MemoryObject` mapping	`MemoryObject` cap plus caller address space	Shared object backing	Not phase-1 swap	Mapping pins backing lifetime and charges mapping budget.
Capability transport page	Kernel bootstrap/ring setup	Kernel-owned frame mapped into user VA	Never swap	Ring and CapSet pages are reserved outside caller VM control.
Private kernel metadata	Kernel boot/process/scheduler/cap state	Kernel heap or frames	Never swap	Failure is kernel capacity pressure unless later object funding exists.
Reclaimable clean cache	Store, boot package, file, or loader service	Reconstructable backing	Drop/refetch, not swap	Must have trusted backing identity before reclaim.
Pinned user page	Explicit pin authority or future residency option	Resident frame	Never swap while pinned	Needed for shared park, DMA staging, and secret/locked mappings.
DMA-visible page	`DMAPool`/device-manager authority	Resident frame or IOVA mapping	Never swap; quiesce before free	Device bytes remain untrusted until driver validation.
Secret page	Future secret-memory authority	Resident private frame	Never swap; scrub aggressively	Stronger policy than ordinary pinned memory.
Swapped anonymous page	`VirtualMemory` plus swap budget and slot metadata	Encrypted/authenticated slot	Non-resident until fault	Future phase only; anonymous private pages first.

Two rules follow from this table:

A capability that maps CPU-visible bytes does not automatically grant device visibility, swap eligibility, pin authority, or persistence.
A residency promise is an authority and accounting event, not a flag that a helper may set opportunistically.

State Machines

Anonymous `VirtualMemory`

Anonymous memory follows this lifecycle:

unreserved
reserved
committed inaccessible (VM_PROT_NONE) or committed accessible
decommitted reserved
unreserved

Required properties:

reserve charges virtual reservation and installs no present user PTE.
commit is all-or-nothing: frame allocation, ledger charge, page-table mutation, committed-page metadata, and deferred TLB completion reservation either all become visible or all roll back.
protect changes accessibility only for committed pages; it does not create backing.
decommit releases committed frames and physical charges while preserving virtual reservation.
unmap releases the reservation and any committed backing inside it.
ring and CapSet virtual ranges remain outside caller-controlled VirtualMemory operations.

`MemoryObject`

MemoryObject backing has a separate lifecycle from any one mapping:

allocated by FrameAllocator
held through one or more process cap-table entries
borrowed into one or more address spaces by MemoryObject.map
unmapped or torn down from those address spaces
released after the final cap and final borrowed mapping drop their holds

Required properties:

held MemoryObject caps charge holder frame-grant pages;
each borrowed mapping charges the mapper while it remains mapped;
object-backed pages are tracked separately from anonymous reservations;
unmap/protect must prove the affected range is backed by the same object;
releasing a cap cannot leave uncharged live mapped backing behind;
future direct read/write, slice, clone, file-backed, or DMA-backed subclasses must define how they differ from the current physically backed object.

Pinned and Device-Visible Memory

Pinned memory is not just a mapped page with a refcount. It is a promise that reclaim, swap, and frame reuse will not move or free the page until the pin is released.

The first reusable state machine should be:

resident
pin_reserved
pinned
pin_draining
resident

DMA-visible memory extends this with device ownership:

allocated
mapped_to_device
submitted
quiescing
device_mapping_removed
scrubbed
released

The device state machine must advance generations or epochs before reuse so stale handles, stale interrupts, and stale completions fail closed.

Future Swap

Swap is a later extension of committed anonymous memory:

committed resident
eviction reserved
encrypted slot written and authenticated
committed swapped
faulting restore
committed resident

No phase-1 swap path may include MemoryObject backing, shared IPC pages, ring/CapSet pages, DMA pages, kernel metadata, or secret pages. If swap-in cannot restore data or authenticate the slot, the faulting process exits with a typed OOM or corruption reason; the kernel must not fabricate contents.

Mapping Consistency

The consistency rule is:

A frame may return to the allocator, be exposed to a different authority, or be made device-visible only after every stale CPU and device observer has been invalidated or made irrelevant by generation.

For current CPU mappings this means:

address-space mutation holds the address-space lock while checking and changing metadata;
local page-table changes flush locally before returning to user mode;
remote CPUs that have run the address space receive and acknowledge the required TLB shootdown generation before freed frames can be reused;
kernel upper-half mappings that share a top-level PML4 entry with existing address spaces mutate only shared lower-level tables and use kernel-wide TLB shootdown; a new kernel-half PML4 slot must either be pre-seeded before user address-space creation or fail closed until a synchronized live-root propagation path exists;
cleanup paths reserve deferred completion slots before mutation when they may need to free frames after shootdown;
private ParkSpace waiters are interrupted before an unmapped virtual address can be reused for unrelated state.

For future shared keys, DMA, and swap this means:

shared park keys must be derived from MemoryObject identity and offset, or the backing object must be explicitly pinned for the duration of key derivation and wait registration;
DMA page reuse requires descriptor quiesce, interrupt/completion generation checks, IOMMU or bounce-buffer invalidation, and scrub-before-release;
swapped pages carry slot generation/integrity metadata, and stale slots must not be accepted as current backing.

Accounting Rules

Every state transition must name exactly one ledger of record:

virtual reservation pages: process/address-space VM ledger;
anonymous committed pages: process frame-grant or future memory-commit ledger;
held MemoryObject backing: holder frame-grant ledger;
borrowed MemoryObject mappings: mapper frame-grant ledger plus address-space tracking;
pinned pages: future pin ledger owned by the pinning authority;
DMA pool bytes, DMA buffers, descriptors, MMIO mappings, interrupt holds, and in-flight DMA submissions: device-manager ledger;
swap commitment and swap slots: future swap/budget ledger;
kernel metadata: kernel capacity budget until explicit object funding exists.

Status views, metrics, and audit logs may mirror these values, but they are not allowed to grant or deny resources from stale copies.

Failure Semantics

Capability-call boundaries return controlled failures:

malformed ranges, overflow, unsupported protection, and authority mismatch return deterministic validation errors;
local budget exhaustion returns quota denial or typed OOM;
global pressure may attempt reclaim first if the memory class is eligible;
rollback failure enters a bounded recovery or emergency path rather than publishing half-mutated authority.

Execution faults are process lifecycle events:

access to unreserved memory is an ordinary fault;
access to reserved uncommitted memory is a reservation fault, not implicit demand commit;
access to committed VM_PROT_NONE is a protection fault;
future failed swap-in or failed zero-fill terminates the faulting process, not a random victim.

Boot-time core allocation failures remain boot-fatal until capOS moves kernel object memory under explicit authority.

Proof Obligations

Before a feature depends on this model, it must add or cite the relevant proof:

Area	Required evidence
Frame allocator	Host/Kani proof that allocator metadata frames are never returned, allocation is unique, and free rejects invalid or double-free inputs.
Anonymous VM	Host tests for overlap, split, rollback, quota, `VM_PROT_NONE`, decommit/recommit zero-fill, and process teardown.
TLB/frame reuse	QEMU or targeted instrumentation proving unmap/decommit/protect wait for required shootdown before frame reuse on resident CPUs.
User-buffer copy	Review and tests showing validation and copy/read occur under one address-space stability guarantee.
`MemoryObject` sharing	QEMU proof that mapping, transfer, writes, unmap, release, and teardown preserve backing lifetime and accounting.
Shared park words	Proof that key derivation uses object identity and stale waiters cannot observe reused virtual addresses or object generations.
DMA	QEMU/host proof that stale handles, interrupts, and completions cannot affect new owners or freed buffers.
OOM/quota	Hostile exhaustion tests showing controlled errors, rollback, no leaked frames, and no partial cap publication.
Swap	Future proof for encrypted/authenticated slots, stale-slot rejection, pinned/secret/DMA exclusion, and faulting-process termination.

Kani should stay focused on bounded pure primitives in capos-lib. Hardware-dependent behavior such as TLB shootdown, page faults, and DMA requires QEMU or targeted integration evidence.

Phasing

Phase 0: Contract and Inventory

Land this proposal and the backlog.
Inventory existing memory-state transitions and match them to the classes above.
Identify code paths whose errors currently blur validation failure, quota denial, and global pressure.

Phase 1: Host-Testable Ownership Model

Extract or mirror sparse anonymous reservation behavior into host-testable logic where useful.
Add model tests for reservation split/merge, committed-page bookkeeping, borrowed mapping provenance, and one-ledger-of-record accounting.

Phase 2: Shared-Memory Provenance and Pins

Define MemoryObject mapping identity sufficient for SharedParkSpace, service-owned shared buffers, and zero-copy file/network paths.
Add explicit object pins or mapping pins only where a locked copy/read is not enough.

Phase 3: Runtime Budgets and OOM Normalization

Add spawn-time memory budget policy once the selected milestone sequence reaches resource-profile work.
Normalize allocation failure results at capability boundaries.
Add typed process exit reporting for memory faults and future OOM.

Phase 4: DMA and Swap Extensions

Treat userspace drivers as blocked until DMA owner states, generations, ledgers, and stale-completion proofs exist.
Treat swap as blocked until page classes, slot metadata, encryption, and fault lifecycle are explicit.

Open Questions

Should capOS expose a first-class pinned-memory option on VirtualMemory, or only through narrower future caps such as SharedParkSpace, DMAPool, and SecretMemory?
Should MemoryObject gain direct read/write and slice operations before file/network SharedBuffer APIs, or stay mapping-only until a concrete service needs the API?
Which pure address-space interval logic should move into capos-lib for Kani/host testing without dragging in architecture-specific page-table details?
What is the smallest status surface that exposes reservation, commit, resident, pinned, borrowed, DMA, and swapped counts without creating a second enforcement counter?
How much kernel metadata should remain permanently reserved before capOS adds explicit object-funding authority for kernel allocations?

Bottom Line

capOS should treat memory state as capability authority plus ledgered residency. Anonymous reservations, committed frames, shared objects, pins, DMA pages, and swapped pages need distinct contracts. The near-term path is not to implement swap or a pager; it is to make the authority, accounting, TLB/device-observer, cleanup, and proof rules precise enough that future shared-memory and device work cannot accidentally bypass them.

Keyboard shortcuts

capOS Documentation