Proposal: Memory Authority Model
capOS already has implemented memory management. This proposal defines the missing cross-cutting contract: which capability may create or hold memory, which memory may move or be reclaimed, when a mapping mutation is complete, and which proofs are required before shared-memory, DMA, swap, and language-runtime features build on that substrate.
This is deliberately not a CPU or language memory-ordering model. Atomic ordering, cache coherence, and Rust aliasing rules remain their own topics. This page covers OS memory authority, residency, consistency, and lifecycle.
Related
- Memory Management documents current physical
frames, address spaces, user-buffer validation,
VirtualMemory, andMemoryObject. - Go VirtualMemory Contract freezes the reserve/commit/decommit contract this proposal treats as the current anonymous-memory baseline.
- OOM Handling and Swap owns memory-pressure, OOM, reclaim, and encrypted swap policy.
- Resource Accounting and Quotas owns the one-ledger-of-record and reserve/commit/rollback/release vocabulary.
- DMA Isolation owns device-visible memory, IOMMU, stale DMA, and scrub-before-reuse requirements.
- Park Authority records why shared park words need mapping identity or object pins before they can be safe across processes.
- Memory Authority Model Backlog decomposes the research, design, and proof work behind this proposal.
Problem
The current tree has strong local rules, but they are spread across several documents and implementations:
- anonymous
VirtualMemoryranges separate reservation from physical commit; MemoryObjectcaps can be shared and mapped into multiple address spaces;- user-buffer copies validate and use pointers while holding the address-space lock;
- TLB shootdown is routed through address-space residency masks;
- private ParkSpace cleanup handles anonymous unmap/decommit and explicit
MemoryObject.unmap; - DMA isolation requires resident unswappable pages, generations, quiesce, and scrub-before-reuse;
- OOM policy rejects default overcommit and forbids an ambient OOM killer.
Those rules are individually useful, but future work needs a single answer to questions such as:
- Is this page only reserved, physically committed, resident, pinned, swapped, or device-visible?
- Which cap or ledger is the authority that made it that way?
- Can this backing frame be reclaimed, unmapped, shared, donated, or exposed to a device?
- When is it safe to reuse a frame after unmap, decommit, protect, process exit, revoke, or failed rollback?
- Which proof must exist before a shared park word, NIC ring, block buffer, Store blob, GPU buffer, Go heap arena, or swap slot depends on the rule?
Without a consolidated contract, later features can accidentally treat
MemoryObject, SharedBuffer, DMA pool pages, anonymous VM pages, and swapped
pages as interchangeable. They are not interchangeable authority classes.
Design Grounding
The project grounding for this proposal is:
docs/architecture/memory.mddocs/backlog/go-virtual-memory-contract.mddocs/proposals/oom-and-swap-proposal.mddocs/proposals/resource-accounting-proposal.mddocs/dma-isolation-design.mddocs/architecture/park.mddocs/architecture/scheduling.mddocs/architecture/userspace-runtime.mddocs/proposals/storage-and-naming-proposal.mddocs/proposals/go-runtime-proposal.mddocs/security/verification-workflow.mdREVIEW.mdREVIEW_FINDINGS.md
Relevant research grounding:
docs/research/zircon.mdfor VMO/VMAR separation, mapping authority, and VMO commit/decommit precedent.docs/research/genode.mdfor parent-routed sessions and resource quota discipline.docs/research/sel4.mdfor explicit authority, no ambient allocation authority, and the proof value of small stable object invariants.docs/research/eros-capros-coyotos.mdfor single-level-store and keeper lessons; capOS should not make transparent persistence or implicit paging the baseline by accident.docs/research/llvm-target.mdfor Go/runtime memory hooks that require reserve, commit, decommit, and fault behavior.
Goals
- Define the authoritative state vocabulary for user memory, shared memory, pinned memory, DMA memory, and future swapped memory.
- Make every memory-state transition name the capability and ledger that authorizes it.
- Specify when validation, mapping mutation, TLB shootdown, cleanup, and frame reuse are complete.
- Keep future
SharedBuffer, file, network, GPU, and DMA paths from bypassingMemoryObjectprovenance or residency rules. - Tie memory-pressure and OOM behavior to explicit budgets and process lifecycle, not allocator surprises.
- Convert this contract into host tests, QEMU smokes, Kani models, and review gates.
Non-Goals
- No demand paging or swap implementation in this proposal.
- No transparent global persistence or EROS-style single-level store.
- No copy-on-write clone design.
- No sub-VMAR or nested address-region capability in the first version.
- No generic userspace pager in the page-fault path.
- No change to CPU atomic ordering, language memory models, or Rust aliasing rules.
Memory Authority Classes
The model starts by naming memory by authority and residency, not by how it is convenient for one subsystem to access it.
| Class | Authority | Backing | Reclaim/swap | Notes |
|---|---|---|---|---|
| Reserved anonymous VA | VirtualMemory bound to one address space | No frame | Not reclaimable; no physical state | Charges virtual-reservation quota only. |
| Committed anonymous page | VirtualMemory plus frame-grant budget | Private frame | Reclaim/swap only under future explicit policy | VM_PROT_NONE is committed but inaccessible. |
Borrowed MemoryObject mapping | MemoryObject cap plus caller address space | Shared object backing | Not phase-1 swap | Mapping pins backing lifetime and charges mapping budget. |
| Capability transport page | Kernel bootstrap/ring setup | Kernel-owned frame mapped into user VA | Never swap | Ring and CapSet pages are reserved outside caller VM control. |
| Private kernel metadata | Kernel boot/process/scheduler/cap state | Kernel heap or frames | Never swap | Failure is kernel capacity pressure unless later object funding exists. |
| Reclaimable clean cache | Store, boot package, file, or loader service | Reconstructable backing | Drop/refetch, not swap | Must have trusted backing identity before reclaim. |
| Pinned user page | Explicit pin authority or future residency option | Resident frame | Never swap while pinned | Needed for shared park, DMA staging, and secret/locked mappings. |
| DMA-visible page | DMAPool/device-manager authority | Resident frame or IOVA mapping | Never swap; quiesce before free | Device bytes remain untrusted until driver validation. |
| Secret page | Future secret-memory authority | Resident private frame | Never swap; scrub aggressively | Stronger policy than ordinary pinned memory. |
| Swapped anonymous page | VirtualMemory plus swap budget and slot metadata | Encrypted/authenticated slot | Non-resident until fault | Future phase only; anonymous private pages first. |
Two rules follow from this table:
- A capability that maps CPU-visible bytes does not automatically grant device visibility, swap eligibility, pin authority, or persistence.
- A residency promise is an authority and accounting event, not a flag that a helper may set opportunistically.
State Machines
Anonymous VirtualMemory
Anonymous memory follows this lifecycle:
unreservedreservedcommitted inaccessible(VM_PROT_NONE) orcommitted accessibledecommitted reservedunreserved
Required properties:
reservecharges virtual reservation and installs no present user PTE.commitis all-or-nothing: frame allocation, ledger charge, page-table mutation, committed-page metadata, and deferred TLB completion reservation either all become visible or all roll back.protectchanges accessibility only for committed pages; it does not create backing.decommitreleases committed frames and physical charges while preserving virtual reservation.unmapreleases the reservation and any committed backing inside it.- ring and CapSet virtual ranges remain outside caller-controlled
VirtualMemoryoperations.
MemoryObject
MemoryObject backing has a separate lifecycle from any one mapping:
- allocated by
FrameAllocator - held through one or more process cap-table entries
- borrowed into one or more address spaces by
MemoryObject.map - unmapped or torn down from those address spaces
- released after the final cap and final borrowed mapping drop their holds
Required properties:
- held
MemoryObjectcaps charge holder frame-grant pages; - each borrowed mapping charges the mapper while it remains mapped;
- object-backed pages are tracked separately from anonymous reservations;
- unmap/protect must prove the affected range is backed by the same object;
- releasing a cap cannot leave uncharged live mapped backing behind;
- future direct read/write, slice, clone, file-backed, or DMA-backed subclasses must define how they differ from the current physically backed object.
Pinned and Device-Visible Memory
Pinned memory is not just a mapped page with a refcount. It is a promise that reclaim, swap, and frame reuse will not move or free the page until the pin is released.
The first reusable state machine should be:
residentpin_reservedpinnedpin_drainingresident
DMA-visible memory extends this with device ownership:
allocatedmapped_to_devicesubmittedquiescingdevice_mapping_removedscrubbedreleased
The device state machine must advance generations or epochs before reuse so stale handles, stale interrupts, and stale completions fail closed.
Future Swap
Swap is a later extension of committed anonymous memory:
committed residenteviction reservedencrypted slot written and authenticatedcommitted swappedfaulting restorecommitted resident
No phase-1 swap path may include MemoryObject backing, shared IPC pages,
ring/CapSet pages, DMA pages, kernel metadata, or secret pages. If swap-in
cannot restore data or authenticate the slot, the faulting process exits with
a typed OOM or corruption reason; the kernel must not fabricate contents.
Mapping Consistency
The consistency rule is:
A frame may return to the allocator, be exposed to a different authority, or be made device-visible only after every stale CPU and device observer has been invalidated or made irrelevant by generation.
For current CPU mappings this means:
- address-space mutation holds the address-space lock while checking and changing metadata;
- local page-table changes flush locally before returning to user mode;
- remote CPUs that have run the address space receive and acknowledge the required TLB shootdown generation before freed frames can be reused;
- cleanup paths reserve deferred completion slots before mutation when they may need to free frames after shootdown;
- private ParkSpace waiters are interrupted before an unmapped virtual address can be reused for unrelated state.
For future shared keys, DMA, and swap this means:
- shared park keys must be derived from
MemoryObjectidentity and offset, or the backing object must be explicitly pinned for the duration of key derivation and wait registration; - DMA page reuse requires descriptor quiesce, interrupt/completion generation checks, IOMMU or bounce-buffer invalidation, and scrub-before-release;
- swapped pages carry slot generation/integrity metadata, and stale slots must not be accepted as current backing.
Accounting Rules
Every state transition must name exactly one ledger of record:
- virtual reservation pages: process/address-space VM ledger;
- anonymous committed pages: process frame-grant or future memory-commit ledger;
- held
MemoryObjectbacking: holder frame-grant ledger; - borrowed
MemoryObjectmappings: mapper frame-grant ledger plus address-space tracking; - pinned pages: future pin ledger owned by the pinning authority;
- DMA pool bytes, DMA buffers, descriptors, MMIO mappings, interrupt holds, and in-flight DMA submissions: device-manager ledger;
- swap commitment and swap slots: future swap/budget ledger;
- kernel metadata: kernel capacity budget until explicit object funding exists.
Status views, metrics, and audit logs may mirror these values, but they are not allowed to grant or deny resources from stale copies.
Failure Semantics
Capability-call boundaries return controlled failures:
- malformed ranges, overflow, unsupported protection, and authority mismatch return deterministic validation errors;
- local budget exhaustion returns quota denial or typed OOM;
- global pressure may attempt reclaim first if the memory class is eligible;
- rollback failure enters a bounded recovery or emergency path rather than publishing half-mutated authority.
Execution faults are process lifecycle events:
- access to unreserved memory is an ordinary fault;
- access to reserved uncommitted memory is a reservation fault, not implicit demand commit;
- access to committed
VM_PROT_NONEis a protection fault; - future failed swap-in or failed zero-fill terminates the faulting process, not a random victim.
Boot-time core allocation failures remain boot-fatal until capOS moves kernel object memory under explicit authority.
Proof Obligations
Before a feature depends on this model, it must add or cite the relevant proof:
| Area | Required evidence |
|---|---|
| Frame allocator | Host/Kani proof that allocator metadata frames are never returned, allocation is unique, and free rejects invalid or double-free inputs. |
| Anonymous VM | Host tests for overlap, split, rollback, quota, VM_PROT_NONE, decommit/recommit zero-fill, and process teardown. |
| TLB/frame reuse | QEMU or targeted instrumentation proving unmap/decommit/protect wait for required shootdown before frame reuse on resident CPUs. |
| User-buffer copy | Review and tests showing validation and copy/read occur under one address-space stability guarantee. |
MemoryObject sharing | QEMU proof that mapping, transfer, writes, unmap, release, and teardown preserve backing lifetime and accounting. |
| Shared park words | Proof that key derivation uses object identity and stale waiters cannot observe reused virtual addresses or object generations. |
| DMA | QEMU/host proof that stale handles, interrupts, and completions cannot affect new owners or freed buffers. |
| OOM/quota | Hostile exhaustion tests showing controlled errors, rollback, no leaked frames, and no partial cap publication. |
| Swap | Future proof for encrypted/authenticated slots, stale-slot rejection, pinned/secret/DMA exclusion, and faulting-process termination. |
Kani should stay focused on bounded pure primitives in capos-lib.
Hardware-dependent behavior such as TLB shootdown, page faults, and DMA
requires QEMU or targeted integration evidence.
Phasing
Phase 0: Contract and Inventory
- Land this proposal and the backlog.
- Inventory existing memory-state transitions and match them to the classes above.
- Identify code paths whose errors currently blur validation failure, quota denial, and global pressure.
Phase 1: Host-Testable Ownership Model
- Extract or mirror sparse anonymous reservation behavior into host-testable logic where useful.
- Add model tests for reservation split/merge, committed-page bookkeeping, borrowed mapping provenance, and one-ledger-of-record accounting.
Phase 2: Shared-Memory Provenance and Pins
- Define
MemoryObjectmapping identity sufficient forSharedParkSpace, service-owned shared buffers, and zero-copy file/network paths. - Add explicit object pins or mapping pins only where a locked copy/read is not enough.
Phase 3: Runtime Budgets and OOM Normalization
- Add spawn-time memory budget policy once the selected milestone sequence reaches resource-profile work.
- Normalize allocation failure results at capability boundaries.
- Add typed process exit reporting for memory faults and future OOM.
Phase 4: DMA and Swap Extensions
- Treat userspace drivers as blocked until DMA owner states, generations, ledgers, and stale-completion proofs exist.
- Treat swap as blocked until page classes, slot metadata, encryption, and fault lifecycle are explicit.
Open Questions
- Should capOS expose a first-class pinned-memory option on
VirtualMemory, or only through narrower future caps such asSharedParkSpace,DMAPool, andSecretMemory? - Should
MemoryObjectgain direct read/write and slice operations before file/networkSharedBufferAPIs, or stay mapping-only until a concrete service needs the API? - Which pure address-space interval logic should move into
capos-libfor Kani/host testing without dragging in architecture-specific page-table details? - What is the smallest status surface that exposes reservation, commit, resident, pinned, borrowed, DMA, and swapped counts without creating a second enforcement counter?
- How much kernel metadata should remain permanently reserved before capOS adds explicit object-funding authority for kernel allocations?
Bottom Line
capOS should treat memory state as capability authority plus ledgered residency. Anonymous reservations, committed frames, shared objects, pins, DMA pages, and swapped pages need distinct contracts. The near-term path is not to implement swap or a pager; it is to make the authority, accounting, TLB/device-observer, cleanup, and proof rules precise enough that future shared-memory and device work cannot accidentally bypass them.