# Proposal: Memory Authority Model

capOS already has implemented memory management. This proposal defines the
missing cross-cutting contract: which capability may create or hold memory,
which memory may move or be reclaimed, when a mapping mutation is complete, and
which proofs are required before shared-memory, DMA, swap, and language-runtime
features build on that substrate.

This is deliberately not a CPU or language memory-ordering model. Atomic
ordering, cache coherence, and Rust aliasing rules remain their own topics.
This page covers OS memory authority, residency, consistency, and lifecycle.


## Related

- [Memory Management](../architecture/memory.md) documents current physical
  frames, address spaces, user-buffer validation, `VirtualMemory`, and
  `MemoryObject`.
- [Go VirtualMemory Contract](../backlog/go-virtual-memory-contract.md)
  freezes the reserve/commit/decommit contract this proposal treats as the
  current anonymous-memory baseline.
- [OOM Handling and Swap](oom-and-swap-proposal.md) owns memory-pressure,
  OOM, reclaim, and encrypted swap policy.
- [Resource Accounting and Quotas](resource-accounting-proposal.md) owns the
  one-ledger-of-record and reserve/commit/rollback/release vocabulary.
- [DMA Isolation](../dma-isolation-design.md) owns device-visible memory,
  IOMMU, stale DMA, and scrub-before-reuse requirements.
- [Park Authority](../architecture/park.md) records why shared park words need
  mapping identity or object pins before they can be safe across processes.
- [Memory Authority Model Backlog](../backlog/memory-authority-model.md)
  decomposes the research, design, and proof work behind this proposal.


## Problem

The current tree has strong local rules, but they are spread across several
documents and implementations:

- anonymous `VirtualMemory` ranges separate reservation from physical commit;
- `MemoryObject` caps can be shared and mapped into multiple address spaces;
- user-buffer copies validate and use pointers while holding the address-space
  lock;
- TLB shootdown is routed through address-space residency masks;
- private ParkSpace cleanup handles anonymous unmap/decommit and explicit
  `MemoryObject.unmap`;
- DMA isolation requires resident unswappable pages, generations, quiesce, and
  scrub-before-reuse;
- OOM policy rejects default overcommit and forbids an ambient OOM killer.

Those rules are individually useful, but future work needs a single answer to
questions such as:

- Is this page only reserved, physically committed, resident, pinned, swapped,
  or device-visible?
- Which cap or ledger is the authority that made it that way?
- Can this backing frame be reclaimed, unmapped, shared, donated, or exposed to
  a device?
- When is it safe to reuse a frame after unmap, decommit, protect, process
  exit, revoke, or failed rollback?
- Which proof must exist before a shared park word, NIC ring, block buffer,
  Store blob, GPU buffer, Go heap arena, or swap slot depends on the rule?

Without a consolidated contract, later features can accidentally treat
`MemoryObject`, `SharedBuffer`, DMA pool pages, anonymous VM pages, and swapped
pages as interchangeable. They are not interchangeable authority classes.


## Design Grounding

The project grounding for this proposal is:

- `docs/architecture/memory.md`
- `docs/backlog/go-virtual-memory-contract.md`
- `docs/proposals/oom-and-swap-proposal.md`
- `docs/proposals/resource-accounting-proposal.md`
- `docs/dma-isolation-design.md`
- `docs/architecture/park.md`
- `docs/architecture/scheduling.md`
- `docs/architecture/userspace-runtime.md`
- `docs/proposals/storage-and-naming-proposal.md`
- `docs/proposals/go-runtime-proposal.md`
- `docs/security/verification-workflow.md`
- `REVIEW.md`
- `REVIEW_FINDINGS.md`

Relevant research grounding:

- `docs/research/zircon.md` for VMO/VMAR separation, mapping authority, and
  VMO commit/decommit precedent.
- `docs/research/genode.md` for parent-routed sessions and resource quota
  discipline.
- `docs/research/sel4.md` for explicit authority, no ambient allocation
  authority, and the proof value of small stable object invariants.
- `docs/research/eros-capros-coyotos.md` for single-level-store and keeper
  lessons; capOS should not make transparent persistence or implicit paging the
  baseline by accident.
- `docs/research/llvm-target.md` for Go/runtime memory hooks that require
  reserve, commit, decommit, and fault behavior.


## Goals

1. Define the authoritative state vocabulary for user memory, shared memory,
   pinned memory, DMA memory, and future swapped memory.
2. Make every memory-state transition name the capability and ledger that
   authorizes it.
3. Specify when validation, mapping mutation, TLB shootdown, cleanup, and frame
   reuse are complete.
4. Keep future `SharedBuffer`, file, network, GPU, and DMA paths from bypassing
   `MemoryObject` provenance or residency rules.
5. Tie memory-pressure and OOM behavior to explicit budgets and process
   lifecycle, not allocator surprises.
6. Convert this contract into host tests, QEMU smokes, Kani models, and review
   gates.


## Non-Goals

- No demand paging or swap implementation in this proposal.
- No transparent global persistence or EROS-style single-level store.
- No copy-on-write clone design.
- No sub-VMAR or nested address-region capability in the first version.
- No generic userspace pager in the page-fault path.
- No change to CPU atomic ordering, language memory models, or Rust aliasing
  rules.


## Memory Authority Classes

The model starts by naming memory by authority and residency, not by how it is
convenient for one subsystem to access it.

| Class | Authority | Backing | Reclaim/swap | Notes |
| --- | --- | --- | --- | --- |
| Reserved anonymous VA | `VirtualMemory` bound to one address space | No frame | Not reclaimable; no physical state | Charges virtual-reservation quota only. |
| Committed anonymous page | `VirtualMemory` plus frame-grant budget | Private frame | Reclaim/swap only under future explicit policy | `VM_PROT_NONE` is committed but inaccessible. |
| Borrowed `MemoryObject` mapping | `MemoryObject` cap plus caller address space | Shared object backing | Not phase-1 swap | Mapping pins backing lifetime and charges mapping budget. |
| Capability transport page | Kernel bootstrap/ring setup | Kernel-owned frame mapped into user VA | Never swap | Ring and CapSet pages are reserved outside caller VM control. |
| Private kernel metadata | Kernel boot/process/scheduler/cap state | Kernel heap or frames | Never swap | Failure is kernel capacity pressure unless later object funding exists. |
| Reclaimable clean cache | Store, boot package, file, or loader service | Reconstructable backing | Drop/refetch, not swap | Must have trusted backing identity before reclaim. |
| Pinned user page | Explicit pin authority or future residency option | Resident frame | Never swap while pinned | Needed for shared park, DMA staging, and secret/locked mappings. |
| DMA-visible page | `DMAPool`/device-manager authority | Resident frame or IOVA mapping | Never swap; quiesce before free | Device bytes remain untrusted until driver validation. |
| Secret page | Future secret-memory authority | Resident private frame | Never swap; scrub aggressively | Stronger policy than ordinary pinned memory. |
| Swapped anonymous page | `VirtualMemory` plus swap budget and slot metadata | Encrypted/authenticated slot | Non-resident until fault | Future phase only; anonymous private pages first. |

Two rules follow from this table:

- A capability that maps CPU-visible bytes does not automatically grant device
  visibility, swap eligibility, pin authority, or persistence.
- A residency promise is an authority and accounting event, not a flag that a
  helper may set opportunistically.


## State Machines

### Anonymous `VirtualMemory`

Anonymous memory follows this lifecycle:

1. `unreserved`
2. `reserved`
3. `committed inaccessible` (`VM_PROT_NONE`) or `committed accessible`
4. `decommitted reserved`
5. `unreserved`

Required properties:

- `reserve` charges virtual reservation and installs no present user PTE.
- `commit` is all-or-nothing: frame allocation, ledger charge, page-table
  mutation, committed-page metadata, and deferred TLB completion reservation
  either all become visible or all roll back.
- `protect` changes accessibility only for committed pages; it does not create
  backing.
- `decommit` releases committed frames and physical charges while preserving
  virtual reservation.
- `unmap` releases the reservation and any committed backing inside it.
- ring and CapSet virtual ranges remain outside caller-controlled
  `VirtualMemory` operations.

### `MemoryObject`

`MemoryObject` backing has a separate lifecycle from any one mapping:

1. allocated by `FrameAllocator`
2. held through one or more process cap-table entries
3. borrowed into one or more address spaces by `MemoryObject.map`
4. unmapped or torn down from those address spaces
5. released after the final cap and final borrowed mapping drop their holds

Required properties:

- held `MemoryObject` caps charge holder frame-grant pages;
- each borrowed mapping charges the mapper while it remains mapped;
- object-backed pages are tracked separately from anonymous reservations;
- unmap/protect must prove the affected range is backed by the same object;
- releasing a cap cannot leave uncharged live mapped backing behind;
- future direct read/write, slice, clone, file-backed, or DMA-backed subclasses
  must define how they differ from the current physically backed object.

### Pinned and Device-Visible Memory

Pinned memory is not just a mapped page with a refcount. It is a promise that
reclaim, swap, and frame reuse will not move or free the page until the pin is
released.

The first reusable state machine should be:

1. `resident`
2. `pin_reserved`
3. `pinned`
4. `pin_draining`
5. `resident`

DMA-visible memory extends this with device ownership:

1. `allocated`
2. `mapped_to_device`
3. `submitted`
4. `quiescing`
5. `device_mapping_removed`
6. `scrubbed`
7. `released`

The device state machine must advance generations or epochs before reuse so
stale handles, stale interrupts, and stale completions fail closed.

### Future Swap

Swap is a later extension of committed anonymous memory:

1. `committed resident`
2. `eviction reserved`
3. `encrypted slot written and authenticated`
4. `committed swapped`
5. `faulting restore`
6. `committed resident`

No phase-1 swap path may include `MemoryObject` backing, shared IPC pages,
ring/CapSet pages, DMA pages, kernel metadata, or secret pages. If swap-in
cannot restore data or authenticate the slot, the faulting process exits with
a typed OOM or corruption reason; the kernel must not fabricate contents.


## Mapping Consistency

The consistency rule is:

> A frame may return to the allocator, be exposed to a different authority, or
> be made device-visible only after every stale CPU and device observer has
> been invalidated or made irrelevant by generation.

For current CPU mappings this means:

- address-space mutation holds the address-space lock while checking and
  changing metadata;
- local page-table changes flush locally before returning to user mode;
- remote CPUs that have run the address space receive and acknowledge the
  required TLB shootdown generation before freed frames can be reused;
- cleanup paths reserve deferred completion slots before mutation when they
  may need to free frames after shootdown;
- private ParkSpace waiters are interrupted before an unmapped virtual address
  can be reused for unrelated state.

For future shared keys, DMA, and swap this means:

- shared park keys must be derived from `MemoryObject` identity and offset, or
  the backing object must be explicitly pinned for the duration of key
  derivation and wait registration;
- DMA page reuse requires descriptor quiesce, interrupt/completion generation
  checks, IOMMU or bounce-buffer invalidation, and scrub-before-release;
- swapped pages carry slot generation/integrity metadata, and stale slots must
  not be accepted as current backing.


## Accounting Rules

Every state transition must name exactly one ledger of record:

- virtual reservation pages: process/address-space VM ledger;
- anonymous committed pages: process frame-grant or future memory-commit
  ledger;
- held `MemoryObject` backing: holder frame-grant ledger;
- borrowed `MemoryObject` mappings: mapper frame-grant ledger plus
  address-space tracking;
- pinned pages: future pin ledger owned by the pinning authority;
- DMA pool bytes, DMA buffers, descriptors, MMIO mappings, interrupt holds,
  and in-flight DMA submissions: device-manager ledger;
- swap commitment and swap slots: future swap/budget ledger;
- kernel metadata: kernel capacity budget until explicit object funding exists.

Status views, metrics, and audit logs may mirror these values, but they are not
allowed to grant or deny resources from stale copies.


## Failure Semantics

Capability-call boundaries return controlled failures:

- malformed ranges, overflow, unsupported protection, and authority mismatch
  return deterministic validation errors;
- local budget exhaustion returns quota denial or typed OOM;
- global pressure may attempt reclaim first if the memory class is eligible;
- rollback failure enters a bounded recovery or emergency path rather than
  publishing half-mutated authority.

Execution faults are process lifecycle events:

- access to unreserved memory is an ordinary fault;
- access to reserved uncommitted memory is a reservation fault, not implicit
  demand commit;
- access to committed `VM_PROT_NONE` is a protection fault;
- future failed swap-in or failed zero-fill terminates the faulting process,
  not a random victim.

Boot-time core allocation failures remain boot-fatal until capOS moves kernel
object memory under explicit authority.


## Proof Obligations

Before a feature depends on this model, it must add or cite the relevant proof:

| Area | Required evidence |
| --- | --- |
| Frame allocator | Host/Kani proof that allocator metadata frames are never returned, allocation is unique, and free rejects invalid or double-free inputs. |
| Anonymous VM | Host tests for overlap, split, rollback, quota, `VM_PROT_NONE`, decommit/recommit zero-fill, and process teardown. |
| TLB/frame reuse | QEMU or targeted instrumentation proving unmap/decommit/protect wait for required shootdown before frame reuse on resident CPUs. |
| User-buffer copy | Review and tests showing validation and copy/read occur under one address-space stability guarantee. |
| `MemoryObject` sharing | QEMU proof that mapping, transfer, writes, unmap, release, and teardown preserve backing lifetime and accounting. |
| Shared park words | Proof that key derivation uses object identity and stale waiters cannot observe reused virtual addresses or object generations. |
| DMA | QEMU/host proof that stale handles, interrupts, and completions cannot affect new owners or freed buffers. |
| OOM/quota | Hostile exhaustion tests showing controlled errors, rollback, no leaked frames, and no partial cap publication. |
| Swap | Future proof for encrypted/authenticated slots, stale-slot rejection, pinned/secret/DMA exclusion, and faulting-process termination. |

Kani should stay focused on bounded pure primitives in `capos-lib`.
Hardware-dependent behavior such as TLB shootdown, page faults, and DMA
requires QEMU or targeted integration evidence.


## Phasing

### Phase 0: Contract and Inventory

- Land this proposal and the backlog.
- Inventory existing memory-state transitions and match them to the classes
  above.
- Identify code paths whose errors currently blur validation failure,
  quota denial, and global pressure.

### Phase 1: Host-Testable Ownership Model

- Extract or mirror sparse anonymous reservation behavior into host-testable
  logic where useful.
- Add model tests for reservation split/merge, committed-page bookkeeping,
  borrowed mapping provenance, and one-ledger-of-record accounting.

### Phase 2: Shared-Memory Provenance and Pins

- Define `MemoryObject` mapping identity sufficient for `SharedParkSpace`,
  service-owned shared buffers, and zero-copy file/network paths.
- Add explicit object pins or mapping pins only where a locked copy/read is not
  enough.

### Phase 3: Runtime Budgets and OOM Normalization

- Add spawn-time memory budget policy once the selected milestone sequence
  reaches resource-profile work.
- Normalize allocation failure results at capability boundaries.
- Add typed process exit reporting for memory faults and future OOM.

### Phase 4: DMA and Swap Extensions

- Treat userspace drivers as blocked until DMA owner states, generations,
  ledgers, and stale-completion proofs exist.
- Treat swap as blocked until page classes, slot metadata, encryption, and
  fault lifecycle are explicit.


## Open Questions

1. Should capOS expose a first-class pinned-memory option on `VirtualMemory`, or
   only through narrower future caps such as `SharedParkSpace`, `DMAPool`, and
   `SecretMemory`?
2. Should `MemoryObject` gain direct read/write and slice operations before
   file/network `SharedBuffer` APIs, or stay mapping-only until a concrete
   service needs the API?
3. Which pure address-space interval logic should move into `capos-lib` for
   Kani/host testing without dragging in architecture-specific page-table
   details?
4. What is the smallest status surface that exposes reservation, commit,
   resident, pinned, borrowed, DMA, and swapped counts without creating a second
   enforcement counter?
5. How much kernel metadata should remain permanently reserved before capOS
   adds explicit object-funding authority for kernel allocations?


## Bottom Line

capOS should treat memory state as capability authority plus ledgered
residency. Anonymous reservations, committed frames, shared objects, pins, DMA
pages, and swapped pages need distinct contracts. The near-term path is not to
implement swap or a pager; it is to make the authority, accounting,
TLB/device-observer, cleanup, and proof rules precise enough that future
shared-memory and device work cannot accidentally bypass them.
