# Memory Management

Memory management gives the kernel controlled ownership of physical frames,
separates user processes, enforces page permissions, and exposes memory
authority only through explicit capabilities.


## Related

- [Go VirtualMemory Contract](../backlog/go-virtual-memory-contract.md)
  records the reserve/commit/decommit contract that extended the
  `VirtualMemory` implementation for Go-style arena allocation.
- [Userspace Runtime](userspace-runtime.md) describes the typed
  `VirtualMemoryClient` surface used by allocator and runtime code.
- [OOM Handling and Swap](../proposals/oom-and-swap-proposal.md) and
  [Resource Accounting and Quotas](../proposals/resource-accounting-proposal.md)
  define future memory-pressure and quota policy.
- [Memory Authority Model](../proposals/memory-authority-model-proposal.md)
  defines the future cross-cutting contract for memory authority classes,
  residency, mapping consistency, pins, DMA boundaries, swap eligibility, and
  proof obligations.

## Current Behavior

The frame allocator builds a bitmap from the Limine memory map, marks all
non-usable frames as used, reserves frame zero, and reserves its own bitmap
frames. The heap is initialized separately for kernel allocation.

Paging initialization builds a new kernel PML4, remaps kernel sections with
section-specific permissions, copies upper-half mappings with NX applied and
user access stripped, switches CR3, then enables page-global support. SMEP/SMAP
are enabled after those mappings are active.

Each user `AddressSpace` owns its lower-half page tables and clones the
kernel's upper-half mappings. Dropping an address space walks the user half and
frees mapped frames, committed anonymous frames retained behind `VM_PROT_NONE`,
and page-table frames. `VirtualMemory` lets a process reserve anonymous
address ranges, commit and decommit physical backing, unmap reservations, and
protect committed pages. Anonymous reservations charge the process virtual
reservation ledger. Committed anonymous pages charge
`ResourceLedger::frame_grant_pages`.

`FrameAllocator` allocation methods return a `MemoryObject` result capability,
not a physical address. The normal result payload carries the result-cap index,
and the CQE transfer-result record carries the local cap id plus
`MemoryObject` interface id. `MemoryObject.info` exposes page count and size;
`MemoryObject.map` maps page-aligned object ranges into the caller address
space, `MemoryObject.unmap` removes those borrowed mappings, and
`MemoryObject.protect` updates their page-table flags. Held `MemoryObject` caps
charge the holder's `frame_grant_pages` ledger, and final `CAP_OP_RELEASE` or
process exit frees the owned frames once no borrowed address-space mapping still
holds the backing alive.

## Design

The kernel keeps physical allocation host-testable by placing bitmap logic in
`capos-lib` and wrapping it with kernel HHDM access in `kernel/src/mem/frame.rs`.
Page-table manipulation stays in the kernel because it is architecture-specific.

ELF loading and `VirtualMemory` both use page-table flags to preserve W^X:
non-executable data gets NX, writable mappings are explicit, and userspace
pages must be `USER_ACCESSIBLE`. The CapSet and ring bootstrap pages occupy
reserved virtual pages; `VirtualMemory` rejects ranges that overlap either one.

User-buffer validation for process-owned buffers uses the process
`AddressSpace` mutex. The kernel checks that user pointers stay below the user
address limit, verifies page-table permissions for the requested read/write
access, and copies through the HHDM mapping while holding the same address-space
lock. This keeps validation and use tied to one stable page-table view. The
legacy current-CR3 validator remains only for callers that already provide an
equivalent page-table stability guarantee.

Committed `VirtualMemory` pages and held `MemoryObject` caps use the same
per-process frame-grant ledger, with quota checks before frame allocation or
mapping side effects. Anonymous reservation consumes a separate virtual page
quota, so guard ranges and Go-style `sysReserve` arenas do not spend physical
commit budget. Held MemoryObject caps charge for the backing they keep
reachable, and each live borrowed MemoryObject mapping reserves frame-grant
pages until it is unmapped. This prevents a process from mapping an object,
releasing the cap to drop the cap-slot charge, and keeping the backing pinned
without quota. The address space records borrowed pages separately from sparse
anonymous reservations so teardown and unmap can distinguish anonymous pages
from object-backed pages. Future file/network/DMA resources should reuse that
authority ledger instead of adding one-off counters per cap.

## Invariants

- Frame addresses are 4 KiB aligned.
- The frame bitmap's own frames are never returned as free frames.
- Upper-half kernel mappings are not user-accessible.
- Kernel text is RX, rodata is read-only NX, and data/bss are RW NX.
- User address spaces own only lower-half page-table frames.
- Process frame-grant usage covers committed anonymous VM pages, held
  `MemoryObject` caps, and live borrowed `MemoryObject` mappings.
- Process virtual-reservation usage covers reserved anonymous VM pages whether
  or not they are committed.
- Committed `VM_PROT_NONE` pages retain their frames and data while exposing no
  present user PTE; reserved uncommitted pages consume no frame-grant quota.
- Object-backed user mappings are tracked as borrowed pages and hold the
  `MemoryObject` backing alive until unmapped or address-space teardown.
- `MemoryObject` unmap/protect only succeeds for borrowed pages backed by the
  same object.
- `VirtualMemory` caps are bound to one address space and are not valid
  cross-process service exports.
- CapSet is read-only/no-execute; ring is writable/no-execute.
- `VirtualMemory` cannot reserve, map, commit, decommit, unmap, or protect the
  ring or CapSet pages.
- `VirtualMemory` commit/decommit/protect/unmap only succeeds for ranges
  covered by anonymous reservations owned by the cap's address space.
- Capability-ring CALL/RECV/RETURN buffers, transfer descriptors, process and
  thread wait completions, and private ParkSpace word reads must validate and
  copy/read while holding the target process `AddressSpace` lock.

## Code Map

- `capos-lib/src/frame_bitmap.rs` - host-testable physical frame bitmap core.
- `capos-lib/src/cap_table.rs` - capability holds and per-process
  `ResourceLedger` frame-grant accounting.
- `capos-lib/src/frame_ledger.rs` - bounded frame-grant helper retained for
  host tests.
- `kernel/src/mem/frame.rs` - Limine memory-map integration and global frame
  allocator wrapper.
- `kernel/src/mem/heap.rs` - kernel heap setup.
- `kernel/src/mem/paging.rs` - kernel remap, `AddressSpace`, page mapping,
  VM-cap page tracking, user copy helpers.
- `kernel/src/mem/validate.rs` - user-address bounds and legacy current-CR3
  validation helper.
- `kernel/src/cap/frame_alloc.rs` - FrameAllocator capability and cleanup.
- `demos/memoryobject-shared-parent/` and
  `demos/memoryobject-shared-child/` - QEMU shared MemoryObject smoke.
- `tools/qemu-memoryobject-shared-smoke.sh` - transcript checks for the shared
  MemoryObject smoke.
- `kernel/src/cap/virtual_memory.rs` - VirtualMemory capability.
- `kernel/src/spawn.rs` - ELF, stack, and TLS user mappings.
- `kernel/src/arch/x86_64/smap.rs` - SMEP/SMAP setup and legacy direct user
  access guard.

## Validation

- `cargo test-lib` covers frame bitmap, frame ledger, ELF parser, and cap-table
  pure logic.
- `cargo miri-lib` runs host-testable `capos-lib` tests under Miri when
  installed.
- `make kani-lib` proves the bounded mandatory frame-bitmap, stale-handle,
  cap-slot/frame-grant accounting, and transfer preflight fail-closed
  invariants when Kani is installed.
- `make run-smoke` validates ELF mapping, process teardown, TLS, and clean
  shell-led halt.
- `make run-spawn` validates MemoryObject-backed FrameAllocator cleanup,
  VirtualMemory reserve/commit/decommit/`VM_PROT_NONE`/quota/release smoke, and
  runtime spawn checks.
- `make run-memoryobject-shared` validates a parent allocating and mapping a
  MemoryObject, transferring it to a child, observing a child write through the
  same backing pages, unmapping both sides, and halting cleanly.
- `make run-ipc-zerocopy` validates the multi-message shared point-to-point
  buffer pattern at the substrate level: a producer transfers one
  `MemoryObject` to the consumer and then exchanges four record payloads
  through the shared mapping while endpoint CALLs carry only sequence numbers
  and checksums. This is a substrate proof, not the production data-plane
  shape: typed `SharedBuffer` with explicit producer/consumer ring metadata,
  notification primitives, and consuming service APIs (`File.readBuf`,
  `BlockDevice.readBlocks`, NIC RX/TX rings) are tracked under Open Work.
- `make run-spawn` validates ELF load failure rollback and frame exhaustion
  handling through `ProcessSpawner`.

## Open Work

- Extend frame-grant accounting only if future DMA pinning or service-owned
  shared-buffer pools need authority beyond held MemoryObject caps and live
  borrowed mappings.
- Define page-pinning or mapping-identity rules for future shared WaitSet,
  DMA, and service-owned shared-buffer paths that must keep physical backing
  stable beyond a single locked copy/read. The owning planning track is
  [Memory Authority Model](../backlog/memory-authority-model.md).
- Add file, block, network, and DMA service APIs that use MemoryObject-backed
  SharedBuffer caps for zero-copy data paths.
- Add DMA isolation and device memory capability boundaries before userspace
  drivers.
- Add huge-page handling only with explicit ownership and teardown rules.
