Proposal: Resource Accounting and Quotas
Cross-cutting resource profiles, ledgers, reservation semantics, and verification gates for bounded capOS sessions, services, drivers, storage, networking, tests, and future language runtimes.
Related
- Authority Accounting records the current transfer and resource-accounting invariants.
- Memory Management documents the current frame-grant and MemoryObject accounting baseline.
- Go VirtualMemory Contract provides the first concrete virtual-reservation versus physical-commit ledger split for a future language runtime.
Problem
capOS already has several resource limits: cap slots, frame grants, timer waiters, thread and kernel-stack quotas, ring scratch, and spawn preflight checks. Those are useful but fragmented. Local accounts, guests, anonymous callers, external sessions, service accounts, drivers, storage services, network stacks, tests, and future runtimes all need the same rule:
No workload receives implicit unlimited consumption of finite system resources.
This proposal defines the common model. It extends the Stage S.9 authority
graph and per-process ResourceLedger design rather than replacing it.
Principles
- A ResourceProfile is a policy template, not authority.
- Actual enforcement happens through ledgers, capability wrappers, brokers, supervisors, and kernel/resource-service admission checks.
- Every resource class has one ledger of record. Mirrors for status, metrics, or audit are derived views and must not be used for enforcement.
- Reservation happens before side effects. Commit publishes the resource. Release and rollback are mandatory on all success, failure, timeout, revocation, and process-exit paths.
- Identity metadata selects policy. It never consumes, releases, or bypasses quota by itself.
- Quota donation is explicit. A caller may donate budget to a service call, but a service cannot silently spend the caller’s unrelated budget.
Resource Profiles
Resource profiles are named templates selected by account records, manifest seed data, service policy, external admission rules, or test manifests. A profile should contain policy intent, not raw authority:
```capnp
struct ResourceProfile {
  profileId @0 :Data;
  versionId @1 :Data;
  epoch @2 :UInt64;
  homeQuotaBytes @3 :UInt64;
  tempQuotaBytes @4 :UInt64;
  processLimit @5 :UInt32;
  threadLimit @6 :UInt32;
  capLimit @7 :UInt32;
  memoryCommitLimitBytes @8 :UInt64;
  frameGrantLimitPages @9 :UInt64;
  memoryVirtualReservationLimitBytes @20 :UInt64;
  endpointQueueLimit @10 :UInt32;
  inFlightCallLimit @11 :UInt32;
  pendingIpcSubmissionLimit @12 :UInt32;
  ringScratchLimitBytes @13 :UInt64;
  logQuotaBytesPerWindow @14 :UInt64;
  networkProfile @15 :Text;
  cpuBudgetUsPerWindow @16 :UInt64;
  cpuWindowUs @17 :UInt64;
  timerWaiterLimit @18 :UInt32;
  launcherProfile @19 :Text;
}
```
The profile is evaluated by a broker or supervisor. The result is a set of ledger limits, wrapper caps, service-specific budgets, and spawn constraints. Changing a profile does not change a running workload until a trusted service issues new limits, revokes old caps, or starts a replacement workload.
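As a rough illustration of why a profile change does not reach a running workload, the translation step can be pictured as copying policy intent into a snapshot. The following is a minimal sketch, not the capOS broker: `IssuedLimits` and `translate` are hypothetical names, and only a few profile fields are modeled.

```rust
/// Hypothetical subset of the ResourceProfile schema: policy intent only.
#[derive(Clone, Debug)]
pub struct ResourceProfile {
    pub version_id: Vec<u8>,
    pub epoch: u64,
    pub process_limit: u32,
    pub thread_limit: u32,
    pub cap_limit: u32,
}

/// Hypothetical snapshot of limits handed to one workload's ledgers.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct IssuedLimits {
    pub profile_version: Vec<u8>,
    pub policy_epoch: u64,
    pub process_limit: u32,
    pub thread_limit: u32,
    pub cap_limit: u32,
}

/// Broker-side translation. The result is a copy, so later edits to the
/// profile cannot reach a workload until the broker issues new limits.
pub fn translate(profile: &ResourceProfile) -> IssuedLimits {
    IssuedLimits {
        profile_version: profile.version_id.clone(),
        policy_epoch: profile.epoch,
        process_limit: profile.process_limit,
        thread_limit: profile.thread_limit,
        cap_limit: profile.cap_limit,
    }
}
```

Recording the version and epoch in the snapshot is what would let a ledger owner later reject a stale profile version without consulting the profile itself.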
Ledgers of Record
The ledger of record depends on the resource owner:
| Resource | Ledger of record |
|---|---|
| Capability slots | Process CapTable / process resource ledger |
| Processes and child subtrees | Supervisor or ProcessSpawner ledger |
| Threads and kernel stacks | Process-owned thread/kernel-stack ledger |
| Anonymous virtual reservations | Address-space or VM service reservation ledger |
| Anonymous committed memory | Address-space or VM service ledger |
| Physical frames and frame grants | Frame allocator / holder ledger |
| MemoryObject mappings | Per-process frame-grant ledger plus address-space tracking |
| Endpoint queues | Endpoint object ledger |
| In-flight calls and result caps | Caller/callee transport ledger |
| Pending IPC submissions | Process ring/resource ledger |
| Ring scratch and request buffers | Process ring/resource ledger |
| Timer sleeps and waiters | Timer service waiter ledger |
| Log bytes | Log service token bucket / retention ledger |
| Storage bytes and namespace entries | Store/Namespace service ledger |
| Temporary, cache, and home storage | Store/Namespace scoped sub-ledgers |
| Network listeners, sockets, and bytes | Network service or socket cap ledger |
| CPU share and runtime budget | Scheduler or scheduling-context ledger |
| DMA pool bytes, DMA buffer count, descriptor/ring depth, MMIO mappings, interrupt holds, in-flight DMA submissions | Device-manager ledgers, later |
| Model tokens, provider calls, tool calls | Provider/agent gateway ledgers, later |
No second module should maintain an independent enforcement counter for the same resource. A status service may cache values for display only if it treats the ledger owner as authoritative and never grants or rejects based on stale cache state.
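One way to keep a status mirror from turning into a second enforcement counter is to give it a type that cannot grant or deny anything. The sketch below assumes hypothetical `Ledger` and `StatusSnapshot` types; it is illustrative only.

```rust
/// Authoritative counter for one resource class (hypothetical type).
#[derive(Debug)]
pub struct Ledger {
    limit: u64,
    used: u64,
}

/// Display-only copy. It has no method that admits or rejects a request.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct StatusSnapshot {
    pub limit: u64,
    pub used: u64,
}

impl Ledger {
    pub fn new(limit: u64) -> Self {
        Ledger { limit, used: 0 }
    }

    /// The only admission check: decided against live ledger state.
    pub fn try_charge(&mut self, amount: u64) -> bool {
        match self.used.checked_add(amount) {
            Some(total) if total <= self.limit => {
                self.used = total;
                true
            }
            _ => false,
        }
    }

    /// Derived view for status, metrics, or audit.
    pub fn snapshot(&self) -> StatusSnapshot {
        StatusSnapshot { limit: self.limit, used: self.used }
    }
}
```

A snapshot taken before a charge simply goes stale; because admission always runs through `try_charge`, the stale copy cannot cause an incorrect grant.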
Relationship To Tickless And Realtime CPU Authority
The CPU terms in Tickless and Realtime Scheduling reuse this resource-accounting model:
- ResourceProfile.cpuBudgetUsPerWindow: coarse policy template only. Selecting a profile does not mint executable CPU-time authority.
- ResourceLedger CPU budget: coarse best-effort accounting before realtime contexts exist, and the ledger of record for non-realtime CPU share/runtime limits.
- SchedulingContext: spendable CPU-time object for realtime or admitted execution. It carries budget, period, relative deadline, priority/criticality, CPU mask, and overrun policy.
- CpuIsolationLease: CPU placement, exclusivity, and noise/nohz authority. It is not CPU budget and must charge consumed runtime to a SchedulingContext or scheduler ResourceLedger.
- NoHzEligibility / NoHzActivation: reviewed eligibility plus scheduler-proven current CPU state. They do not grant resource credit.
- RealtimeIsland: admitted bundle consuming SchedulingContexts plus memory, device, ring, and optional CpuIsolationLease reservations.
Do not create a second CPU budget system under nohz, SQPOLL, or realtime terminology. Those features select placement and execution mode; CPU time is still charged through scheduling-context or scheduler-ledger authority.
Reservation Lifecycle
Every resource allocation follows the same lifecycle:
```text
reserve(request, limits, expected_state)
  -> reserved(token)
  -> denied(reason)
commit(token)
  -> committed(resource)
  -> rollback(token, reason)
release(resource)
  -> released
```
Rules:
- reserve validates structure, bounds, ownership, and available quota before any externally visible mutation.
- commit publishes exactly the resource that was reserved.
- rollback restores all ledgers touched by the reservation.
- release is idempotent from the caller’s perspective but changes ledger state at most once.
- Process exit and cap revocation bulk-release all resources owned only by the exiting process or revoked hold edge.
- Stale handles, exhausted quotas, malformed limits, and unknown profile versions fail closed with typed errors or denials, not panics.
The S.9 transfer transaction is the concrete model for cap transfer and spawn. Other services should reuse the same preflight, reservation, commit, rollback, and audit vocabulary.
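The lifecycle and rules above can be sketched as a small single-class ledger. This is a simplified illustration under stated assumptions, not the S.9 implementation: the `request`, `limits`, and `expected_state` arguments are collapsed into a plain amount, and `Ledger`, `Token`, `Resource`, and `Denied` are hypothetical names.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum Denied {
    QuotaExceeded,
    UnknownToken,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Token(u64);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Resource(u64);

#[derive(Debug, Default)]
pub struct Ledger {
    limit: u64,
    reserved: u64,              // charged, not yet published
    committed: u64,             // charged and published
    next_id: u64,
    pending: HashMap<u64, u64>, // token id -> amount
    live: HashMap<u64, u64>,    // resource id -> amount
}

impl Ledger {
    pub fn new(limit: u64) -> Self {
        Ledger { limit, ..Default::default() }
    }

    pub fn in_use(&self) -> u64 {
        self.reserved + self.committed
    }

    /// Checks quota before any state changes; fails closed on overflow.
    pub fn reserve(&mut self, amount: u64) -> Result<Token, Denied> {
        let total = self.in_use().checked_add(amount).ok_or(Denied::QuotaExceeded)?;
        if total > self.limit {
            return Err(Denied::QuotaExceeded);
        }
        let id = self.next_id;
        self.next_id += 1;
        self.reserved += amount;
        self.pending.insert(id, amount);
        Ok(Token(id))
    }

    /// Publishes exactly the reserved amount; a stale token is denied.
    pub fn commit(&mut self, token: Token) -> Result<Resource, Denied> {
        let amount = self.pending.remove(&token.0).ok_or(Denied::UnknownToken)?;
        self.reserved -= amount;
        self.committed += amount;
        self.live.insert(token.0, amount);
        Ok(Resource(token.0))
    }

    /// Restores the ledger to its state before the reservation.
    pub fn rollback(&mut self, token: Token) -> Result<(), Denied> {
        let amount = self.pending.remove(&token.0).ok_or(Denied::UnknownToken)?;
        self.reserved -= amount;
        Ok(())
    }

    /// Idempotent for the caller: a repeated release changes nothing.
    pub fn release(&mut self, resource: Resource) {
        if let Some(amount) = self.live.remove(&resource.0) {
            self.committed -= amount;
        }
    }
}
```

Removing the entry from `pending` or `live` before adjusting the counter is what makes each charge move at most once, even if a caller replays a token or releases twice.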
Donation and Shared Services
Shared services handle many sessions in one process. They need bounded server-side state without treating caller identity as authority.
Donation is a lease from one ledger to another for a named operation:
```text
Donation {
  donorSessionId
  donorLedgerId
  receiverServiceId
  resourceClass
  amount
  expiresAtMs
  callId
}
```
A donation can pay for queue entries, scratch bytes, temporary storage, outbound bytes, model tokens, or CPU budget needed to serve one request. It does not grant unrelated authority to the service and does not let the caller spend the service’s own management budget. When the call finishes, times out, is cancelled, or the session exits, unused donation is returned and used donation is charged to the donor’s accounting record.
Services may also have their own base budgets for resident state. Per-client budgets and service base budgets are separate ledger entries so a single client cannot hide consumption inside the service account.
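The donation rules can be illustrated with a lease that is carved out of the donor's ledger, spent by the service, and then settled. This sketch assumes a consumable budget (for example bytes or tokens) where used donation permanently counts against the donor; `DonorLedger`, `Donation`, and their methods are hypothetical, and expiry is not modeled.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum DonationError {
    DonorQuotaExceeded,
    DonationExhausted,
}

/// Donor-side ledger entry for one resource class (hypothetical).
#[derive(Debug)]
pub struct DonorLedger {
    limit: u64,
    leased: u64,  // currently lent to in-flight calls
    charged: u64, // consumed by finished calls
}

/// Lease handed to the service for exactly one call.
#[derive(Debug)]
pub struct Donation {
    call_id: u64,
    amount: u64,
    spent: u64,
}

impl DonorLedger {
    pub fn new(limit: u64) -> Self {
        DonorLedger { limit, leased: 0, charged: 0 }
    }

    pub fn available(&self) -> u64 {
        self.limit - self.leased - self.charged
    }

    pub fn donate(&mut self, call_id: u64, amount: u64) -> Result<Donation, DonationError> {
        if amount > self.available() {
            return Err(DonationError::DonorQuotaExceeded);
        }
        self.leased += amount;
        Ok(Donation { call_id, amount, spent: 0 })
    }

    /// Consumes the lease: the used part is charged, the rest is returned.
    pub fn settle(&mut self, donation: Donation) {
        self.leased -= donation.amount;
        self.charged += donation.spent;
    }
}

impl Donation {
    pub fn call_id(&self) -> u64 {
        self.call_id
    }

    /// The service can spend only what was donated for this call.
    pub fn spend(&mut self, amount: u64) -> Result<(), DonationError> {
        let total = self.spent.checked_add(amount).ok_or(DonationError::DonationExhausted)?;
        if total > self.amount {
            return Err(DonationError::DonationExhausted);
        }
        self.spent = total;
        Ok(())
    }
}
```

Because `settle` takes the `Donation` by value, the same lease cannot be settled twice, and the service never holds a handle to the donor's remaining budget.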
Profile Binding
Profiles are selected by policy inputs:
- manifest-seeded operators and recovery identities,
- local account records,
- service account records,
- guest and anonymous admission rules,
- external identity bindings,
- test manifests and QEMU smoke profiles,
- future driver, storage, network, and runtime launch policies.
The broker or supervisor translates those profiles into concrete limits at session creation, spawn, service start, or cap minting time. The translation must record:
- profile ID, version ID, and policy epoch,
- ledger owner and resource class,
- hard limit and optional token-bucket window,
- source policy and approving broker/supervisor,
- audit record ID for the grant,
- expiry or revocation epoch if the budget is leased.
A session can carry profile summaries for audit and display, but the summaries do not enforce quota. Enforcement lives where the resource is created or used.
Resource Classes
Kernel and Process Resources
Cap slots, process count, thread count, kernel stacks, pending IPC submissions, ring scratch, outstanding calls, and endpoint queue entries are kernel or kernel-object resources. Their checks belong before spawn, thread creation, transfer, IPC, and ring dispatch side effects.
Memory
Current VirtualMemory mappings and held MemoryObject caps charge the
process-owned frame-grant ledger of record. The address space records borrowed
object-backed pages at the same tracking limit so unmap and teardown can
distinguish them from anonymous pages, but that tracking is not a second
enforcement counter. Future reserve/commit/decommit semantics split virtual
reservation from committed physical memory: VirtualMemory.reserve charges a
virtual-reservation ledger, while VirtualMemory.commit and compatibility
VirtualMemory.map charge the committed-memory/frame ledger before pages
become accessible. Decommit releases physical commit budget while preserving
virtual reservation budget until unmap.
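The future reserve/commit/decommit split can be pictured as two counters with separate limits. The sketch below is a deliberate simplification: it tracks aggregate byte counts rather than address ranges, and it assumes unmap only releases reservation that is no longer committed. `VmLedger` and `VmDenied` are hypothetical names, not the VirtualMemory interface.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum VmDenied {
    ReservationLimit,
    CommitLimit,
    NotReserved,
    NotCommitted,
}

/// Aggregate virtual-reservation and physical-commit counters (hypothetical).
#[derive(Debug)]
pub struct VmLedger {
    reserve_limit: u64,
    commit_limit: u64,
    reserved: u64,
    committed: u64,
}

impl VmLedger {
    pub fn new(reserve_limit: u64, commit_limit: u64) -> Self {
        VmLedger { reserve_limit, commit_limit, reserved: 0, committed: 0 }
    }

    pub fn reserved(&self) -> u64 {
        self.reserved
    }

    pub fn committed(&self) -> u64 {
        self.committed
    }

    /// Charges only the virtual-reservation ledger.
    pub fn reserve(&mut self, bytes: u64) -> Result<(), VmDenied> {
        let total = self.reserved.checked_add(bytes).ok_or(VmDenied::ReservationLimit)?;
        if total > self.reserve_limit {
            return Err(VmDenied::ReservationLimit);
        }
        self.reserved = total;
        Ok(())
    }

    /// Charges the commit ledger before pages would become accessible.
    pub fn commit(&mut self, bytes: u64) -> Result<(), VmDenied> {
        let total = self.committed.checked_add(bytes).ok_or(VmDenied::CommitLimit)?;
        if total > self.reserved {
            return Err(VmDenied::NotReserved);
        }
        if total > self.commit_limit {
            return Err(VmDenied::CommitLimit);
        }
        self.committed = total;
        Ok(())
    }

    /// Releases commit budget; the reservation stays charged.
    pub fn decommit(&mut self, bytes: u64) -> Result<(), VmDenied> {
        if bytes > self.committed {
            return Err(VmDenied::NotCommitted);
        }
        self.committed -= bytes;
        Ok(())
    }

    /// Releases reservation budget for bytes that are not committed.
    pub fn unmap(&mut self, bytes: u64) -> Result<(), VmDenied> {
        if bytes > self.reserved - self.committed {
            return Err(VmDenied::NotReserved);
        }
        self.reserved -= bytes;
        Ok(())
    }
}
```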
Storage
Storage services own byte, object, namespace-entry, and snapshot ledgers.
home, config, cache, and tmp are separate sub-ledgers even when backed
by the same Store. Temporary session storage expires on logout or session
expiry. Cache quota may be reclaimed by policy. Home/config quota should not
be reclaimed without explicit account/storage policy.
Logging and Audit
Log volume uses token buckets and retention limits. Audit entries required for security state transitions should have a protected emergency path; ordinary application logs must not starve audit. If audit storage is unavailable, the system enters a bounded emergency mode rather than silently dropping mandatory security events.
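A minimal sketch of this split, assuming one token bucket for ordinary logs and a separate one for audit so that application output cannot drain the audit path. All names here (`TokenBucket`, `LogLimiter`, `LogOutcome`) are hypothetical, and the clock is passed in explicitly to keep the example deterministic.

```rust
/// Token bucket measured in bytes, refilled per elapsed millisecond.
#[derive(Debug)]
pub struct TokenBucket {
    capacity: u64,
    tokens: u64,
    refill_per_ms: u64,
    last_ms: u64,
}

impl TokenBucket {
    pub fn new(capacity: u64, refill_per_ms: u64, now_ms: u64) -> Self {
        TokenBucket { capacity, tokens: capacity, refill_per_ms, last_ms: now_ms }
    }

    pub fn try_take(&mut self, bytes: u64, now_ms: u64) -> bool {
        // Refill for elapsed time, capped at capacity; saturating math
        // keeps a hostile clock or size from overflowing the counters.
        let elapsed = now_ms.saturating_sub(self.last_ms);
        let refill = elapsed.saturating_mul(self.refill_per_ms);
        self.tokens = self.tokens.saturating_add(refill).min(self.capacity);
        self.last_ms = self.last_ms.max(now_ms);
        if bytes <= self.tokens {
            self.tokens -= bytes;
            true
        } else {
            false
        }
    }
}

#[derive(Clone, Copy, Debug)]
pub enum LogClass {
    Ordinary,
    Audit,
}

#[derive(Debug, PartialEq, Eq)]
pub enum LogOutcome {
    Written,
    Suppressed,       // ordinary log dropped and counted
    AuditUnavailable, // caller must enter the bounded emergency path
}

#[derive(Debug)]
pub struct LogLimiter {
    ordinary: TokenBucket,
    audit: TokenBucket,
    suppressed: u64,
}

impl LogLimiter {
    pub fn new(ordinary: TokenBucket, audit: TokenBucket) -> Self {
        LogLimiter { ordinary, audit, suppressed: 0 }
    }

    pub fn log(&mut self, class: LogClass, bytes: u64, now_ms: u64) -> LogOutcome {
        match class {
            LogClass::Ordinary => {
                if self.ordinary.try_take(bytes, now_ms) {
                    LogOutcome::Written
                } else {
                    self.suppressed = self.suppressed.saturating_add(1);
                    LogOutcome::Suppressed
                }
            }
            LogClass::Audit => {
                if self.audit.try_take(bytes, now_ms) {
                    LogOutcome::Written
                } else {
                    LogOutcome::AuditUnavailable
                }
            }
        }
    }

    pub fn suppressed(&self) -> u64 {
        self.suppressed
    }
}
```

Returning a distinct `AuditUnavailable` outcome, rather than silently counting a drop, leaves the decision to enter emergency mode with the caller.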
Network
Network profiles select listener authority, outbound connection classes, socket counts, byte windows, and remote scopes. A normal local account may receive client network caps; listener authority requires service policy, operator policy, or an application-specific grant. Anonymous remote sessions receive only protocol state needed to authenticate or create an account.
CPU and Scheduling
CPU share and runtime budget belong to the scheduler or future scheduling context. Until full scheduling-context donation exists, CPU limits can be coarse token buckets and supervisor policy. Later realtime, media, and driver work should use explicit period/budget/deadline records rather than ad hoc sleep or polling loops.
Devices and Providers
DMA pools, MMIO mappings, interrupts, cloud provider calls, LLM tokens, media frames, and external API calls are scarce resources too. The first proof may use service-level ledgers, but the rule is the same: one ledger of record, typed reservation, explicit release, audit-visible denial.
For the S.11.2 userspace-driver transition, device ledgers must account at
least DMA pool bytes, DMA buffer count, descriptor or ring depth, MMIO mappings,
interrupt holds, and in-flight DMA submissions. A DMAPool reservation is not
only memory allocation; it is also device-visible write authority and must be
released through the same revoke/quiesce/reset path that makes future reuse
safe.
Canonical device ledger concepts:
```text
dma_pool_bytes
dma_buffer_count
dma_descriptor_count
mmio_mapping_count
interrupt_hold_count
inflight_dma_submission_count
```
These fields are device-manager accounting concepts even if the first
implementation uses different internal names. They must have one ledger of
record. DMA pool bytes and buffer counts are not interchangeable with ordinary
MemoryObject ownership, because device-visible memory also carries IOVA,
descriptor, reset, and stale-completion obligations.
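Because a device reservation spans several of these classes at once, admission has to be all-or-nothing. The following sketch shows one way to check every class before changing any counter. It assumes hypothetical `DeviceCounts` and `DeviceLedger` types and omits release, which in practice would go through the revoke/quiesce/reset path rather than a bare decrement.

```rust
/// One value per canonical device ledger concept.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct DeviceCounts {
    pub dma_pool_bytes: u64,
    pub dma_buffer_count: u64,
    pub dma_descriptor_count: u64,
    pub mmio_mapping_count: u64,
    pub interrupt_hold_count: u64,
    pub inflight_dma_submission_count: u64,
}

const CLASS_NAMES: [&str; 6] = [
    "dma_pool_bytes",
    "dma_buffer_count",
    "dma_descriptor_count",
    "mmio_mapping_count",
    "interrupt_hold_count",
    "inflight_dma_submission_count",
];

impl DeviceCounts {
    fn as_array(&self) -> [u64; 6] {
        [
            self.dma_pool_bytes,
            self.dma_buffer_count,
            self.dma_descriptor_count,
            self.mmio_mapping_count,
            self.interrupt_hold_count,
            self.inflight_dma_submission_count,
        ]
    }

    fn from_array(a: [u64; 6]) -> Self {
        DeviceCounts {
            dma_pool_bytes: a[0],
            dma_buffer_count: a[1],
            dma_descriptor_count: a[2],
            mmio_mapping_count: a[3],
            interrupt_hold_count: a[4],
            inflight_dma_submission_count: a[5],
        }
    }
}

/// Hypothetical device-manager ledger of record for one device.
#[derive(Debug)]
pub struct DeviceLedger {
    limits: DeviceCounts,
    used: DeviceCounts,
}

impl DeviceLedger {
    pub fn new(limits: DeviceCounts) -> Self {
        DeviceLedger { limits, used: DeviceCounts::default() }
    }

    pub fn used(&self) -> DeviceCounts {
        self.used
    }

    /// Checks every class first; on denial no counter has changed and the
    /// name of the first exhausted class is returned.
    pub fn reserve(&mut self, request: DeviceCounts) -> Result<(), &'static str> {
        let used = self.used.as_array();
        let limits = self.limits.as_array();
        let req = request.as_array();
        let mut next = [0u64; 6];
        for i in 0..6 {
            match used[i].checked_add(req[i]) {
                Some(total) if total <= limits[i] => next[i] = total,
                _ => return Err(CLASS_NAMES[i]),
            }
        }
        self.used = DeviceCounts::from_array(next);
        Ok(())
    }
}
```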
Failure Semantics
Quota failure is a normal result, not a crash:
| Condition | Result |
|---|---|
| Malformed request | Invalid input / typed transport error |
| Caller exceeds hard limit | Quota denied / overloaded |
| Service base budget exhausted | Service overloaded |
| Donated budget exhausted | Request denied or partial result |
| Stale profile version | Denied; refresh session/profile |
| Ledger mismatch or rollback failure | Enter recovery/emergency mode |
Retry policy belongs to the caller or supervisor. Kernel and service code must not spin, allocate unbounded retry queues, or emit unbounded diagnostics after quota failure.
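The table above maps naturally onto typed results. The sketch below is one possible encoding; the enum and function names are hypothetical, and the "partial result" alternative for an exhausted donation is not modeled.

```rust
/// Conditions from the failure table (hypothetical encoding).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum QuotaCondition {
    MalformedRequest,
    CallerOverHardLimit,
    ServiceBaseExhausted,
    DonationExhausted,
    StaleProfileVersion,
    LedgerMismatch,
}

/// Typed results returned instead of panicking.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum QuotaResult {
    InvalidInput,
    QuotaDenied,
    ServiceOverloaded,
    RequestDenied,
    RefreshProfile,
    EnterRecovery,
}

/// Total mapping: every condition has exactly one typed result.
pub fn classify(condition: QuotaCondition) -> QuotaResult {
    match condition {
        QuotaCondition::MalformedRequest => QuotaResult::InvalidInput,
        QuotaCondition::CallerOverHardLimit => QuotaResult::QuotaDenied,
        QuotaCondition::ServiceBaseExhausted => QuotaResult::ServiceOverloaded,
        QuotaCondition::DonationExhausted => QuotaResult::RequestDenied,
        QuotaCondition::StaleProfileVersion => QuotaResult::RefreshProfile,
        QuotaCondition::LedgerMismatch => QuotaResult::EnterRecovery,
    }
}

/// Only ledger inconsistency escalates beyond an ordinary denial.
pub fn requires_recovery(result: QuotaResult) -> bool {
    matches!(result, QuotaResult::EnterRecovery)
}
```

An exhaustive `match` with no wildcard arm means a newly added condition fails to compile until it is given an explicit result.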
Audit and Status
Auditable events:
- profile-to-ledger translation,
- reservation denial,
- successful budget grant,
- donation start/commit/release,
- cap or resource revocation,
- process-exit cleanup,
- rollback or recovery-mode entry,
- administrative profile change.
Status views should expose current usage, limits, denial counts, and suppressed diagnostic counts by resource class. They must redact sensitive account, network, provider, and object identifiers unless the viewer holds a suitable audit/status cap.
Verification Gates
Before treating resource profiles as complete for any caller class, add checks at the affected resource owner:
- Host tests for limit parsing, stale profile rejection, reservation/rollback, and one-ledger-of-record invariants.
- QEMU smokes proving quota denial for process/thread/cap, pending IPC submissions, endpoint queue, timer waiter, memory, storage, log, and network resources as they exist.
- Hostile exhaustion tests that do not panic, leak frames, leak cap slots, or leave partial child processes.
- Process-exit and revocation tests proving all charges release exactly once.
- Audit/status tests showing denial and cleanup are visible without exposing secrets.
- Kani or property tests for small pure ledger primitives when bounds are fixed enough to model.
Relationships
- Authority Accounting: S.9 defines the current authority graph and process-ledger transaction model. This proposal generalizes the quota vocabulary to services, storage, networking, sessions, and future devices.
- User Identity and Policy: account and session resource profiles select templates. Brokers and supervisors translate them into ledgers and wrapper caps.
- OOM Handling and Swap: memory commitment, reclaim, and swap policy are the memory-specific part of this model.
- Storage and Naming: Store/Namespace services own storage ledgers for homes, config, cache, tmp, snapshots, and imports.
- System Monitoring: status and metrics expose derived ledger views, not parallel enforcement counters.
Non-Goals
- No Unix cgroups clone as the primary abstraction.
- No identity-based quota enforcement in the kernel.
- No global mutable quota database trusted by every subsystem.
- No claim that existing code already enforces every resource class above.
- No unbounded best-effort mode for guests, anonymous callers, tests, or service accounts.
Open Questions
- Which ledger IDs and status schemas should become stable ABI first?
- How much CPU-budget enforcement is useful before scheduling contexts exist?
- Should quota donation be represented as a general capability type or as method-specific sideband on selected service calls?
- Which storage quota primitive is first: bytes, object count, namespace entries, or snapshots?
- Which proofs belong in capos-lib versus resource-service-specific tests?