Proposal: Capability-Native System Monitoring
How capOS should expose logs, metrics, health, traces, crash records, and
service status without introducing global /proc, ambient log access, or a
privileged monitoring daemon that bypasses the capability model.
Problem
The current system is observable mostly through serial output, QEMU exit status, smoke-test lines, CQE error codes, and a small measurement-only build feature. That is enough for early kernel work, but it is not enough for a system whose claims depend on service decomposition, explicit authority, restart policy, auditability, and later cloud operation.
Monitoring is also not harmless. A monitoring service can reveal capability
topology, service names, scoped subject references, transport metadata, timing,
crash context, request payloads, and security decisions. If capOS imports a
Unix-style “read everything under
/proc” or “global syslog” model, monitoring becomes an ambient authority
escape hatch. If it imports a kernel-programmable tracing model too early, it
adds a large privileged execution surface before the basic service graph is
stable.
The design target is narrower: make operational state visible through typed, attenuable capabilities. A process should observe only the services, logs, and signals it was granted authority to inspect.
Current State
Implemented signal sources:
- Kernel diagnostics are printed through COM1 serial via kprintln!, timestamped with the PIT tick counter. Panic and fault paths use a mutex-free emergency serial writer.
- Userspace logging currently goes through the kernel Console capability, backed directly by serial and bounded per call.
- A Phase 1 capability log surface has landed: LogSink/LogReader over a bounded drop-oldest kernel ring (kernel/src/cap/log.rs), with SystemConfig.logLevel drop enforcement at the sink, serial forwarding of accepted records, and scoped sink/reader caps granted at spawn (proof: make run-monitoring-log-smoke). Metrics, status, health, traces, crash records, the narrow kernel stats caps, and persistent retention remain future phases.
- Runtime panics can use an emergency console path, then exit with a fixed code.
- Capability-ring CQEs carry structured transport results, including negative CAP_ERR_* values and serialized CapException payloads.
- The ring tracks cq_overflow, corrupted SQ/CQ recovery, and bounded SQE dispatch, but these facts are not exported as normal metrics.
- ProcessSpawner and ProcessHandle.wait expose basic child lifecycle observation, but restart policy, health checks, and exported-cap lifecycle are future work.
- capos-lib::ResourceLedger tracks cap slots, outstanding calls, scratch bytes, and frame grants, but only as local accounting state.
- The measure feature adds benchmark-only counters and TSC helpers for controlled make run-measure boots.
- SystemConfig.logLevel exists in the schema and is printed at boot, but there is no filtering, routing, or retention policy behind it.
- An AuditLog capability exists in the schema and kernel (kernel/src/cap/audit_log.rs), used by AuthorityBroker to record auth, setup, session, broker, and shell-launch events. Currently writes to serial via kprintln!; no ring-buffer reader cap or persistent retention yet.
- A HardwareAuditLog capability with a bounded volatile ring buffer and drain/snapshot readers exists for DMA/MMIO/Interrupt cap lifecycle events (kernel/src/cap/hardware_audit.rs), including sequence numbers and dropped-record counts. A userspace hardware-audit-service drains it into a Store/Namespace-backed hash-chained segment ring and exposes scoped HardwareAuditReader snapshots; the current backing StoreCap is RAM-backed, so post-reboot retention is still a storage-backend concern.
- hardware_release_log module (kernel/src/cap/hardware_release_log.rs) emits DMA pool, DMA buffer, DeviceMmio, and Interrupt release outcomes to serial; no reader cap or retention yet.
That means the system has useful raw signals and partial audit infrastructure but lacks a unified capability-shaped monitoring architecture with log routing, metrics export, and reader caps for most signal classes.
Design Principles
- Observation is authority. Reading logs, status, metrics, traces, crash records, or audit entries requires a capability.
- No global monitoring root. SystemStatus(all), LogReader(all), and ServiceSupervisor(all) are powerful caps. Normal sessions receive scoped wrappers.
- Kernel facts, userspace policy. The kernel may expose bounded facts about processes, rings, resources, and faults. Retention, filtering, aggregation, health semantics, restart policy, and user-facing views belong in userspace.
- Separate signal classes. Logs, metrics, lifecycle events, traces, health, crash records, and audit logs have different readers, retention rules, and security properties.
- Bounded by construction. Every producer path has a byte, entry, or time budget. Loss is explicit and summarized.
- Payload capture is exceptional. Default tracing records headers, interface IDs, method IDs, sizes, result codes, and scoped transport identifiers only when authorized. Capturing method payloads needs a stronger cap because payloads may contain secrets.
- Serial remains emergency plumbing. Early boot, panic, and recovery still need direct serial output. Normal services should receive log caps rather than broad Console.
- Audit is not debug logging. Audit records security-relevant decisions and capability lifecycle events. It is append-only from producers and exposed through scoped readers.
- Pull by default, push when justified. Status and metrics are pull-shaped (reader polls a snapshot cap). Logs, lifecycle events, crash records, and audit entries are push-shaped (producer calls into a sink). Traces are pull with an explicit arm/drain lifecycle because capture is expensive. Each direction has its own cap surface; do not generalize one shape to cover all signals.
- Narrow kernel stats caps over one god-cap. The kernel exposes bounded facts through several small read-only caps (ring, scheduler, resource ledger, frames, endpoints, caps, crash) rather than one KernelDiagnostics that grants everything. Narrow caps let an init-owned status service be assembled by composition, and let a broker lease a subset to an operator without handing over the rest.
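The "bounded by construction" principle can be illustrated with a small drop-oldest ring. This is a minimal sketch, not the ring in kernel/src/cap/log.rs: the type name BoundedRing, its methods, and the cursor scheme are all assumptions made for illustration. The point it shows is that a slow reader learns exactly how many records it missed instead of silently skipping them.

```rust
use std::collections::VecDeque;

/// Hypothetical bounded drop-oldest ring (illustrative only).
pub struct BoundedRing<T> {
    buf: VecDeque<T>,
    capacity: usize,
    /// Cursor of the oldest record still retained.
    first_cursor: u64,
    /// Total records evicted to make room for newer ones.
    dropped: u64,
}

impl<T: Clone> BoundedRing<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "ring needs at least one slot");
        Self { buf: VecDeque::with_capacity(capacity), capacity, first_cursor: 0, dropped: 0 }
    }

    /// Append a record, evicting the oldest one when the ring is full.
    pub fn push(&mut self, record: T) {
        if self.buf.len() == self.capacity {
            self.buf.pop_front();
            self.first_cursor += 1;
            self.dropped += 1;
        }
        self.buf.push_back(record);
    }

    /// Read up to `max` records starting at `cursor`.
    /// Returns (records, next_cursor, missed), where `missed` counts the
    /// records between `cursor` and the oldest retained record.
    pub fn read(&self, cursor: u64, max: usize) -> (Vec<T>, u64, u64) {
        let start = cursor.max(self.first_cursor);
        let missed = start - cursor;
        let offset = (start - self.first_cursor) as usize;
        let records: Vec<T> = self.buf.iter().skip(offset).take(max).cloned().collect();
        let next = start + records.len() as u64;
        (records, next, missed)
    }

    /// Lifetime eviction count, suitable for export as a counter.
    pub fn dropped(&self) -> u64 {
        self.dropped
    }
}
```

A reader that starts at cursor 0 after three pushes into a two-slot ring would receive the two newest records, a next cursor of 3, and a missed count of 1.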
Signal Taxonomy
Logs
Human-oriented diagnostic records:
- severity, component, service name, pid, optional subject/service reference, monotonic timestamp, message text;
- rate-limited at producer and log service boundaries;
- suitable for serial forwarding, ring-buffer retention, and later storage;
- not a source of truth for security decisions.
Metrics
Low-cardinality numeric state:
- per-process ring SQ/CQ occupancy, cq_overflow, invalid SQE counts, opcode counts, transport error counts;
- scheduler runnable/blocked counts, direct IPC handoffs, cap-enter timeouts, process exits;
- resource ledger usage: cap slots, outstanding calls, scratch bytes, frame grants, endpoint queue occupancy, VM mapped pages;
- heap/frame allocator pressure;
- later device, network, storage, and CPU-time counters.
Metric shape is fixed to three forms:
- Counter: monotonic u64, reset only by reboot. Cumulative semantics make aggregation composable.
- Gauge: i64 that moves both ways. Used for queue depths, free-frame counts, mapped-page counts.
- Histogram: fixed bucket layout carried in the descriptor, u64 per bucket. Used for ring-dispatch duration, context-switch latency, IPC RTT.
Richer shapes (top-k tables, exponential histograms) are emitted as opaque typed payloads through the producer-scoped envelope described under “Core Interfaces”; the generic reader treats them as data, and a schema-aware viewer decodes them. Metrics should be snapshots or monotonic counters, not unbounded label streams.
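The three metric shapes can be sketched in a few lines of Rust. The type names, the saturating arithmetic, and the extra overflow bucket are assumptions for illustration; the proposal only fixes the value types (u64 counter, i64 gauge, u64 per histogram bucket) and the requirement that the bucket layout be fixed.

```rust
/// Monotonic counter: only ever increases (reset only by reboot).
#[derive(Default)]
pub struct Counter(u64);

impl Counter {
    pub fn add(&mut self, n: u64) {
        self.0 = self.0.saturating_add(n);
    }
    pub fn get(&self) -> u64 {
        self.0
    }
}

/// Gauge: signed value that can move in both directions.
#[derive(Default)]
pub struct Gauge(i64);

impl Gauge {
    pub fn set(&mut self, v: i64) {
        self.0 = v;
    }
    pub fn adjust(&mut self, delta: i64) {
        self.0 = self.0.saturating_add(delta);
    }
    pub fn get(&self) -> i64 {
        self.0
    }
}

/// Histogram with a fixed bucket layout. `bounds[i]` is the inclusive upper
/// bound of bucket `i`; values above the last bound land in `overflow`
/// (the overflow bucket is an assumption of this sketch).
pub struct Histogram<const N: usize> {
    bounds: [u64; N],
    counts: [u64; N],
    overflow: u64,
}

impl<const N: usize> Histogram<N> {
    pub fn new(bounds: [u64; N]) -> Self {
        Self { bounds, counts: [0; N], overflow: 0 }
    }

    /// Record one observation in the first bucket whose bound covers it.
    pub fn observe(&mut self, value: u64) {
        match self.bounds.iter().position(|&b| value <= b) {
            Some(i) => self.counts[i] += 1,
            None => self.overflow += 1,
        }
    }

    pub fn counts(&self) -> (&[u64], u64) {
        (&self.counts, self.overflow)
    }
}
```

Because the bucket count is a compile-time constant, memory use is known up front, which is the property that separates these shapes from open-ended label streams.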
Events
Discrete lifecycle facts:
- process spawned, started, exited, waited, killed, or failed to load;
- service declared healthy, unhealthy, restarting, quiescing, or upgraded;
- endpoint queue overflow, cancellation, disconnected holder, transfer rollback;
- resource quota rejection;
- device reset, interrupt storm, link up/down, block I/O error once devices exist.
Events are useful for supervisors and status views. They may also feed logs.
Traces
Bounded high-detail capture for debugging:
- SQE/CQE records around one pid, service subtree, endpoint, cap id, or error class;
- optional capnp payload capture only with explicit authority;
- offline schema-aware viewer for reproducing and explaining a failure;
- short retention by default.
This is the Ring as Black Box milestone from docs/tasks/README.md, not full replay.
Health
Declared service state:
- ready, starting, degraded, draining, failed, stopped;
- last successful health check and last failure reason;
- dependency health summaries;
- supervisor-owned restart intent and backoff state.
Health is not inferred only from process liveness. A process can be alive and unhealthy, or intentionally draining and still useful.
Crash Records
Panic, exception, and fatal userspace runtime records:
- boot stage, current pid if known, fault vector, RIP/CR2/error code where applicable, recent SQE context when safe, and last serial line cursor;
- bounded, redacted, and readable through a crash/debug capability;
- serial fallback remains mandatory when no reader exists.
Audit
Security and policy records:
- session creation, approval request, policy decision, cap grant, cap transfer, cap release/revocation, denial, declassification/relabel operation;
- no raw authentication proofs, private keys, bearer tokens, or full environment dumps;
- query access is scoped by session, service subtree, or operator role.
ITU-T X.700 Series Alignment
The ITU-T X.700 Systems Management framework (OSI management) predates modern observability stacks by two decades but still offers a cleaner decomposition than ad-hoc log/metric/trace categorization. capOS is not implementing CMIS/CMIP (X.710/X.711 assume ASN.1 BER over an OSI stack capOS will never speak); the value is the signal taxonomy and field model, not the transport.
| capOS signal class | Closest ITU-T | What we take from it |
|---|---|---|
| Logs | X.735 Log control function | Log record identity (moRef analog = component+pid+service_ref), severity mapping, scoped reader model. |
| Metrics | X.739 Metric objects and attributes | Fixed metric shapes (counter / gauge / histogram) as opposed to open-ended label streams. |
| Events | X.734 Event report management function | Discriminator-driven filtering, event-type taxonomy, producer/consumer separation. |
| Alarms (events) | X.733 Alarm reporting function | Perceived severity (cleared/indeterminate/warning/minor/major/critical), probable cause, specific problem, trend indication, proposed repair action. |
| Health | X.731 State management function | Operational / administrative / usage state model (enabled/disabled, unlocked/locked, idle/active/busy) feeding HealthState. |
| Audit | X.740 Security audit trail function | Audit record field model: event type, time, initiator, target, outcome, evidence chain. |
| Crash records | X.733 + X.736 Security alarm reporting function | Structured cause + severity for fatal/integrity events; security-relevant crashes flow through both the crash cap and the audit cap. |
FCAPS coverage. X.700/X.701 defines the five management functional areas: Fault, Configuration, Accounting, Performance, Security. This proposal covers Fault (crash records, alarms), Performance (metrics), and Security (audit). Configuration and Accounting are deliberately out of scope here:
- Configuration management (X.700 “C”): versioned, signed configuration deltas applied to running services. Partially covered by cloud-metadata-proposal.md (ManifestDelta), but capOS has no general configuration-management proposal yet. Candidate for a separate proposal once the manifest-executor and live-upgrade work stabilize.
- Accounting management (X.700 “A”): per-principal, per-session, per-service resource-usage ledgers with retention and export. The kernel’s ResourceLedger is the lowest layer; aggregation, persistence, and audit-grade usage records are undesigned. Candidate for a separate proposal; would compose with the audit cap and the user-identity session model.
Updated Field Mappings
LogRecord maps roughly onto X.735 logRecord:
| X.735 logRecord | capOS LogRecord |
|---|---|
| logRecordId | (cursor + pid + tick) |
| managedObjectClass | component + service name |
| managedObjectInstance | pid + service_ref |
| eventType | Severity (lossy; add explicit eventType once alarm/security records share the pipe) |
| eventTime | tick (monotonic; wall-clock when available) |
| notificationIdentifier | not modeled; add when events need correlation IDs |
Audit records should adopt X.740 fields explicitly. Proposed schema extension once the audit service ships:
enum AuditEventType {
# X.740 §6.1 event categories, pruned to what capOS actually records.
authentication @0; # login, logout, auth failure
accessControl @1; # grant, deny, revoke, transfer
policyDecision @2; # broker decision with plan + constraints
objectLifecycle @3; # capability create/destroy, object reap
securityAlarm @4; # X.736-shaped: integrity/confidentiality violation
serviceControl @5; # restart, upgrade, quiesce, resume
administrative @6; # manifest update, role change
}
enum AuditOutcome {
success @0;
failure @1;
denied @2;
pending @3; # multi-party approval outstanding
}
struct AuditRecord {
tick @0 :UInt64;
eventType @1 :AuditEventType;
initiator @2 :Data; # opaque principal/session ID
target @3 :Text; # interface + service identity
outcome @4 :AuditOutcome;
reason @5 :Text;
evidence @6 :Data; # opaque, bounded; no secrets
}
Alarms (X.733) are a structured subset of Events, not a new signal
class. The ServiceStatus / Health path emits alarms when degraded,
failed, or security-relevant thresholds trip:
enum PerceivedSeverity {
cleared @0;
indeterminate @1;
warning @2;
minor @3;
major @4;
critical @5;
}
enum ProbableCause {
# X.733 Annex A lists ~50 values; capOS starts with the handful that
# match known failure modes and extends as needed.
communicationsError @0;
integrityViolation @1;
operationalViolation @2;
softwareError @3;
underlyingResourceUnavailable @4;
qualityOfServiceAlarm @5;
securityAlarmIntegrity @6;
securityAlarmAccess @7;
}
struct Alarm {
tick @0 :UInt64;
managedObject @1 :Text; # service or cap identity
severity @2 :PerceivedSeverity;
probableCause @3 :ProbableCause;
specificProblem @4 :Text;
trend @5 :AlarmTrend; # X.733 trend indication; the AlarmTrend enum is not yet defined in this sketch
proposedRepair @6 :Text;
}
The taxonomy buys two things the Unix-style “syslog + Prometheus + Jaeger” tower does not: (1) alarms as a first-class signal with a defined severity lattice and probable-cause field, which is how operators actually triage, and (2) audit as a distinct record type with fixed fields rather than a convention-layer over free-form log messages.
ITU-T references
- ITU-T Rec. X.700 (09/92) — Management framework
- ITU-T Rec. X.701 (08/97) — Systems management overview
- ITU-T Rec. X.733 (02/92) — Alarm reporting function
- ITU-T Rec. X.734 (09/92) — Event report management function
- ITU-T Rec. X.735 (09/92) — Log control function
- ITU-T Rec. X.736 (01/92) — Security alarm reporting function
- ITU-T Rec. X.740 (01/92) — Security audit trail function
- ITU-T Rec. X.731 (01/92) — State management function
- ITU-T Rec. X.739 (11/93) — Metric objects and attributes
Proposed Architecture
flowchart TD
Kernel[Kernel primitives] --> KD[KernelDiagnostics cap]
Kernel --> Serial[Emergency serial]
Init[init / root supervisor] --> LogSvc[Log service]
Init --> MetricsSvc[Metrics service]
Init --> StatusSvc[Status service]
Init --> AuditSvc[Audit log]
Init --> TraceSvc[Trace capture service]
KD --> MetricsSvc
KD --> StatusSvc
KD --> TraceSvc
Services[Services and drivers] --> LogSink[Scoped LogSink caps]
Services --> Health[Health caps]
Services --> AuditWriter[Scoped AuditWriter caps]
LogSink --> LogSvc
Health --> StatusSvc
AuditWriter --> AuditSvc
Broker[AuthorityBroker] --> Readers[Scoped readers]
Readers --> Shell[Shell / agent / operator tools]
StatusSvc --> Readers
LogSvc --> Readers
MetricsSvc --> Readers
TraceSvc --> Readers
AuditSvc --> Readers
The important property is that there is no ambient monitoring namespace. The graph is assembled by init and supervisors. Readers are capabilities, not paths.
Core Interfaces
These are conceptual interfaces. They should not be added to
schema/capos.capnp until the current manifest-executor work is complete and a
specific implementation slice needs them.
enum Severity {
debug @0;
info @1;
warn @2;
error @3;
critical @4;
}
struct LogRecord {
tick @0 :UInt64;
severity @1 :Severity;
component @2 :Text;
pid @3 :UInt32;
subjectRef @4 :Data; # privacy-preserving subject/session correlation
sessionRef @5 :Data; # optional scoped session correlation
serviceRef @6 :Data; # optional authorized service/component correlation
transportId @7 :Data; # debug-only ring/endpoint metadata, not identity
message @8 :Text;
}
struct LogFilter {
minSeverity @0 :Severity;
componentPrefix @1 :Text;
pid @2 :UInt32;
includeDebug @3 :Bool;
}
interface LogSink {
write @0 (record :LogRecord) -> ();
}
interface LogReader {
read @0 (cursor :UInt64, maxRecords :UInt32, filter :LogFilter)
-> (records :List(LogRecord), nextCursor :UInt64, dropped :UInt64);
}
LogSink is what ordinary services receive. LogReader is what shells,
operators, supervisors, and diagnostic tools receive. A scoped reader can filter
to one service subtree or session before the caller ever sees the record.
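One way to picture a scoped reader is as a wrapper that holds the broad reader and applies its filter before returning anything. The sketch below is a hedged illustration, not the capOS implementation: the trait shape, AllLogs, ScopedLogReader, and prefix-based scoping are all assumed names and choices, and the record is trimmed to three fields.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct LogRecord {
    pub tick: u64,
    pub component: String,
    pub message: String,
}

/// Stand-in for the LogReader interface (simplified signature).
pub trait LogReader {
    /// Returns up to `max` records at or after `cursor`, plus the next cursor.
    fn read(&self, cursor: u64, max: usize) -> (Vec<LogRecord>, u64);
}

/// Broad reader over every record; in this sketch only the log service holds it.
pub struct AllLogs(pub Vec<LogRecord>);

impl LogReader for AllLogs {
    fn read(&self, cursor: u64, max: usize) -> (Vec<LogRecord>, u64) {
        let start = (cursor as usize).min(self.0.len());
        let end = start.saturating_add(max).min(self.0.len());
        (self.0[start..end].to_vec(), end as u64)
    }
}

/// Attenuated reader: exposes only records whose component starts with `prefix`.
pub struct ScopedLogReader<R: LogReader> {
    inner: R,
    prefix: String,
}

impl<R: LogReader> ScopedLogReader<R> {
    pub fn new(inner: R, prefix: &str) -> Self {
        Self { inner, prefix: prefix.to_string() }
    }
}

impl<R: LogReader> LogReader for ScopedLogReader<R> {
    fn read(&self, cursor: u64, max: usize) -> (Vec<LogRecord>, u64) {
        // The cursor still advances over filtered-out records so the caller
        // makes progress even when nothing in the batch is visible to it.
        let (records, next) = self.inner.read(cursor, max);
        let visible = records
            .into_iter()
            .filter(|r| r.component.starts_with(self.prefix.as_str()))
            .collect();
        (visible, next)
    }
}
```

The holder of a ScopedLogReader has no path back to the inner reader, so the filter cannot be widened by the caller. A stricter wrapper might also hide the cursor gap, since it reveals how many records were filtered.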
Monitoring terminology should use snake_case names in prose and map them to schema-style fields only at the Cap’n Proto boundary:
- subject_ref / session_ref: privacy-preserving identity or session correlation fields.
- service_ref: service instance or component correlation where the reader is authorized.
- transport_id: debug-only ring, endpoint, SQE/CQE, or waiter metadata; never subject identity.
Legacy endpoint badge terminology must not leak into user-facing monitoring
identity. If a low-level transport path still stores a badge-shaped selector,
monitoring may expose it only as debug transport_id under an appropriate
diagnostic cap, not as subject_ref, session_ref, or service_ref.
struct ProcessStatus {
pid @0 :UInt32;
serviceName @1 :Text;
state @2 :Text;
capSlotsUsed @3 :UInt32;
capSlotsMax @4 :UInt32;
outstandingCalls @5 :UInt32;
cqReady @6 :UInt32;
cqOverflow @7 :UInt64;
lastExitCode @8 :Int64;
}
struct ServiceStatus {
name @0 :Text;
health @1 :Text;
pid @2 :UInt32;
restartCount @3 :UInt32;
lastError @4 :Text;
}
interface SystemStatus {
listProcesses @0 () -> (processes :List(ProcessStatus));
listServices @1 () -> (services :List(ServiceStatus));
service @2 (name :Text) -> (status :ServiceStatus);
}
SystemStatus is read-only. A broad instance can see the system; wrappers can
expose one service, one supervision subtree, or one session.
enum MetricKind {
counter @0;
gauge @1;
histogram @2;
}
struct MetricSample {
# Well-known fixed-name slot for counters and gauges the aggregator
# understands without additional schema lookup. Use this for stable
# kernel counters to keep the hot path allocation-free.
name @0 :Text;
kind @1 :MetricKind;
value @2 :Int64;
tick @3 :UInt64;
# Producer-scoped typed envelope for richer samples (histograms,
# top-k tables, per-subsystem structs). Payload is a capnp message;
# the schema is identified by `schemaHash` (capnp node id) and keyed
# per producer. Opaque to the generic reader; a schema-aware viewer
# decodes it.
producerId @4 :UInt64;
schemaHash @5 :UInt64;
payload @6 :Data;
}
struct MetricFilter {
prefix @0 :Text;
service @1 :Text;
}
interface MetricsReader {
snapshot @0 (filter :MetricFilter, maxSamples :UInt32)
-> (samples :List(MetricSample), truncated :Bool);
}
Early metrics should be fixed-name counters and gauges in the name/value
slot. Avoid arbitrary labels until there is a concrete memory and cardinality
policy. The producer-scoped envelope exists so richer samples do not force the
generic reader to learn a string-key taxonomy — if a producer needs per-queue
or per-device detail, it ships a typed capnp struct keyed by schemaHash
rather than synthesizing name strings.
struct TraceSelector {
pid @0 :UInt32;
serviceName @1 :Text;
errorCode @2 :Int32;
includePayloadBytes @3 :Bool;
}
struct TraceRecord {
tick @0 :UInt64;
pid @1 :UInt32;
opcode @2 :UInt16;
capId @3 :UInt32;
methodId @4 :UInt16;
interfaceId @5 :UInt64;
result @6 :Int32;
flags @7 :UInt16;
payload @8 :Data;
}
interface TraceCapture {
arm @0 (selector :TraceSelector, maxRecords :UInt32, maxBytes :UInt32)
-> (captureId :UInt64);
drain @1 (captureId :UInt64, maxRecords :UInt32)
-> (records :List(TraceRecord), complete :Bool, dropped :UInt64);
}
Payload capture should default off. A capture cap that can read payload bytes is closer to a debug privilege than a normal status cap.
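A minimal sketch of that split, assuming a hypothetical CaptureAuthority attached to each capture cap: the selector defaults to metadata-only, and a request for payload bytes is refused unless the cap itself carries the stronger authority. None of these Rust names come from the capOS schema.

```rust
/// What a given capture capability is allowed to record (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum CaptureAuthority {
    MetadataOnly,
    WithPayload,
}

#[derive(Clone, Debug, Default)]
pub struct TraceSelector {
    pub pid: u32,
    /// `Default` yields false, so payload capture is opt-in.
    pub include_payload_bytes: bool,
}

#[derive(Debug, PartialEq)]
pub enum ArmError {
    PayloadNotAuthorized,
}

pub struct TraceCapture {
    authority: CaptureAuthority,
    next_capture_id: u64,
}

impl TraceCapture {
    pub fn new(authority: CaptureAuthority) -> Self {
        Self { authority, next_capture_id: 1 }
    }

    /// Arm a capture. Payload capture is refused unless this cap carries it.
    pub fn arm(&mut self, selector: &TraceSelector) -> Result<u64, ArmError> {
        if selector.include_payload_bytes && self.authority != CaptureAuthority::WithPayload {
            return Err(ArmError::PayloadNotAuthorized);
        }
        let id = self.next_capture_id;
        self.next_capture_id += 1;
        Ok(id)
    }
}
```

The check lives in the capability object rather than in the caller, so a metadata-only cap cannot be talked into payload capture by setting a flag.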
enum HealthState {
starting @0;
ready @1;
degraded @2;
draining @3;
failed @4;
stopped @5;
}
interface Health {
check @0 () -> (state :HealthState, reason :Text);
}
interface ServiceSupervisor {
status @0 () -> (status :ServiceStatus);
restart @1 () -> ();
}
ServiceSupervisor is authority-changing. Normal monitoring readers should not
receive it. A broker can mint a leased ServiceSupervisor(net-stack) for one
operator action.
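A leased, single-action supervisor cap maps naturally onto Rust move semantics. The sketch below is illustrative only: RestartLease, its tick-based expiry, and the in-process ServiceSupervisor stand-in are assumptions, not the broker's actual lease mechanism.

```rust
/// Stand-in for the authority-changing ServiceSupervisor interface.
pub struct ServiceSupervisor {
    service: String,
    restart_count: u32,
}

impl ServiceSupervisor {
    pub fn new(service: &str) -> Self {
        Self { service: service.to_string(), restart_count: 0 }
    }
    pub fn restart(&mut self) {
        self.restart_count += 1;
    }
    pub fn restart_count(&self) -> u32 {
        self.restart_count
    }
    pub fn service(&self) -> &str {
        &self.service
    }
}

#[derive(Debug, PartialEq)]
pub enum LeaseError {
    Expired,
}

/// Hypothetical single-use lease over one supervisor.
pub struct RestartLease<'a> {
    target: &'a mut ServiceSupervisor,
    expires_at_tick: u64,
}

impl<'a> RestartLease<'a> {
    pub fn new(target: &'a mut ServiceSupervisor, expires_at_tick: u64) -> Self {
        Self { target, expires_at_tick }
    }

    /// Taking `self` by value consumes the lease, so it cannot be used twice.
    pub fn restart(self, now_tick: u64) -> Result<(), LeaseError> {
        let RestartLease { target, expires_at_tick } = self;
        if now_tick > expires_at_tick {
            return Err(LeaseError::Expired);
        }
        target.restart();
        Ok(())
    }
}
```

An operator holding a RestartLease can perform exactly one restart before the deadline; a read-only status holder never sees the supervisor at all.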
Kernel Diagnostics Contract
The kernel should expose a small read-only diagnostics surface for facts only the kernel can know:
- process table snapshot: pid, state, service name if known, wait state, exit code, ring physical identity hidden or omitted;
- ring snapshot: SQ/CQ head/tail-derived occupancy, overflow count, corrupted head/tail recovery counts, opcode/error counters;
- resource snapshot: cap slot usage, outstanding calls, scratch reservation, frame grant pages, mapped VM pages, free frame count, heap pressure;
- scheduler snapshot: tick count, current pid, run queue length, blocked count, direct IPC handoff count, timeout wake count;
- crash record: last panic/fault metadata and early boot stage.
The kernel should not implement log routing, alerting, dashboards, retention policy, restart decisions, RBAC, ABAC, or text search. Those are userspace service responsibilities.
Implementation shape:
- Maintain fixed-size counters in existing kernel structures where the source event already occurs.
- Prefer snapshots computed from existing state over duplicate counters when the cost is bounded.
- Expose snapshots through a small set of narrow read-only capabilities, not one KernelDiagnostics god-cap. The initial decomposition:
  - SchedStats: tick count, current pid, run queue length, blocked count, direct IPC handoff count, cap_enter timeout/wake counts.
  - FrameStats: free/used frame counts, frame-grant pages, allocator pressure histogram.
  - RingStats: per-process SQ/CQ occupancy, cq_overflow, corrupted-head recovery counts, opcode counters, transport-error counters.
  - CapTableStats: per-process slot occupancy, generation-rollover counts, insertion/remove rates.
  - EndpointStats: per-endpoint waiter depth, RECV/RETURN match rate, abort/cancellation counts.
  - CrashSnapshot: last panic/fault metadata, early boot stage, recent SQE context when safe.
- Each narrow cap exposes snapshot() -> (sample :MetricSample) or a typed struct; none of them enumerates processes or reads cap tables beyond what the subsystem owns. A trusted status service composes the ones it needs; a broker leases a subset for operator sessions without the rest.
- ProcessInspector (pid-scoped process table, cap-table enumeration, VM map) is a distinct, stronger cap and lives with process-management authority, not with monitoring.
- Convert broad diagnostics into scoped userspace wrappers before handing them to shells or applications.
- Keep panic/fault serial writes independent of any diagnostics service.
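Composition from narrow caps can be sketched with one trait per stats surface. The trait methods and the FixedSched/FixedRing test doubles below are assumptions for illustration; the real caps would be kernel-backed capability objects rather than Rust traits. The status service is generic over only the two caps it was granted and has no way to reach CrashSnapshot.

```rust
/// One trait per narrow stats cap (method sets are hypothetical).
pub trait SchedStats {
    fn run_queue_len(&self) -> u32;
}
pub trait RingStats {
    fn cq_overflow(&self, pid: u32) -> u64;
}
pub trait CrashSnapshot {
    fn last_fault_vector(&self) -> Option<u8>;
}

/// Fixed-value stand-ins so the sketch runs without a kernel.
pub struct FixedSched(pub u32);
impl SchedStats for FixedSched {
    fn run_queue_len(&self) -> u32 {
        self.0
    }
}

pub struct FixedRing(pub u64);
impl RingStats for FixedRing {
    fn cq_overflow(&self, _pid: u32) -> u64 {
        self.0
    }
}

/// Status service assembled from exactly the caps it holds.
pub struct StatusService<S: SchedStats, R: RingStats> {
    sched: S,
    ring: R,
}

impl<S: SchedStats, R: RingStats> StatusService<S, R> {
    pub fn new(sched: S, ring: R) -> Self {
        Self { sched, ring }
    }

    /// One-line summary built only from the granted surfaces.
    pub fn summary(&self, pid: u32) -> String {
        format!(
            "runq={} cq_overflow[pid {}]={}",
            self.sched.run_queue_len(),
            pid,
            self.ring.cq_overflow(pid)
        )
    }
}
```

Granting crash visibility later would mean handing over an additional cap, not widening an existing one.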
Promotion from the measure feature: the benchmark counters in
kernel/src/measure.rs graduate to always-on in RingStats / SchedStats
when the per-event cost is provably a single relaxed atomic add. Cycle-counter
instrumentation (rdtsc/rdtscp) stays behind cfg(feature = "measure")
because it is serializing and benchmark-only. The promotion threshold keeps
normal dispatch builds free of instrumentation cost without forcing monitoring
into a second build configuration.
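The promotion threshold amounts to a counter whose hot path is one relaxed atomic add, as in the sketch below. The struct and method names are assumptions, and a kernel build would import from core::sync::atomic rather than std; the snapshot is not a consistent cut across both fields, which is normally acceptable for monitoring counters.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical always-on ring counters.
pub struct RingCounters {
    sqes_processed: AtomicU64,
    cq_overflow: AtomicU64,
}

impl RingCounters {
    pub const fn new() -> Self {
        Self {
            sqes_processed: AtomicU64::new(0),
            cq_overflow: AtomicU64::new(0),
        }
    }

    /// Hot path: a single relaxed add, with no ordering against other memory.
    #[inline]
    pub fn on_sqe(&self) {
        self.sqes_processed.fetch_add(1, Ordering::Relaxed);
    }

    #[inline]
    pub fn on_cq_overflow(&self) {
        self.cq_overflow.fetch_add(1, Ordering::Relaxed);
    }

    /// Cold path: read both counters for a stats snapshot.
    pub fn snapshot(&self) -> (u64, u64) {
        (
            self.sqes_processed.load(Ordering::Relaxed),
            self.cq_overflow.load(Ordering::Relaxed),
        )
    }
}
```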
Logging Model
Early boot has only serial. After init starts the log service, ordinary services
should receive LogSink rather than raw Console unless they need emergency
console access.
Recommended path:
- Kernel serial remains for boot, panic, and fault records.
- Init starts a userspace log service and passes scoped LogSink caps to children.
- The log service forwards selected records to Console until persistent storage exists.
- SystemConfig.logLevel becomes an initial policy input for which records the log service forwards and retains.
- Session and operator tools receive scoped LogReader caps from a broker.
Services should not put secrets, raw capability payloads, full auth proofs, or arbitrary user input into logs without explicit redaction. Log records are data, not commands.
Metrics and Status
Status answers “what is alive and what state is it in.” Metrics answer “what is the numeric behavior over time.” Keeping them separate avoids a common failure mode where a human-readable status API grows into an unbounded metrics store.
Initial status fields should cover:
- pid, service name, binary name, process state, exit code;
- process handle wait state;
- supervisor health and restart policy once supervision exists;
- cap table occupancy and outstanding call count;
- ring CQ availability and overflow;
- endpoint queue occupancy where authorized.
Initial metrics should cover:
- ring dispatches, SQEs processed, per-op counts, transport error counts;
- cap-enter wait count, timeout count, wake count;
- scheduler context switches and direct IPC handoffs;
- frame free/used counts, frame grant pages, VM mapped pages;
- log records accepted, suppressed, dropped, and forwarded;
- trace records captured and dropped.
Timer/nohz/realtime metrics should be owned by monitoring rather than left as one-off debug prints once those features exist:
- scheduler_tick_count{cpu};
- ticks_suppressed{cpu,mode};
- nohz_enter_count{cpu,kind};
- nohz_exit_count{cpu,reason};
- oneshot_deadline_miss_count;
- sqpoll_busy_ns;
- sqpoll_sleep_count;
- deadline_expired_count;
- budget_exhausted_count;
- realtime_overrun_count;
- donation_depth_max;
- housekeeping_offload_count.
These are correctness signals for nohz/realtime admission, not only performance counters. A scoped monitoring reader may observe them only under the same authority rules as other scheduler and service telemetry.
Current state alignment. Scheduler Phase D WFQ and Phase E SchedulingContext have landed per docs/changelog.md (Phase D closed 2026-05-10), and Phase F is delivering one-SQ-consumer, nohz telemetry counters, and housekeeping/deferred-work placement. Within Phase F, the following have landed:
- the first increment of automatic nohz activation, closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md (per the scheduler bullet in docs/tasks/README.md);
- SQPOLL-driven auto-nohz activation, closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md: a ring-coupled kernelSqpoll lease whose bound ring is in SQPOLL running/sleeping mode with a live owner is admitted for tick suppression, with the SQPOLL ring-state re-check as the decisive rollback gate;
- the CpuIsolationLease preflight, which performs real per-CPU periodic-tick suppression for the narrow single-runnable-entity window with fail-closed rollback;
- timeout-based auto-revoke and generic full-nohz for ordinary budgeted compute leases.

The nohz/realtime counter families above describe the target monitoring surface for those signals. The kernel may already maintain some counters internally as Phase F lands them, but until the narrow read-only stats caps (SchedStats / RingStats and friends) and a userspace metrics service ship, those counters are scheduler-internal facts and are not yet exported through a monitoring cap. The metrics service is not authority to trigger nohz mode changes; it observes counters under the authority rules in this proposal.
Metric labels such as mode, kind, and reason must be fixed enums, not
free-form strings:
enum NoHzKind {
    Idle,
    KernelSqpoll,
    AutoCompute,
    AutoUserspacePoller,
    RealtimeIsland,
}

enum TickSuppressionMode {
    Idle,
    SqpollNoHz,
    AutoNoHz,
    RealtimeIsland,
}

enum NoHzExitReason {
    TimerDeadline,
    Ipi,
    DeviceIrq,
    SecondRunnable,
    NetworkForcedPeriodic,
    DeferredWork,
    LeaseRevoked,
    ClocksourceUnsafe,
    DebugWatchdog,
}
Future metric schemas should add enum variants through reviewed ABI changes rather than accepting arbitrary labels.
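Fixed enum labels make the series count a compile-time constant. The sketch below stores one counter per NoHzExitReason variant in a fixed array; the COUNT constant, the NoHzExitCounts type, and its methods are assumptions for illustration, while the variant list is copied from the enum above.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum NoHzExitReason {
    TimerDeadline,
    Ipi,
    DeviceIrq,
    SecondRunnable,
    NetworkForcedPeriodic,
    DeferredWork,
    LeaseRevoked,
    ClocksourceUnsafe,
    DebugWatchdog,
}

impl NoHzExitReason {
    /// Number of variants; must be updated together with the enum.
    pub const COUNT: usize = 9;

    /// Array index for this label (variants are numbered from 0).
    pub fn index(self) -> usize {
        self as usize
    }
}

/// One counter per label value; no allocation, no string keys.
pub struct NoHzExitCounts {
    counts: [u64; NoHzExitReason::COUNT],
}

impl NoHzExitCounts {
    pub fn new() -> Self {
        Self { counts: [0; NoHzExitReason::COUNT] }
    }

    pub fn record(&mut self, reason: NoHzExitReason) {
        self.counts[reason.index()] += 1;
    }

    pub fn get(&self, reason: NoHzExitReason) -> u64 {
        self.counts[reason.index()]
    }

    /// Upper bound on exported series for this metric family.
    pub fn series(&self) -> usize {
        self.counts.len()
    }
}
```

A free-form reason string would let any producer grow the series set at runtime; here adding a label requires changing the enum, which is the reviewed ABI change the text asks for.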
Avoid per-method, per-cap-id, per-transport-id, or per-user high-cardinality metrics by default. Those belong in short-lived traces or scoped logs.
Benchmark outputs follow the same cardinality rule. A completed, validated
benchmark run may import a small summary such as latest median, p95, sample
count, and pass/fail status for a named benchmark profile. Raw samples,
transcripts, host/QEMU configuration, correctness evidence, and comparison
tables are benchmark artifacts, not always-on monitoring metrics. Running a
profile that needs measure, debug taps, broad status readers, or other
diagnostic authority should emit an audit record because the act of measuring
can expose timing and topology data that ordinary services should not see.
Ring as Black Box
The first concrete monitoring milestone is the completed docs/tasks/README.md
Ring-as-Black-Box item. The visible milestone was achieved by commit da5f5e9
at 2026-04-24 03:13 UTC:
- define a bounded capture format for SQE/CQE records;
- export capture through a QEMU-only debug path;
- build a host-side viewer that decodes records and capnp payloads when payload capture is authorized;
- add one failing-call smoke whose captured log can be inspected offline.
This buys immediate debugging value without committing to durable audit, network export, service restart policy, or replay semantics.
This is inspection, not record/replay. Replay requires stronger determinism, payload retention, timer/input modeling, and capability-state checkpoints.
Capture path cost. The capture cap (working name RingTap) is
feature-gated (cfg(feature = "debug_tap") analogous to measure). Every
armed tap imposes a serializing fan-out on dispatch; keeping it out of the
default kernel feature set prevents always-on cost. Arming a tap is itself
an auditable event — the tapped process and the audit log observe it —
and tap grants respect move-semantics so a tap cannot be silently cloned
past its intended holder. Payload-capturing taps require a separately
leased cap distinct from metadata-only capture because payloads may
contain secrets.
Health and Supervision
Health and restart policy should live with supervisors, not in a central kernel daemon.
Each supervisor owns:
- a narrowed ProcessSpawner;
- child ProcessHandle caps;
- the cap bundle needed to restart its subtree;
- optional Health caps exported by children;
- a LogSink and AuditWriter for its own decisions.
Status services aggregate supervisor-reported health. They should distinguish:
- no process exists;
- process exists but never reported ready;
- process is alive and ready;
- process is alive but degraded;
- process exited normally;
- process failed and supervisor is backing off;
- process was intentionally stopped or draining.
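The seven cases above can be derived from three independent inputs: what the process table says, what the service last reported, and what the supervisor intends. The following sketch is one possible classification, not the status service's actual logic. All type names are hypothetical, and treating exit code 0 as "normal" and any other code as "failed and backing off" is an assumption of this example.

```rust
/// Kernel-visible process fact.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ProcessFact {
    Absent,
    Running,
    Exited(i64),
}

/// Last state reported by the service through its Health cap.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Reported {
    Nothing,
    Ready,
    Degraded,
    Draining,
}

/// Supervisor-owned intent.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Intent {
    Run,
    Stop,
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ServiceView {
    NoProcess,
    NeverReady,
    Ready,
    Degraded,
    ExitedNormally,
    FailedBackingOff,
    StoppedOrDraining,
}

/// Combine liveness, reported health, and intent into one status view.
pub fn classify(process: ProcessFact, reported: Reported, intent: Intent) -> ServiceView {
    match (process, reported, intent) {
        // Intentional stop or drain wins over liveness.
        (ProcessFact::Running, Reported::Draining, _) | (_, _, Intent::Stop) => {
            ServiceView::StoppedOrDraining
        }
        (ProcessFact::Absent, _, Intent::Run) => ServiceView::NoProcess,
        (ProcessFact::Running, Reported::Nothing, Intent::Run) => ServiceView::NeverReady,
        (ProcessFact::Running, Reported::Ready, Intent::Run) => ServiceView::Ready,
        (ProcessFact::Running, Reported::Degraded, Intent::Run) => ServiceView::Degraded,
        // Assumption: exit code 0 is a normal exit, anything else a failure.
        (ProcessFact::Exited(0), _, Intent::Run) => ServiceView::ExitedNormally,
        (ProcessFact::Exited(_), _, Intent::Run) => ServiceView::FailedBackingOff,
    }
}
```

The match is exhaustive, so adding a new reported state forces the classification to be revisited at compile time.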
Restart authority should be a separate ServiceSupervisor cap. A read-only
SystemStatus cap must not be able to restart anything.
Audit Integration
Audit should share infrastructure with logging only at the storage or transport layer. Its semantics are different.
Audit producers:
- AuthorityBroker for policy decisions and leased grants;
- supervisors for restarts and service lifecycle actions;
- session manager for session creation and logout;
- kernel or status service for cap transfer/release/revocation summaries when those events become part of the exported authority graph;
- recovery tools for repair actions.
Audit readers are scoped:
- a user can read records for its own session;
- an operator can read a service subtree;
- a recovery or security role can read broader streams after policy approval.
Audit entries must avoid secrets and payload dumps. They should record object identity, service identity, policy decision summaries, and capability interface classes rather than raw data.
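One way to picture scoped audit readers is a record that carries only identities and decision summaries, plus a scope that filters what a reader may see. The sketch below is an assumption-laden illustration: AuditRecord, its fields, AuditScope, and read_scoped are hypothetical names, and the slash-separated service path is an invented convention for expressing a subtree.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Decision {
    Granted,
    Denied,
}

/// Identities and a decision summary only: no payload bytes, no secrets.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AuditRecord {
    pub session: u64,
    pub service_path: String,          // e.g. "net/dhcp" (assumed convention)
    pub object_id: u64,
    pub interface_class: &'static str, // e.g. "LogReader"
    pub decision: Decision,
}

/// What a particular audit reader is allowed to see.
pub enum AuditScope {
    Session(u64),    // a user reading its own session
    Subtree(String), // an operator reading one service subtree
    Broad,           // recovery or security role after policy approval
}

impl AuditScope {
    pub fn permits(&self, rec: &AuditRecord) -> bool {
        match self {
            AuditScope::Session(id) => rec.session == *id,
            AuditScope::Subtree(prefix) => {
                // Match the subtree root itself or any path below it,
                // but not siblings that merely share a string prefix.
                rec.service_path == *prefix
                    || rec
                        .service_path
                        .strip_prefix(prefix.as_str())
                        .map_or(false, |rest| rest.starts_with('/'))
            }
            AuditScope::Broad => true,
        }
    }
}

/// Iterate over only the records the scope permits.
pub fn read_scoped<'a>(
    log: &'a [AuditRecord],
    scope: &'a AuditScope,
) -> impl Iterator<Item = &'a AuditRecord> + 'a {
    log.iter().filter(move |r| scope.permits(r))
}
```

Filtering at the reader wrapper means a session-scoped reader never observes that records for other sessions exist.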
Security and Backpressure
Monitoring must not become the easiest denial-of-service path.
Required controls:
- Per-process log token buckets, matching the Security Verification Track S.9 diagnostic aggregation design.
- Suppression summaries for repeated invalid submissions.
- Fixed-size ring buffers with explicit dropped counts.
- Maximum record size for logs, events, crash records, and traces.
- Bounded formatting outside interrupt context.
- No heap allocation in timer or panic paths.
- No unbounded metric label creation from user-controlled strings.
- Payload tracing disabled by default.
- Redaction rules at producer boundaries and at reader wrappers.
- Capability-scoped readers; no unauthenticated “debug all” endpoint.
When pressure forces dropping, preserve first-observation diagnostics and later summaries. Losing detailed logs is acceptable; corrupting scheduler progress or blocking the kernel on log I/O is not.
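The per-process token bucket and suppression summary from the list above could look roughly like this. It is a sketch only: LogBucket, Admit, the tick-based refill, and all parameters are hypothetical and are not taken from the Track S.9 design. It uses integer arithmetic and no heap allocation, in line with the constraints listed.

```rust
/// Hypothetical per-process log rate limiter.
pub struct LogBucket {
    capacity: u32,
    tokens: u32,
    refill_per_tick: u32,
    last_tick: u64,
    suppressed: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Admit {
    /// Record accepted; `suppressed_before` is how many records were
    /// dropped since the last accepted one (the suppression summary).
    Accept { suppressed_before: u32 },
    /// Record dropped; it is only counted.
    Drop,
}

impl LogBucket {
    pub fn new(capacity: u32, refill_per_tick: u32) -> Self {
        Self {
            capacity,
            tokens: capacity,
            refill_per_tick,
            last_tick: 0,
            suppressed: 0,
        }
    }

    pub fn admit(&mut self, now_tick: u64) -> Admit {
        // Refill according to elapsed ticks, clamped to the capacity.
        let elapsed = now_tick.saturating_sub(self.last_tick);
        self.last_tick = self.last_tick.max(now_tick);
        let refill = elapsed
            .saturating_mul(u64::from(self.refill_per_tick))
            .min(u64::from(self.capacity)) as u32;
        self.tokens = self.tokens.saturating_add(refill).min(self.capacity);

        if self.tokens == 0 {
            self.suppressed = self.suppressed.saturating_add(1);
            return Admit::Drop;
        }
        self.tokens -= 1;
        // Hand the accumulated drop count to the caller and reset it.
        let suppressed_before = core::mem::take(&mut self.suppressed);
        Admit::Accept { suppressed_before }
    }
}
```

A sink using this would emit one summary record ("N records suppressed") whenever suppressed_before is nonzero, which keeps the first observation and a later count while discarding the flood in between.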
Relationship to Existing Proposals
- Service Architecture: monitoring services are ordinary userspace services spawned by init or supervisors. Logging policy and service topology stay out of the pre-init kernel path.
- Shell: the native and agent shell should receive scoped SystemStatus and LogReader caps in daily profiles, not global supervisor authority.
- User Identity and Policy: AuthorityBroker mints scoped readers and leased supervisor caps based on session policy; AuditLog records the decisions.
- Error Handling: transport errors and CapException payloads are monitoring signals, but retry policy remains userspace.
- Authority Accounting: resource ledgers provide the first metrics substrate and define quota/backpressure boundaries.
- Security and Verification: hostile-input tests should cover log flood aggregation and bounded diagnostic paths. Each new monitoring boundary (kernel stats caps, log/metrics/trace/audit services, scheduler nohz telemetry exports) must be carried into the docs/proposals/security-and-verification-proposal.md Track S.7 trust-boundary inventory before downstream services rely on it; the inventory is the canonical record that a boundary has been reviewed, not this proposal.
- Live Upgrade: health, audit, and service status become prerequisites for credible upgrade orchestration.
- System Performance Benchmarks: benchmark runners may read scoped status and metrics before and after a run, but benchmark artifacts and OS-comparison reports live outside the always-on metrics service. Only low-cardinality, validated summaries should be imported into monitoring.
Implementation Plan
- Document the model. Keep monitoring as a future architecture proposal and do not disturb the current manifest-executor milestone.
- Ring as Black Box. Completed by commit da5f5e9 at 2026-04-24 03:13 UTC: bounded SQE/CQE capture, host-side decoding, and one failing-call smoke form the first useful monitoring artifact.
- Userspace log service. (Phase 1 landed.) LogSink/LogReader schemas plus LogRecord/LogFilter exist (additive ordinals, reusing LogLevel as the severity type). A bounded drop-oldest kernel ring (kernel/src/cap/log.rs) backs both caps: the sink stamps the monotonic tick, drops records below the boot-seeded SystemConfig.logLevel threshold (accepted = false), bounds record size, and forwards accepted records to serial; the reader returns cursor/filtered records with nextCursor and a dropped overflow count. Scoped LogSink/LogReader caps are granted to children at spawn; make run-monitoring-log-smoke proves the drop, the read-back, and the reader-side minLevel filter. Remaining: the wider Severity (with critical), the correlation fields (subjectRef/sessionRef/serviceRef/transportId), per-process token buckets / suppression summaries, and persistent retention.
- Narrow kernel stats caps and SystemStatus. Add the narrow read-only caps (SchedStats, FrameStats, RingStats, CapTableStats, EndpointStats, CrashSnapshot) as bounded snapshot surfaces. A userspace SystemStatus service composes the ones it needs and exposes scoped wrappers to shells and operator tools. Leave ProcessInspector out of this step: it belongs with process-management authority, not monitoring.
- Metrics snapshots. Add fixed counters and gauges for ring, scheduler, resource, log, and trace state. Keep labels static until a cardinality policy exists.
- Health and supervisor status. Add Health and read-only supervisor status once restart policy and exported service caps are concrete. Keep restart authority in separate ServiceSupervisor caps.
- Audit path. Add append-only audit records for broker decisions, cap grants, releases, revocations, restarts, and recovery actions. Start serial or memory backed; move to storage once the storage substrate exists.
- Crash records. Preserve bounded panic/fault metadata across the current boot where possible; later store records durably.
- Device, network, and storage metrics. Add driver metrics only after those drivers exist: interrupts, DMA/bounce usage, queue depth, RX/TX/drop/error counts, block latency, and reset events.
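The log-service step in the plan above describes a bounded drop-oldest ring with threshold enforcement at the sink and cursor-based, level-filtered reads. The following host-side model illustrates that behavior; it is not the code in kernel/src/cap/log.rs. The LogRing, Record, and ReadResult types, the MAX_MSG_CHARS limit, and the choice to report dropped as "records this reader missed since its cursor" are all assumptions made for the sketch.

```rust
use std::collections::VecDeque;

/// Assumed per-record size bound, in characters.
const MAX_MSG_CHARS: usize = 64;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Debug,
    Info,
    Warn,
    Error,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Record {
    pub seq: u64,
    pub level: Level,
    pub msg: String,
}

#[derive(Debug)]
pub struct ReadResult {
    pub records: Vec<Record>,
    pub next_cursor: u64,
    pub dropped: u64,
}

pub struct LogRing {
    cap: usize,
    threshold: Level,
    next_seq: u64,
    records: VecDeque<Record>,
}

impl LogRing {
    pub fn new(cap: usize, threshold: Level) -> Self {
        assert!(cap > 0, "ring capacity must be nonzero");
        Self { cap, threshold, next_seq: 0, records: VecDeque::with_capacity(cap) }
    }

    /// Sink side: returns whether the record was accepted.
    pub fn write(&mut self, level: Level, msg: &str) -> bool {
        if level < self.threshold {
            return false; // below the configured threshold
        }
        if self.records.len() == self.cap {
            self.records.pop_front(); // drop-oldest on overflow
        }
        let msg: String = msg.chars().take(MAX_MSG_CHARS).collect();
        self.records.push_back(Record { seq: self.next_seq, level, msg });
        self.next_seq += 1;
        true
    }

    /// Reader side: records at or after `cursor` with level >= `min_level`.
    pub fn read(&self, cursor: u64, min_level: Level) -> ReadResult {
        let oldest = self.records.front().map_or(self.next_seq, |r| r.seq);
        // Sequence numbers the reader asked for that were already overwritten.
        let dropped = oldest.saturating_sub(cursor);
        let records = self
            .records
            .iter()
            .filter(|r| r.seq >= cursor && r.level >= min_level)
            .cloned()
            .collect();
        ReadResult { records, next_cursor: self.next_seq, dropped }
    }
}
```

A reader that feeds next_cursor back into its next read sees each surviving record once and learns how many it lost, which is the explicit dropped-count behavior the backpressure rules ask for.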
Non-Goals
- No global /proc or /sys equivalent with ambient read access.
- No kernel-resident dashboard, alert manager, text search, or policy engine.
- No programmable kernel tracing language in the first monitoring design.
- No promise of durable log retention before storage exists.
- No default payload tracing.
- No service restart authority bundled into ordinary read-only status caps.
- No network export path until networking and policy can constrain it.
Open Questions
- Should KernelDiagnostics expose snapshots only, or also a bounded event cursor?
- What is the minimum timestamp model before wall-clock time exists?
- Should log records carry local cap IDs, stable object IDs, or only interface and service metadata by default?
- How should schema-aware trace decoding find schemas before a full SchemaRegistry exists?
- Which crash fields are safe to expose to non-recovery sessions?
- What retention policy is acceptable before persistent storage?
- Should MetricsReader use typed structs for each subsystem instead of generic name/value samples?
- Where should remote monitoring export fit once network transparency exists: a dedicated exporter service, capnp-rpc forwarding, or storage replication?
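To make the MetricsReader question concrete, the two shapes can be set side by side. This sketch does not answer the question: RingMetrics, its fields, Sample, and the metric names are hypothetical. It shows only that a typed struct can be flattened into generic samples whose names are compile-time strings, which keeps label cardinality fixed.

```rust
/// Option A: one typed struct per subsystem (field names are invented).
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct RingMetrics {
    pub sqe_submitted: u64,
    pub cqe_completed: u64,
    pub cqe_errors: u64,
}

/// Option B: a generic name/value sample. Using `&'static str` for the
/// name means no sample name can be built from user-controlled input.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Sample {
    pub name: &'static str,
    pub value: u64,
}

impl RingMetrics {
    /// Flatten the typed snapshot into a fixed-size array of samples.
    pub fn samples(&self) -> [Sample; 3] {
        [
            Sample { name: "ring.sqe_submitted", value: self.sqe_submitted },
            Sample { name: "ring.cqe_completed", value: self.cqe_completed },
            Sample { name: "ring.cqe_errors", value: self.cqe_errors },
        ]
    }
}
```

Under this arrangement the typed struct is the source of truth and the generic form is a derived view, so either reader interface could be offered without a second set of counters.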
Cross-References
This proposal is reader-facing target design. The canonical trackers for the observability-adjacent risks and verification obligations it depends on live elsewhere:
- docs/proposals/security-and-verification-proposal.md Track S.7 – Stage-6-aware refresh owns the trust-boundary inventory that any new monitoring boundary (kernel stats caps, log/metrics/trace/audit services, scheduler nohz telemetry exports, payload-capturing taps) must be carried into before downstream services rely on it. Track S.7 already lists the active scheduler-evolution surfaces (Phase D WFQ, Phase E SchedulingContext, Phase F one-SQ-consumer and nohz telemetry) plus the WASI host-adapter Phase W.4 entropy/argv boundary as inventory items to carry forward.
- docs/design-risks-register.md R12 – Verification coverage is partial, not full proof is the canonical caveat for any monitoring claim that could be read as a verified property. Bounded Kani/Loom/Miri/proptest coverage plus the panic-surface inventory are not whole-system functional refinement; monitoring records and audit entries describing security-relevant decisions must respect that distinction in their wording.
- docs/design-risks-register.md Q9 – CPU accounting and scheduling contexts is the canonical answer for the CPU-time, weighted-vruntime, and SchedulingContext budget/donation/depletion semantics that monitoring metrics should observe rather than redefine. The nohz/realtime counter families in this proposal target the same surfaces; cross-service donation policy, full nohz activation, isolation leases, and fairness across principals remain proposal-shaped per Q9 and are tracked in docs/proposals/scheduler-evolution-proposal.md and docs/backlog/scheduler-evolution.md.
Adjacent risk-register entries observed by monitoring but owned elsewhere
include R4 (Resource accounting fragmentation, source of the
ResourceLedger metrics substrate), R8 (Networking lives inside the kernel
TCB, gating exporter-service placement), and R11 (Pre-auth and post-auth
share a shell process, gating who may receive scoped LogReader /
SystemStatus / AuditLog readers).