Proposal: Capability-Native System Monitoring

How capOS should expose logs, metrics, health, traces, crash records, and service status without introducing global /proc, ambient log access, or a privileged monitoring daemon that bypasses the capability model.

Problem

The current system is observable mostly through serial output, QEMU exit status, smoke-test lines, CQE error codes, and a small measurement-only build feature. That is enough for early kernel work, but it is not enough for a system whose claims depend on service decomposition, explicit authority, restart policy, auditability, and later cloud operation.

Monitoring is also not harmless. A monitoring service can reveal capability topology, service names, scoped subject references, transport metadata, timing, crash context, request payloads, and security decisions. If capOS imports a Unix-style “read everything under /proc” or “global syslog” model, monitoring becomes an ambient authority escape hatch. If it imports a kernel-programmable tracing model too early, it adds a large privileged execution surface before the basic service graph is stable.

The design target is narrower: make operational state visible through typed, attenuable capabilities. A process should observe only the services, logs, and signals it was granted authority to inspect.

Current State

Implemented signal sources:

Kernel diagnostics are printed through COM1 serial via kprintln!, timestamped with the PIT tick counter. Panic and fault paths use a mutex-free emergency serial writer.
Userspace logging currently goes through the kernel Console capability, backed directly by serial and bounded per call.
A Phase 1 capability log surface has landed: LogSink/LogReader over a bounded drop-oldest kernel ring (kernel/src/cap/log.rs), with SystemConfig.logLevel drop enforcement at the sink, serial forwarding of accepted records, and scoped sink/reader caps granted at spawn (proof: make run-monitoring-log-smoke). Metrics, status, health, traces, crash records, the narrow kernel stats caps, and persistent retention remain future phases.
Runtime panics can use an emergency console path, then exit with a fixed code.
Capability-ring CQEs carry structured transport results, including negative CAP_ERR_* values and serialized CapException payloads.
The ring tracks cq_overflow, corrupted SQ/CQ recovery, and bounded SQE dispatch, but these facts are not exported as normal metrics.
ProcessSpawner and ProcessHandle.wait expose basic child lifecycle observation, but restart policy, health checks, and exported-cap lifecycle are future work.
capos-lib::ResourceLedger tracks cap slots, outstanding calls, scratch bytes, and frame grants, but only as local accounting state.
The measure feature adds benchmark-only counters and TSC helpers for controlled make run-measure boots.
SystemConfig.logLevel exists in the schema, is printed at boot, and drives drop enforcement at the Phase 1 LogSink, but there is no routing or retention policy behind it.
An AuditLog capability exists in the schema and kernel (kernel/src/cap/audit_log.rs), used by AuthorityBroker to record auth, setup, session, broker, and shell-launch events. It currently writes to serial via kprintln!; there is no ring-buffer reader cap or persistent retention yet.
A HardwareAuditLog capability with a bounded volatile ring buffer and drain/snapshot readers exists for DMA/MMIO/Interrupt cap lifecycle events (kernel/src/cap/hardware_audit.rs), including sequence numbers and dropped-record counts. A userspace hardware-audit-service drains it into a Store/Namespace-backed hash-chained segment ring and exposes scoped HardwareAuditReader snapshots; the current backing StoreCap is RAM-backed, so post-reboot retention is still a storage-backend concern.
The hardware_release_log module (kernel/src/cap/hardware_release_log.rs) emits DMA pool, DMA buffer, DeviceMmio, and Interrupt release outcomes to serial; it has no reader cap or retention yet.

That means the system has useful raw signals and partial audit infrastructure but lacks a unified capability-shaped monitoring architecture with log routing, metrics export, and reader caps for most signal classes.

Design Principles

Observation is authority. Reading logs, status, metrics, traces, crash records, or audit entries requires a capability.
No global monitoring root. SystemStatus(all), LogReader(all), and ServiceSupervisor(all) are powerful caps. Normal sessions receive scoped wrappers.
Kernel facts, userspace policy. The kernel may expose bounded facts about processes, rings, resources, and faults. Retention, filtering, aggregation, health semantics, restart policy, and user-facing views belong in userspace.
Separate signal classes. Logs, metrics, lifecycle events, traces, health, crash records, and audit logs have different readers, retention rules, and security properties.
Bounded by construction. Every producer path has a byte, entry, or time budget. Loss is explicit and summarized.
Payload capture is exceptional. Default tracing records headers, interface IDs, method IDs, sizes, and result codes, plus scoped transport identifiers only when authorized. Capturing method payloads needs a stronger cap because payloads may contain secrets.
Serial remains emergency plumbing. Early boot, panic, and recovery still need direct serial output. Normal services should receive log caps rather than broad Console.
Audit is not debug logging. Audit records security-relevant decisions and capability lifecycle events. It is append-only from producers and exposed through scoped readers.
Pull by default, push when justified. Status and metrics are pull-shaped (reader polls a snapshot cap). Logs, lifecycle events, crash records, and audit entries are push-shaped (producer calls into a sink). Traces are pull with an explicit arm/drain lifecycle because capture is expensive. Each direction has its own cap surface; do not generalize one shape to cover all signals.
Narrow kernel stats caps over one god-cap. The kernel exposes bounded facts through several small read-only caps (ring, scheduler, resource ledger, frames, endpoints, caps, crash) rather than one KernelDiagnostics that grants everything. Narrow caps let an init-owned status service be assembled by composition, and let a broker lease a subset to an operator without handing over the rest.

Signal Taxonomy

Logs

Human-oriented diagnostic records:

severity, component, service name, pid, optional subject/service reference, monotonic timestamp, message text;
rate-limited at producer and log service boundaries;
suitable for serial forwarding, ring-buffer retention, and later storage;
not a source of truth for security decisions.

Metrics

Low-cardinality numeric state:

per-process ring SQ/CQ occupancy, cq_overflow, invalid SQE counts, opcode counts, transport error counts;
scheduler runnable/blocked counts, direct IPC handoffs, cap-enter timeouts, process exits;
resource ledger usage: cap slots, outstanding calls, scratch bytes, frame grants, endpoint queue occupancy, VM mapped pages;
heap/frame allocator pressure;
later device, network, storage, and CPU-time counters.

Metric shape is fixed to three forms:

Counter — monotonic u64, reset only by reboot. Cumulative semantics make aggregation composable.
Gauge — i64 that moves both ways. Used for queue depths, free-frame counts, mapped-page counts.
Histogram — fixed bucket layout carried in the descriptor, u64 per bucket. Used for ring-dispatch duration, context-switch latency, IPC RTT.

Richer shapes (top-k tables, exponential histograms) are emitted as opaque typed payloads through the producer-scoped envelope described under “Core Interfaces”; the generic reader treats them as data, and a schema-aware viewer decodes them. Metrics should be snapshots or monotonic counters, not unbounded label streams.
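The three fixed shapes can be sketched in host-side Rust as follows. This is a minimal illustration, not the capOS implementation: the type names, the saturating arithmetic, and the separate overflow bucket are assumptions made for the example.

```rust
// Hypothetical in-memory forms of the three metric shapes. Names are
// illustrative and not part of any capOS schema.

/// Monotonic counter: only ever increases (saturating here by assumption).
#[derive(Debug, Default)]
pub struct Counter(u64);

impl Counter {
    pub fn add(&mut self, n: u64) {
        self.0 = self.0.saturating_add(n);
    }
    pub fn get(&self) -> u64 {
        self.0
    }
}

/// Gauge: signed value that moves both ways.
#[derive(Debug, Default)]
pub struct Gauge(i64);

impl Gauge {
    pub fn set(&mut self, v: i64) {
        self.0 = v;
    }
    pub fn adjust(&mut self, delta: i64) {
        self.0 = self.0.saturating_add(delta);
    }
    pub fn get(&self) -> i64 {
        self.0
    }
}

/// Histogram with a fixed bucket layout. `bounds` holds inclusive upper
/// bounds in ascending order; `overflow` counts values above the last bound.
#[derive(Debug)]
pub struct Histogram<const N: usize> {
    bounds: [u64; N],
    buckets: [u64; N],
    overflow: u64,
}

impl<const N: usize> Histogram<N> {
    pub fn new(bounds: [u64; N]) -> Self {
        Self { bounds, buckets: [0; N], overflow: 0 }
    }
    pub fn observe(&mut self, value: u64) {
        match self.bounds.iter().position(|&b| value <= b) {
            Some(i) => self.buckets[i] += 1,
            None => self.overflow += 1,
        }
    }
    pub fn buckets(&self) -> (&[u64; N], u64) {
        (&self.buckets, self.overflow)
    }
}
```

Because the bucket layout is a const-generic array, memory use is fixed at construction time, which matches the rule that metrics are not unbounded label streams.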

Events

Discrete lifecycle facts:

process spawned, started, exited, waited, killed, or failed to load;
service declared healthy, unhealthy, restarting, quiescing, or upgraded;
endpoint queue overflow, cancellation, disconnected holder, transfer rollback;
resource quota rejection;
device reset, interrupt storm, link up/down, block I/O error once devices exist.

Events are useful for supervisors and status views. They may also feed logs.

Traces

Bounded high-detail capture for debugging:

SQE/CQE records around one pid, service subtree, endpoint, cap id, or error class;
optional capnp payload capture only with explicit authority;
offline schema-aware viewer for reproducing and explaining a failure;
short retention by default.

This is the Ring as Black Box milestone from docs/tasks/README.md, not full replay.

Health

Declared service state:

ready, starting, degraded, draining, failed, stopped;
last successful health check and last failure reason;
dependency health summaries;
supervisor-owned restart intent and backoff state.

Health is not inferred only from process liveness. A process can be alive and unhealthy, or intentionally draining and still useful.

Crash Records

Panic, exception, and fatal userspace runtime records:

boot stage, current pid if known, fault vector, RIP/CR2/error code where applicable, recent SQE context when safe, and last serial line cursor;
bounded, redacted, and readable through a crash/debug capability;
serial fallback remains mandatory when no reader exists.

Audit

Security and policy records:

session creation, approval request, policy decision, cap grant, cap transfer, cap release/revocation, denial, declassification/relabel operation;
no raw authentication proofs, private keys, bearer tokens, or full environment dumps;
query access is scoped by session, service subtree, or operator role.

ITU-T X.700 Series Alignment

The ITU-T X.700 Systems Management framework (OSI management) predates modern observability stacks by two decades but still offers a cleaner decomposition than ad-hoc log/metric/trace categorization. capOS is not implementing CMIS/CMIP (X.710/X.711 assume ASN.1 BER over an OSI stack capOS will never speak); the value is the signal taxonomy and field model, not the transport.

capOS signal class	Closest ITU-T	What we take from it
Logs	X.735 Log control function	Log record identity (moRef analog = `component`+`pid`+`service_ref`), severity mapping, scoped reader model.
Metrics	X.739 Metric objects and attributes	Fixed metric shapes (counter / gauge / histogram) as opposed to open-ended label streams.
Events	X.734 Event report management function	Discriminator-driven filtering, event-type taxonomy, producer/consumer separation.
Alarms (events)	X.733 Alarm reporting function	Perceived severity (cleared/indeterminate/warning/minor/major/critical), probable cause, specific problem, trend indication, proposed repair action.
Health	X.731 State management function	Operational / administrative / usage state model (enabled/disabled, unlocked/locked, idle/active/busy) feeding `HealthState`.
Audit	X.740 Security audit trail function	Audit record field model: event type, time, initiator, target, outcome, evidence chain.
Crash records	X.733 + X.736 Security alarm reporting function	Structured cause + severity for fatal/integrity events; security-relevant crashes flow through both the crash cap and the audit cap.

FCAPS coverage. X.700/X.701 define the five management functional areas: Fault, Configuration, Accounting, Performance, Security. This proposal covers Fault (crash records, alarms), Performance (metrics), and Security (audit). Configuration and Accounting are deliberately out of scope here:

Configuration management (X.700 “C”) — versioned, signed configuration deltas applied to running services. Partially covered by cloud-metadata-proposal.md (ManifestDelta) but capOS has no general configuration-management proposal yet. Candidate for a separate proposal once the manifest-executor and live-upgrade work stabilize.
Accounting management (X.700 “A”) — per-principal, per-session, per-service resource-usage ledgers with retention and export. The kernel’s ResourceLedger is the lowest layer; aggregation, persistence, and audit-grade usage records are undesigned. Candidate for a separate proposal; would compose with the audit cap and the user-identity session model.

Updated Field Mappings

LogRecord maps roughly onto X.735 logRecord:

X.735 logRecord                    capOS LogRecord
---------------                    ---------------
logRecordId                        (cursor + pid + tick)
managedObjectClass                 component + service name
managedObjectInstance              pid + service_ref
eventType                          Severity (lossy; add explicit
                                    eventType once alarm/security
                                    records share the pipe)
eventTime                          tick (monotonic; wall-clock when
                                    available)
notificationIdentifier             not modeled; add when events need
                                    correlation IDs

Audit records should adopt X.740 fields explicitly. Proposed schema extension once the audit service ships:

enum AuditEventType {
  # X.740 §6.1 event categories, pruned to what capOS actually records.
  authentication    @0;   # login, logout, auth failure
  accessControl     @1;   # grant, deny, revoke, transfer
  policyDecision    @2;   # broker decision with plan + constraints
  objectLifecycle   @3;   # capability create/destroy, object reap
  securityAlarm     @4;   # X.736-shaped: integrity/confidentiality violation
  serviceControl    @5;   # restart, upgrade, quiesce, resume
  administrative    @6;   # manifest update, role change
}

enum AuditOutcome {
  success           @0;
  failure           @1;
  denied            @2;
  pending           @3;   # multi-party approval outstanding
}

struct AuditRecord {
  tick        @0 :UInt64;
  eventType   @1 :AuditEventType;
  initiator   @2 :Data;        # opaque principal/session ID
  target      @3 :Text;        # interface + service identity
  outcome     @4 :AuditOutcome;
  reason      @5 :Text;
  evidence    @6 :Data;        # opaque, bounded; no secrets
}

Alarms (X.733) are a structured subset of Events, not a new signal class. The ServiceStatus / Health path emits alarms when degraded, failed, or security-relevant thresholds trip:

enum PerceivedSeverity {
  cleared        @0;
  indeterminate  @1;
  warning        @2;
  minor          @3;
  major          @4;
  critical       @5;
}

enum ProbableCause {
  # X.733 Annex A lists ~50 values; capOS starts with the handful that
  # match known failure modes and extends as needed.
  communicationsError    @0;
  integrityViolation     @1;
  operationalViolation   @2;
  softwareError          @3;
  underlyingResourceUnavailable @4;
  qualityOfServiceAlarm  @5;
  securityAlarmIntegrity @6;
  securityAlarmAccess    @7;
}

enum AlarmTrend {
  # X.733 trend indication values.
  lessSevere  @0;
  noChange    @1;
  moreSevere  @2;
}

struct Alarm {
  tick            @0 :UInt64;
  managedObject   @1 :Text;           # service or cap identity
  severity        @2 :PerceivedSeverity;
  probableCause   @3 :ProbableCause;
  specificProblem @4 :Text;
  trend           @5 :AlarmTrend;
  proposedRepair  @6 :Text;
}

The taxonomy buys two things the Unix-style “syslog + Prometheus + Jaeger” tower does not: (1) alarms as a first-class signal with a defined severity lattice and probable-cause field, which is how operators actually triage, and (2) audit as a distinct record type with fixed fields rather than a convention-layer over free-form log messages.

ITU-T references

ITU-T Rec. X.700 (09/92) — Management framework
ITU-T Rec. X.701 (08/97) — Systems management overview
ITU-T Rec. X.733 (02/92) — Alarm reporting function
ITU-T Rec. X.734 (09/92) — Event report management function
ITU-T Rec. X.735 (09/92) — Log control function
ITU-T Rec. X.736 (01/92) — Security alarm reporting function
ITU-T Rec. X.740 (01/92) — Security audit trail function
ITU-T Rec. X.731 (01/92) — State management function
ITU-T Rec. X.739 (11/93) — Metric objects and attributes

Proposed Architecture

flowchart TD
    Kernel[Kernel primitives] --> KD[Narrow kernel stats caps]
    Kernel --> Serial[Emergency serial]

    Init[init / root supervisor] --> LogSvc[Log service]
    Init --> MetricsSvc[Metrics service]
    Init --> StatusSvc[Status service]
    Init --> AuditSvc[Audit log]
    Init --> TraceSvc[Trace capture service]

    KD --> MetricsSvc
    KD --> StatusSvc
    KD --> TraceSvc

    Services[Services and drivers] --> LogSink[Scoped LogSink caps]
    Services --> Health[Health caps]
    Services --> AuditWriter[Scoped AuditWriter caps]

    LogSink --> LogSvc
    Health --> StatusSvc
    AuditWriter --> AuditSvc

    Broker[AuthorityBroker] --> Readers[Scoped readers]
    Readers --> Shell[Shell / agent / operator tools]

    StatusSvc --> Readers
    LogSvc --> Readers
    MetricsSvc --> Readers
    TraceSvc --> Readers
    AuditSvc --> Readers

The important property is that there is no ambient monitoring namespace. The graph is assembled by init and supervisors. Readers are capabilities, not paths.

Core Interfaces

These are conceptual interfaces. They should not be added to schema/capos.capnp until the current manifest-executor work is complete and a specific implementation slice needs them.

enum Severity {
  debug @0;
  info @1;
  warn @2;
  error @3;
  critical @4;
}

struct LogRecord {
  tick @0 :UInt64;
  severity @1 :Severity;
  component @2 :Text;
  pid @3 :UInt32;
  subjectRef @4 :Data;   # privacy-preserving subject/session correlation
  sessionRef @5 :Data;   # optional scoped session correlation
  serviceRef @6 :Data;   # optional authorized service/component correlation
  transportId @7 :Data;  # debug-only ring/endpoint metadata, not identity
  message @8 :Text;
}

struct LogFilter {
  minSeverity @0 :Severity;
  componentPrefix @1 :Text;
  pid @2 :UInt32;
  includeDebug @3 :Bool;
}

interface LogSink {
  write @0 (record :LogRecord) -> ();
}

interface LogReader {
  read @0 (cursor :UInt64, maxRecords :UInt32, filter :LogFilter)
      -> (records :List(LogRecord), nextCursor :UInt64, dropped :UInt64);
}

LogSink is what ordinary services receive. LogReader is what shells, operators, supervisors, and diagnostic tools receive. A scoped reader can filter to one service subtree or session before the caller ever sees the record.
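The cursor, nextCursor, and dropped fields of LogReader.read can be illustrated with a host-side sketch of a drop-oldest ring. This is not the kernel/src/cap/log.rs code; LogRing, Record, ReadResult, and the use of std collections are assumptions made so the example is self-contained.

```rust
use std::collections::VecDeque;

// Illustrative only: a drop-oldest ring with cursor-based reads.
#[derive(Clone, Debug, PartialEq)]
pub struct Record {
    pub seq: u64,
    pub message: String,
}

pub struct LogRing {
    capacity: usize,
    next_seq: u64,             // sequence number of the next accepted record
    records: VecDeque<Record>, // oldest record at the front
}

pub struct ReadResult {
    pub records: Vec<Record>,
    pub next_cursor: u64,
    pub dropped: u64, // records the caller missed because they were evicted
}

impl LogRing {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        Self { capacity, next_seq: 0, records: VecDeque::with_capacity(capacity) }
    }

    pub fn write(&mut self, message: &str) {
        if self.records.len() == self.capacity {
            self.records.pop_front(); // drop-oldest keeps memory bounded
        }
        self.records.push_back(Record { seq: self.next_seq, message: message.to_string() });
        self.next_seq += 1;
    }

    pub fn read(&self, cursor: u64, max_records: usize) -> ReadResult {
        let oldest = self.records.front().map_or(self.next_seq, |r| r.seq);
        // Clamp the cursor into the retained window.
        let start = cursor.max(oldest).min(self.next_seq);
        let dropped = start.saturating_sub(cursor);
        let records: Vec<Record> = self
            .records
            .iter()
            .filter(|r| r.seq >= start)
            .take(max_records)
            .cloned()
            .collect();
        let next_cursor = records.last().map_or(start, |r| r.seq + 1);
        ReadResult { records, next_cursor, dropped }
    }
}
```

A reader that falls behind learns how many records it missed from dropped instead of silently seeing a gap, which is the "loss is explicit" rule applied to logs.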

Monitoring terminology should use snake_case names in prose and map them to schema-style fields only at the Cap’n Proto boundary:

subject_ref / session_ref:
  privacy-preserving identity or session correlation fields.

service_ref:
  service instance or component correlation where the reader is authorized.

transport_id:
  debug-only ring, endpoint, SQE/CQE, or waiter metadata; never subject
  identity.

Legacy endpoint badge terminology must not leak into user-facing monitoring identity. If a low-level transport path still stores a badge-shaped selector, monitoring may expose it only as debug transport_id under an appropriate diagnostic cap, not as subject_ref, session_ref, or service_ref.
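One way to keep a badge-shaped selector from leaking into identity fields is to give each field its own type. The sketch below is hypothetical: SubjectRef, TransportId, ReaderScope, LogView, and view_for are invented for illustration, and the field widths are not capOS ABI.

```rust
// Hypothetical newtypes; field widths are assumptions.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct SubjectRef(pub Vec<u8>);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct TransportId(pub u64);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ReaderScope {
    Normal,
    Diagnostic,
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct LogView {
    pub subject_ref: Option<SubjectRef>,
    pub transport_id: Option<TransportId>,
    pub message: String,
}

/// Build the view a reader is allowed to see. A raw badge can only ever
/// become a `TransportId`, and only for a diagnostic reader.
pub fn view_for(
    scope: ReaderScope,
    subject: Option<SubjectRef>,
    badge: Option<u64>,
    message: &str,
) -> LogView {
    let transport_id = match scope {
        ReaderScope::Diagnostic => badge.map(TransportId),
        ReaderScope::Normal => None,
    };
    LogView { subject_ref: subject, transport_id, message: message.to_string() }
}
```

Since no conversion from u64 to SubjectRef exists in this sketch, putting a badge into subject_ref is a compile error, not something left to code review.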

struct ProcessStatus {
  pid @0 :UInt32;
  serviceName @1 :Text;
  state @2 :Text;
  capSlotsUsed @3 :UInt32;
  capSlotsMax @4 :UInt32;
  outstandingCalls @5 :UInt32;
  cqReady @6 :UInt32;
  cqOverflow @7 :UInt64;
  lastExitCode @8 :Int64;
}

struct ServiceStatus {
  name @0 :Text;
  health @1 :Text;
  pid @2 :UInt32;
  restartCount @3 :UInt32;
  lastError @4 :Text;
}

interface SystemStatus {
  listProcesses @0 () -> (processes :List(ProcessStatus));
  listServices @1 () -> (services :List(ServiceStatus));
  service @2 (name :Text) -> (status :ServiceStatus);
}

SystemStatus is read-only. A broad instance can see the system; wrappers can expose one service, one supervision subtree, or one session.
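Attenuation by wrapping can be sketched as a trait plus a filtering wrapper. The Rust trait below only mirrors part of the conceptual SystemStatus interface. BroadStatus, ScopedStatus, and the slash-separated service naming are assumptions for illustration.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct ServiceStatus {
    pub name: String,
    pub health: String,
    pub pid: u32,
}

pub trait SystemStatus {
    fn list_services(&self) -> Vec<ServiceStatus>;
    fn service(&self, name: &str) -> Option<ServiceStatus>;
}

/// Stand-in for the broad, system-wide status source.
pub struct BroadStatus {
    pub services: Vec<ServiceStatus>,
}

impl SystemStatus for BroadStatus {
    fn list_services(&self) -> Vec<ServiceStatus> {
        self.services.clone()
    }
    fn service(&self, name: &str) -> Option<ServiceStatus> {
        self.services.iter().find(|s| s.name == name).cloned()
    }
}

/// Wrapper that exposes one supervision subtree and nothing else.
pub struct ScopedStatus<S: SystemStatus> {
    inner: S,
    subtree: String,
}

impl<S: SystemStatus> ScopedStatus<S> {
    pub fn new(inner: S, subtree: &str) -> Self {
        Self { inner, subtree: subtree.to_string() }
    }
    // In scope means the subtree itself or a name below it ("net/...").
    fn in_scope(&self, name: &str) -> bool {
        name == self.subtree
            || name
                .strip_prefix(self.subtree.as_str())
                .map_or(false, |rest| rest.starts_with('/'))
    }
}

impl<S: SystemStatus> SystemStatus for ScopedStatus<S> {
    fn list_services(&self) -> Vec<ServiceStatus> {
        self.inner
            .list_services()
            .into_iter()
            .filter(|s| self.in_scope(&s.name))
            .collect()
    }
    fn service(&self, name: &str) -> Option<ServiceStatus> {
        if self.in_scope(name) { self.inner.service(name) } else { None }
    }
}
```

The holder of a ScopedStatus cannot widen its view because it never holds the inner broad instance directly. Out-of-scope names return the same result as names that do not exist.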

enum MetricKind {
  counter @0;
  gauge @1;
  histogram @2;
}

struct MetricSample {
  # Well-known fixed-name slot for counters and gauges the aggregator
  # understands without additional schema lookup. Use this for stable
  # kernel counters to keep the hot path allocation-free.
  name @0 :Text;
  kind @1 :MetricKind;
  value @2 :Int64;
  tick @3 :UInt64;

  # Producer-scoped typed envelope for richer samples (histograms,
  # top-k tables, per-subsystem structs). Payload is a capnp message;
  # the schema is identified by `schemaHash` (capnp node id) and keyed
  # per producer. Opaque to the generic reader; a schema-aware viewer
  # decodes it.
  producerId @4 :UInt64;
  schemaHash @5 :UInt64;
  payload    @6 :Data;
}

struct MetricFilter {
  prefix @0 :Text;
  service @1 :Text;
}

interface MetricsReader {
  snapshot @0 (filter :MetricFilter, maxSamples :UInt32)
      -> (samples :List(MetricSample), truncated :Bool);
}

Early metrics should be fixed-name counters and gauges in the name/value slot. Avoid arbitrary labels until there is a concrete memory and cardinality policy. The producer-scoped envelope exists so richer samples do not force the generic reader to learn a string-key taxonomy — if a producer needs per-queue or per-device detail, it ships a typed capnp struct keyed by schemaHash rather than synthesizing name strings.

struct TraceSelector {
  pid @0 :UInt32;
  serviceName @1 :Text;
  errorCode @2 :Int32;
  includePayloadBytes @3 :Bool;
}

struct TraceRecord {
  tick @0 :UInt64;
  pid @1 :UInt32;
  opcode @2 :UInt16;
  capId @3 :UInt32;
  methodId @4 :UInt16;
  interfaceId @5 :UInt64;
  result @6 :Int32;
  flags @7 :UInt16;
  payload @8 :Data;
}

interface TraceCapture {
  arm @0 (selector :TraceSelector, maxRecords :UInt32, maxBytes :UInt32)
      -> (captureId :UInt64);
  drain @1 (captureId :UInt64, maxRecords :UInt32)
      -> (records :List(TraceRecord), complete :Bool, dropped :UInt64);
}

Payload capture should default off. A capture cap that can read payload bytes is closer to a debug privilege than a normal status cap.
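The arm/drain lifecycle and its budgets can be sketched as below. This is an assumption-heavy illustration: Capture, HEADER_BYTES, the lifetime-wide budget accounting, and the reading of complete as "no buffered records remain" are all choices made for the example, not settled semantics.

```rust
// Assumed fixed cost charged per record for header fields.
const HEADER_BYTES: usize = 32;

#[derive(Clone, Debug, PartialEq)]
pub struct TraceRecord {
    pub pid: u32,
    pub result: i32,
    pub payload: Vec<u8>, // empty unless payload capture was authorized
}

pub struct Capture {
    max_records: usize,
    max_bytes: usize,
    include_payload: bool,
    accepted: usize,   // records accepted over the capture lifetime
    used_bytes: usize, // bytes charged over the capture lifetime
    pending: Vec<TraceRecord>,
    dropped: u64,
}

impl Capture {
    pub fn arm(max_records: usize, max_bytes: usize, include_payload: bool) -> Self {
        Self {
            max_records,
            max_bytes,
            include_payload,
            accepted: 0,
            used_bytes: 0,
            pending: Vec::new(),
            dropped: 0,
        }
    }

    pub fn record(&mut self, pid: u32, result: i32, payload: &[u8]) {
        // Payload bytes are discarded at the producer unless authorized.
        let kept = if self.include_payload { payload.to_vec() } else { Vec::new() };
        let cost = HEADER_BYTES + kept.len();
        let over_records = self.accepted >= self.max_records;
        let over_bytes = self.used_bytes.saturating_add(cost) > self.max_bytes;
        if over_records || over_bytes {
            self.dropped += 1;
            return;
        }
        self.accepted += 1;
        self.used_bytes += cost;
        self.pending.push(TraceRecord { pid, result, payload: kept });
    }

    /// Returns (records, complete, dropped).
    pub fn drain(&mut self, max_records: usize) -> (Vec<TraceRecord>, bool, u64) {
        let n = max_records.min(self.pending.len());
        let out: Vec<TraceRecord> = self.pending.drain(..n).collect();
        (out, self.pending.is_empty(), self.dropped)
    }
}
```

With include_payload false, a metadata-only capture never holds payload bytes in memory, so a later bug in a reader wrapper cannot leak them.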

enum HealthState {
  starting @0;
  ready @1;
  degraded @2;
  draining @3;
  failed @4;
  stopped @5;
}

interface Health {
  check @0 () -> (state :HealthState, reason :Text);
}

interface ServiceSupervisor {
  status @0 () -> (status :ServiceStatus);
  restart @1 () -> ();
}

ServiceSupervisor is authority-changing. Normal monitoring readers should not receive it. A broker can mint a leased ServiceSupervisor(net-stack) for one operator action.

Kernel Diagnostics Contract

The kernel should expose a small read-only diagnostics surface for facts only the kernel can know:

process table snapshot: pid, state, service name if known, wait state, exit code, ring physical identity hidden or omitted;
ring snapshot: SQ/CQ head/tail-derived occupancy, overflow count, corrupted head/tail recovery counts, opcode/error counters;
resource snapshot: cap slot usage, outstanding calls, scratch reservation, frame grant pages, mapped VM pages, free frame count, heap pressure;
scheduler snapshot: tick count, current pid, run queue length, blocked count, direct IPC handoff count, timeout wake count;
crash record: last panic/fault metadata and early boot stage.

The kernel should not implement log routing, alerting, dashboards, retention policy, restart decisions, RBAC, ABAC, or text search. Those are userspace service responsibilities.

Implementation shape:

Maintain fixed-size counters in existing kernel structures where the source event already occurs.
Prefer snapshots computed from existing state over duplicate counters when the cost is bounded.
Expose snapshots through a small set of narrow read-only capabilities, not one KernelDiagnostics god-cap. The initial decomposition:
- SchedStats — tick count, current pid, run queue length, blocked count, direct IPC handoff count, cap_enter timeout/wake counts.
- FrameStats — free/used frame counts, frame-grant pages, allocator pressure histogram.
- RingStats — per-process SQ/CQ occupancy, cq_overflow, corrupted-head recovery counts, opcode counters, transport-error counters.
- CapTableStats — per-process slot occupancy, generation-rollover counts, insertion/remove rates.
- EndpointStats — per-endpoint waiter depth, RECV/RETURN match rate, abort/cancellation counts.
- CrashSnapshot — last panic/fault metadata, early boot stage, recent SQE context when safe.
Each narrow cap exposes snapshot() -> (sample :MetricSample) or a typed struct; none of them enumerates processes or reads cap tables beyond what the subsystem owns. A trusted status service composes the ones it needs; a broker leases a subset for operator sessions without the rest.
ProcessInspector (pid-scoped process table, cap-table enumeration, VM map) is a distinct, stronger cap and lives with process-management authority, not with monitoring.
Convert broad diagnostics into scoped userspace wrappers before handing them to shells or applications.
Keep panic/fault serial writes independent of any diagnostics service.

Promotion from the measure feature: the benchmark counters in kernel/src/measure.rs graduate to always-on in RingStats / SchedStats when the per-event cost is provably a single relaxed atomic add. Cycle-counter instrumentation (rdtsc/rdtscp) stays behind cfg(feature = "measure") because it is serializing and benchmark-only. The promotion threshold keeps normal dispatch builds free of instrumentation cost without forcing monitoring into a second build configuration.
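A counter that meets the "single relaxed atomic add" threshold might look like the following sketch. RingCounters and its method names are hypothetical; they are not the contents of kernel/src/measure.rs.

```rust
use core::sync::atomic::{AtomicU64, Ordering};

// Hypothetical always-on counters. Each event costs one relaxed fetch_add.
pub struct RingCounters {
    dispatches: AtomicU64,
    cq_overflow: AtomicU64,
}

impl RingCounters {
    pub const fn new() -> Self {
        Self { dispatches: AtomicU64::new(0), cq_overflow: AtomicU64::new(0) }
    }

    #[inline]
    pub fn on_dispatch(&self) {
        self.dispatches.fetch_add(1, Ordering::Relaxed);
    }

    #[inline]
    pub fn on_cq_overflow(&self) {
        self.cq_overflow.fetch_add(1, Ordering::Relaxed);
    }

    /// Returns (dispatches, cq_overflow). The two relaxed loads are not a
    /// consistent pair; each value is individually monotonic.
    pub fn snapshot(&self) -> (u64, u64) {
        (
            self.dispatches.load(Ordering::Relaxed),
            self.cq_overflow.load(Ordering::Relaxed),
        )
    }
}
```

Relaxed ordering is enough here because readers only need eventually visible monotonic totals, not ordering relative to other memory operations.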

Logging Model

Early boot has only serial. After init starts the log service, ordinary services should receive LogSink rather than raw Console unless they need emergency console access.

Recommended path:

Kernel serial remains for boot, panic, and fault records.
Init starts a userspace log service and passes scoped LogSink caps to children.
The log service forwards selected records to Console until persistent storage exists.
SystemConfig.logLevel becomes an initial policy input for which records the log service forwards and retains.
Session and operator tools receive scoped LogReader caps from a broker.

Services should not put secrets, raw capability payloads, full auth proofs, or arbitrary user input into logs without explicit redaction. Log records are data, not commands.

Metrics and Status

Status answers “what is alive and what state is it in.” Metrics answer “what is the numeric behavior over time.” Keeping them separate avoids a common failure mode where a human-readable status API grows into an unbounded metrics store.

Initial status fields should cover:

pid, service name, binary name, process state, exit code;
process handle wait state;
supervisor health and restart policy once supervision exists;
cap table occupancy and outstanding call count;
ring CQ availability and overflow;
endpoint queue occupancy where authorized.

Initial metrics should cover:

ring dispatches, SQEs processed, per-op counts, transport error counts;
cap-enter wait count, timeout count, wake count;
scheduler context switches and direct IPC handoffs;
frame free/used counts, frame grant pages, VM mapped pages;
log records accepted, suppressed, dropped, and forwarded;
trace records captured and dropped.

Timer/nohz/realtime metrics should be owned by monitoring rather than left as one-off debug prints once those features exist:

scheduler_tick_count{cpu};
ticks_suppressed{cpu,mode};
nohz_enter_count{cpu,kind};
nohz_exit_count{cpu,reason};
oneshot_deadline_miss_count;
sqpoll_busy_ns;
sqpoll_sleep_count;
deadline_expired_count;
budget_exhausted_count;
realtime_overrun_count;
donation_depth_max;
housekeeping_offload_count.

These are correctness signals for nohz/realtime admission, not only performance counters. A scoped monitoring reader may observe them only under the same authority rules as other scheduler and service telemetry.

Current state alignment. Scheduler Phase D WFQ and Phase E SchedulingContext have landed per docs/changelog.md (Phase D closed 2026-05-10), and Phase F is delivering one-SQ-consumer, nohz telemetry counters, and housekeeping/deferred-work placement. Automatic nohz activation’s first increment is now closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md (per the scheduler bullet in docs/tasks/README.md). SQPOLL-driven auto-nohz activation is also closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md: a ring-coupled kernelSqpoll lease whose bound ring is in SQPOLL running/sleeping mode with a live owner is admitted for tick suppression, with the SQPOLL ring-state re-check as the decisive rollback gate. The CpuIsolationLease preflight performs real per-CPU periodic-tick suppression for the narrow single-runnable-entity window with fail-closed rollback. Timeout-based auto-revoke and generic full-nohz for ordinary budgeted compute leases have also landed.

The nohz/realtime counter families above describe the target monitoring surface for those signals. The kernel may already maintain some counters internally as Phase F lands them, but until the narrow read-only stats caps (SchedStats / RingStats and friends) and a userspace metrics service ship, those counters are scheduler-internal facts and not yet exported through a monitoring cap. The metrics service is not authority to trigger nohz mode changes; it observes counters under the authority rules in this proposal.

Metric labels such as mode, kind, and reason must be fixed enums, not free-form strings:

enum NoHzKind {
    Idle,
    KernelSqpoll,
    AutoCompute,
    AutoUserspacePoller,
    RealtimeIsland,
}

enum TickSuppressionMode {
    Idle,
    SqpollNoHz,
    AutoNoHz,
    RealtimeIsland,
}

enum NoHzExitReason {
    TimerDeadline,
    Ipi,
    DeviceIrq,
    SecondRunnable,
    NetworkForcedPeriodic,
    DeferredWork,
    LeaseRevoked,
    ClocksourceUnsafe,
    DebugWatchdog,
}

Future metric schemas should add enum variants through reviewed ABI changes rather than accepting arbitrary labels.
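Fixed enum labels make the storage for a labeled counter family a fixed-size array. The sketch below repeats NoHzExitReason so it is self-contained. The derives, the COUNT constant, and the NoHzExitCounts type are assumptions for illustration.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(usize)]
pub enum NoHzExitReason {
    TimerDeadline = 0,
    Ipi,
    DeviceIrq,
    SecondRunnable,
    NetworkForcedPeriodic,
    DeferredWork,
    LeaseRevoked,
    ClocksourceUnsafe,
    DebugWatchdog,
}

impl NoHzExitReason {
    /// Number of variants; must be updated together with the enum.
    pub const COUNT: usize = 9;
}

/// One u64 slot per reason: memory is bounded by the enum, not by input.
pub struct NoHzExitCounts {
    counts: [u64; NoHzExitReason::COUNT],
}

impl NoHzExitCounts {
    pub const fn new() -> Self {
        Self { counts: [0; NoHzExitReason::COUNT] }
    }

    pub fn record(&mut self, reason: NoHzExitReason) {
        self.counts[reason as usize] += 1;
    }

    pub fn get(&self, reason: NoHzExitReason) -> u64 {
        self.counts[reason as usize]
    }
}
```

A free-form reason string would need a map keyed by caller-supplied data. The enum index keeps the label space closed and reviewable.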

Avoid per-method, per-cap-id, per-transport-id, or per-user high-cardinality metrics by default. Those belong in short-lived traces or scoped logs.

Benchmark outputs follow the same cardinality rule. A completed, validated benchmark run may import a small summary such as latest median, p95, sample count, and pass/fail status for a named benchmark profile. Raw samples, transcripts, host/QEMU configuration, correctness evidence, and comparison tables are benchmark artifacts, not always-on monitoring metrics. Running a profile that needs measure, debug taps, broad status readers, or other diagnostic authority should emit an audit record because the act of measuring can expose timing and topology data that ordinary services should not see.

Ring as Black Box

The first concrete monitoring milestone is the Ring-as-Black-Box item in docs/tasks/README.md, which is now complete. The visible milestone was achieved by commit da5f5e9 at 2026-04-24 03:13 UTC. Its scope was to:

define a bounded capture format for SQE/CQE records;
export capture through a QEMU-only debug path;
build a host-side viewer that decodes records and capnp payloads when payload capture is authorized;
add one failing-call smoke whose captured log can be inspected offline.

This buys immediate debugging value without committing to durable audit, network export, service restart policy, or replay semantics.

This is inspection, not record/replay. Replay requires stronger determinism, payload retention, timer/input modeling, and capability-state checkpoints.

Capture path cost. The capture cap (working name RingTap) is feature-gated behind cfg(feature = "debug_tap"), analogous to measure. Every armed tap imposes a serializing fan-out on dispatch; keeping it out of the default kernel feature set prevents always-on cost. Arming a tap is itself an auditable event: the tapped process and the audit log observe it, and tap grants respect move semantics so a tap cannot be silently cloned past its intended holder. Payload-capturing taps require a separately leased cap distinct from metadata-only capture because payloads may contain secrets.

Health and Supervision

Health and restart policy should live with supervisors, not in a central kernel daemon.

Each supervisor owns:

a narrowed ProcessSpawner;
child ProcessHandle caps;
the cap bundle needed to restart its subtree;
optional Health caps exported by children;
a LogSink and AuditWriter for its own decisions.

Status services aggregate supervisor-reported health. They should distinguish:

no process exists;
process exists but never reported ready;
process is alive and ready;
process is alive but degraded;
process exited normally;
process failed and supervisor is backing off;
process was intentionally stopped or draining.
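The seven states above map naturally onto a closed enum. The following is a sketch only: the variant names, the label strings, and the is_serving helper are assumptions, not an existing capOS schema.

```rust
/// Supervisor-reported service state (hypothetical names, one per case above).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ServiceState {
    /// No process exists.
    Absent,
    /// Process exists but never reported ready.
    NeverReady,
    /// Process is alive and ready.
    Ready,
    /// Process is alive but degraded.
    Degraded,
    /// Process exited normally.
    ExitedNormally,
    /// Process failed and the supervisor is backing off.
    FailedBackingOff,
    /// Process was intentionally stopped or is draining.
    StoppedOrDraining,
}

impl ServiceState {
    /// True when the process is alive and answering requests, possibly degraded.
    pub fn is_serving(self) -> bool {
        matches!(self, ServiceState::Ready | ServiceState::Degraded)
    }

    /// Static, low-cardinality label suitable for a status listing.
    pub fn label(self) -> &'static str {
        match self {
            ServiceState::Absent => "absent",
            ServiceState::NeverReady => "starting",
            ServiceState::Ready => "ready",
            ServiceState::Degraded => "degraded",
            ServiceState::ExitedNormally => "exited",
            ServiceState::FailedBackingOff => "backoff",
            ServiceState::StoppedOrDraining => "stopped",
        }
    }
}
```

Keeping the set closed means a status reader can match exhaustively, and adding a state later becomes a visible schema change rather than a new free-form string.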

Restart authority should be a separate ServiceSupervisor cap. A read-only SystemStatus cap must not be able to restart anything.
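One way to picture the separation is two distinct handle types over the same supervisor state, where only one type has a mutating method. This is a single-process sketch under stated assumptions: the struct shapes, the restart-count table, and new_supervisor are hypothetical, and real capOS caps would be kernel-mediated objects rather than Rc handles.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

/// Shared supervisor state: service name -> restart count (illustrative only).
type Table = Rc<RefCell<HashMap<String, u32>>>;

/// Read-only view: exposes no method that mutates the table.
pub struct SystemStatus {
    table: Table,
}

/// Restart authority: held separately from any status reader.
pub struct ServiceSupervisor {
    table: Table,
}

impl SystemStatus {
    pub fn restart_count(&self, service: &str) -> Option<u32> {
        self.table.borrow().get(service).copied()
    }
}

impl ServiceSupervisor {
    /// Restart a known service; returns false for unknown names.
    pub fn restart(&self, service: &str) -> bool {
        let mut table = self.table.borrow_mut();
        match table.get_mut(service) {
            Some(count) => {
                *count += 1;
                true
            }
            None => false,
        }
    }

    /// Attenuate: derive a read-only status handle from the supervisor.
    pub fn status(&self) -> SystemStatus {
        SystemStatus { table: Rc::clone(&self.table) }
    }
}

pub fn new_supervisor(services: &[&str]) -> ServiceSupervisor {
    let table: HashMap<String, u32> = services.iter().map(|s| (s.to_string(), 0)).collect();
    ServiceSupervisor { table: Rc::new(RefCell::new(table)) }
}
```

Attenuation runs in one direction only: a supervisor can mint a SystemStatus, but nothing on SystemStatus leads back to restart.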

Audit Integration

Audit should share infrastructure with logging only at the storage or transport layer. Its semantics are different.

Audit producers:

AuthorityBroker for policy decisions and leased grants;
supervisors for restarts and service lifecycle actions;
session manager for session creation and logout;
kernel or status service for cap transfer/release/revocation summaries when those events become part of the exported authority graph;
recovery tools for repair actions.

Audit readers are scoped:

a user can read records for its own session;
an operator can read a service subtree;
a recovery or security role can read broader streams after policy approval.

Audit entries must avoid secrets and payload dumps. They should record object identity, service identity, policy decision summaries, and capability interface classes rather than raw data.
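A minimal sketch of what such a record and a scoped reader might look like follows. The field set, the Decision variants, and the use of a service-name prefix to stand in for a "service subtree" are all assumptions made for illustration; the record deliberately has no payload field.

```rust
/// Outcome summary for an audited action (hypothetical variants).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Decision {
    Granted,
    Denied,
    Revoked,
}

/// Audit entry carrying identities and classes, never raw request data.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct AuditRecord {
    pub seq: u64,
    pub session_id: u32,
    pub service: &'static str,
    pub interface_class: &'static str,
    pub decision: Decision,
}

/// What a reader cap is scoped to (assumed shapes).
pub enum ReaderScope<'a> {
    /// A user reading records for its own session.
    Session(u32),
    /// An operator reading one service subtree, modeled here as a name prefix.
    ServiceSubtree(&'a str),
}

/// Return only the records the given scope is allowed to see.
pub fn visible<'r>(scope: &ReaderScope, log: &'r [AuditRecord]) -> Vec<&'r AuditRecord> {
    log.iter()
        .filter(|r| match scope {
            ReaderScope::Session(id) => r.session_id == *id,
            ReaderScope::ServiceSubtree(prefix) => r.service.starts_with(*prefix),
        })
        .collect()
}
```

The broader recovery or security scope is omitted because it depends on a policy-approval step this sketch does not model.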

Security and Backpressure

Monitoring must not become the easiest denial-of-service path.

Required controls:

Per-process log token buckets, matching the Security Verification Track S.9 diagnostic aggregation design.
Suppression summaries for repeated invalid submissions.
Fixed-size ring buffers with explicit dropped counts.
Maximum record size for logs, events, crash records, and traces.
Bounded formatting outside interrupt context.
No heap allocation in timer or panic paths.
No unbounded metric label creation from user-controlled strings.
Payload tracing disabled by default.
Redaction rules at producer boundaries and at reader wrappers.
Capability-scoped readers; no unauthenticated “debug all” endpoint.
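The per-process token bucket in the first control can be sketched as follows. This is a generic token bucket driven by a monotonic tick, not the Track S.9 design itself; the field names, the refill-per-tick unit, and the dropped counter are assumptions.

```rust
/// Per-process log rate limiter (generic token bucket; parameters are illustrative).
pub struct TokenBucket {
    capacity: u32,
    tokens: u32,
    refill_per_tick: u32,
    last_tick: u64,
    dropped: u64,
}

impl TokenBucket {
    /// Start full so a process can log a short burst immediately.
    pub fn new(capacity: u32, refill_per_tick: u32) -> Self {
        Self { capacity, tokens: capacity, refill_per_tick, last_tick: 0, dropped: 0 }
    }

    /// Try to spend one token at `now_tick`; returns false and counts a drop when empty.
    pub fn try_log(&mut self, now_tick: u64) -> bool {
        // Refill for elapsed ticks, saturating so a long idle gap cannot overflow.
        let elapsed = now_tick.saturating_sub(self.last_tick);
        self.last_tick = self.last_tick.max(now_tick);
        let refill = elapsed.saturating_mul(self.refill_per_tick as u64);
        let total = (self.tokens as u64).saturating_add(refill).min(self.capacity as u64);
        self.tokens = total as u32;

        if self.tokens == 0 {
            self.dropped += 1;
            return false;
        }
        self.tokens -= 1;
        true
    }

    /// Explicit dropped count, so readers can see that suppression happened.
    pub fn dropped(&self) -> u64 {
        self.dropped
    }
}
```

The bucket uses only integer arithmetic and no allocation, which keeps it compatible with the constraints on interrupt and panic paths listed above.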

When pressure forces dropping, preserve first-observation diagnostics and later summaries. Losing detailed logs is acceptable; corrupting scheduler progress or blocking the kernel on log I/O is not.
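"First observation plus later summary" can be sketched as a small suppressor that emits the first occurrence of a diagnostic key, counts consecutive repeats silently, and reports the count when the key changes or on flush. The Emit and Suppressor types and the use of a u32 key are hypothetical.

```rust
/// What the suppressor asks the caller to log (hypothetical shape).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Emit {
    /// First time this key is seen in the current run.
    First { key: u32 },
    /// Summary of how many repeats were suppressed after the first.
    Summary { key: u32, repeats: u64 },
}

/// Collapses runs of the same diagnostic key into first + summary.
#[derive(Default)]
pub struct Suppressor {
    current: Option<u32>,
    repeats: u64,
}

impl Suppressor {
    /// Observe one event. Returns up to two items in a fixed-size array (no heap):
    /// a summary for the previous key, then the first observation of the new key.
    pub fn observe(&mut self, key: u32) -> [Option<Emit>; 2] {
        if self.current == Some(key) {
            self.repeats += 1;
            return [None, None];
        }
        let summary = self.flush();
        self.current = Some(key);
        [summary, Some(Emit::First { key })]
    }

    /// Emit a pending summary, if any repeats were suppressed.
    pub fn flush(&mut self) -> Option<Emit> {
        let key = self.current?;
        let repeats = std::mem::take(&mut self.repeats);
        if repeats == 0 {
            None
        } else {
            Some(Emit::Summary { key, repeats })
        }
    }
}
```

Because the state is two words per producer, dropping detail under pressure costs a counter increment rather than log I/O.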

Relationship to Existing Proposals

Service Architecture: monitoring services are ordinary userspace services spawned by init or supervisors. Logging policy and service topology stay out of the pre-init kernel path.
Shell: the native and agent shells should receive scoped SystemStatus and LogReader caps in daily profiles, not global supervisor authority.
User Identity and Policy: AuthorityBroker mints scoped readers and leased supervisor caps based on session policy; AuditLog records the decisions.
Error Handling: transport errors and CapException payloads are monitoring signals, but retry policy remains userspace.
Authority Accounting: resource ledgers provide the first metrics substrate and define quota/backpressure boundaries.
Security and Verification: hostile-input tests should cover log flood aggregation and bounded diagnostic paths. Each new monitoring boundary (kernel stats caps, log/metrics/trace/audit services, scheduler nohz telemetry exports) must be carried into the docs/proposals/security-and-verification-proposal.md Track S.7 trust-boundary inventory before downstream services rely on it; the inventory is the canonical record that a boundary has been reviewed, not this proposal.
Live Upgrade: health, audit, and service status become prerequisites for credible upgrade orchestration.
System Performance Benchmarks: benchmark runners may read scoped status and metrics before and after a run, but benchmark artifacts and OS-comparison reports live outside the always-on metrics service. Only low-cardinality, validated summaries should be imported into monitoring.

Implementation Plan

Document the model. Keep monitoring as a future architecture proposal and do not disturb the current manifest-executor milestone.
Ring as Black Box. Completed by commit da5f5e9 at 2026-04-24 03:13 UTC: bounded SQE/CQE capture, host-side decoding, and one failing-call smoke form the first useful monitoring artifact.
Userspace log service. (Phase 1 landed.) LogSink/LogReader schemas plus LogRecord/LogFilter exist (additive ordinals, reusing LogLevel as the severity type). A bounded drop-oldest kernel ring (kernel/src/cap/log.rs) backs both caps: the sink stamps the monotonic tick, drops records below the boot-seeded SystemConfig.logLevel threshold (accepted = false), bounds record size, and forwards accepted records to serial; the reader returns cursor/filtered records with nextCursor and a dropped overflow count. Scoped LogSink/LogReader caps are granted to children at spawn; make run-monitoring-log-smoke proves the drop, the read-back, and the reader-side minLevel filter. Remaining: the wider Severity (with critical), the correlation fields (subjectRef/sessionRef/serviceRef/transportId), per-process token buckets / suppression summaries, and persistent retention.
Narrow kernel stats caps and SystemStatus. Add the narrow read-only caps (SchedStats, FrameStats, RingStats, CapTableStats, EndpointStats, CrashSnapshot) as bounded snapshot surfaces. A userspace SystemStatus service composes the ones it needs and exposes scoped wrappers to shells and operator tools. Leave ProcessInspector out of this step — it belongs with process-management authority, not monitoring.
Metrics snapshots. Add fixed counters and gauges for ring, scheduler, resource, log, and trace state. Keep labels static until a cardinality policy exists.
Health and supervisor status. Add Health and read-only supervisor status once restart policy and exported service caps are concrete. Keep restart authority in separate ServiceSupervisor caps.
Audit path. Add append-only audit records for broker decisions, cap grants, releases, revocations, restarts, and recovery actions. Start serial or memory backed; move to storage once the storage substrate exists.
Crash records. Preserve bounded panic/fault metadata across the current boot where possible; later store records durably.
Device, network, and storage metrics. Add driver metrics only after those drivers exist: interrupts, DMA/bounce usage, queue depth, RX/TX/drop/error counts, block latency, and reset events.
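The sink/reader behaviour described in the userspace log service step can be modeled on the host as a bounded drop-oldest ring with a sequence cursor. This is not the code in kernel/src/cap/log.rs: the type names, the LogLevel variants, and the interpretation of the dropped count as "records between the cursor and the oldest retained record" are assumptions, and record-size bounding and serial forwarding are omitted.

```rust
use std::collections::VecDeque;

/// Severity, ordered by declaration (variants are illustrative).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum LogLevel {
    Debug,
    Info,
    Warn,
    Error,
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct LogRecord {
    pub seq: u64,
    pub level: LogLevel,
    pub msg: String,
}

pub struct ReadResult {
    pub records: Vec<LogRecord>,
    pub next_cursor: u64,
    pub dropped: u64,
}

/// Bounded drop-oldest ring shared by the sink and reader sides.
pub struct LogRing {
    buf: VecDeque<LogRecord>,
    capacity: usize,
    next_seq: u64,
    threshold: LogLevel,
}

impl LogRing {
    pub fn new(capacity: usize, threshold: LogLevel) -> Self {
        assert!(capacity > 0, "ring must hold at least one record");
        Self { buf: VecDeque::with_capacity(capacity), capacity, next_seq: 0, threshold }
    }

    /// Sink side: returns `accepted`. Records below the threshold are dropped.
    pub fn write(&mut self, level: LogLevel, msg: &str) -> bool {
        if level < self.threshold {
            return false;
        }
        if self.buf.len() == self.capacity {
            self.buf.pop_front(); // drop-oldest on overflow
        }
        self.buf.push_back(LogRecord { seq: self.next_seq, level, msg: msg.to_string() });
        self.next_seq += 1;
        true
    }

    /// Reader side: records at or after `cursor` that pass the min-level filter.
    pub fn read(&self, cursor: u64, min_level: LogLevel) -> ReadResult {
        let oldest = self.buf.front().map_or(self.next_seq, |r| r.seq);
        let dropped = oldest.saturating_sub(cursor);
        let records = self
            .buf
            .iter()
            .filter(|r| r.seq >= cursor && r.level >= min_level)
            .cloned()
            .collect();
        ReadResult { records, next_cursor: self.next_seq, dropped }
    }
}
```

A reader that passes the returned next_cursor into its next call sees each retained record at most once, and the dropped count tells it how many records it missed.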

Non-Goals

No global /proc or /sys equivalent with ambient read access.
No kernel-resident dashboard, alert manager, text search, or policy engine.
No programmable kernel tracing language in the first monitoring design.
No promise of durable log retention before storage exists.
No default payload tracing.
No service restart authority bundled into ordinary read-only status caps.
No network export path until networking and policy can constrain it.

Open Questions

Should KernelDiagnostics expose snapshots only, or also a bounded event cursor?
What is the minimum timestamp model before wall-clock time exists?
Should log records carry local cap IDs, stable object IDs, or only interface and service metadata by default?
How should schema-aware trace decoding find schemas before a full SchemaRegistry exists?
Which crash fields are safe to expose to non-recovery sessions?
What retention policy is acceptable before persistent storage?
Should MetricsReader use typed structs for each subsystem instead of generic name/value samples?
Where should remote monitoring export fit once network transparency exists: a dedicated exporter service, capnp-rpc forwarding, or storage replication?

Cross-References

This proposal is reader-facing target design. The canonical trackers for the observability-adjacent risks and verification obligations it depends on live elsewhere:

docs/proposals/security-and-verification-proposal.md Track S.7 – Stage-6-aware refresh owns the trust-boundary inventory that any new monitoring boundary (kernel stats caps, log/metrics/trace/audit services, scheduler nohz telemetry exports, payload-capturing taps) must be carried into before downstream services rely on it. Track S.7 already lists the active scheduler-evolution surfaces (Phase D WFQ, Phase E SchedulingContext, Phase F one-SQ-consumer and nohz telemetry) plus the WASI host-adapter Phase W.4 entropy/argv boundary as inventory items to carry forward.
docs/design-risks-register.md R12 – Verification coverage is partial, not full proof is the canonical caveat for any monitoring claim that could be read as a verified property. Bounded Kani/Loom/Miri/proptest coverage plus the panic-surface inventory are not whole-system functional refinement; monitoring records and audit entries describing security-relevant decisions must respect that distinction in their wording.
docs/design-risks-register.md Q9 – CPU accounting and scheduling contexts is the canonical answer for the CPU-time, weighted-vruntime, and SchedulingContext budget/donation/depletion semantics that monitoring metrics should observe rather than redefine. The nohz/realtime counter families in this proposal target the same surfaces; cross-service donation policy, full nohz activation, isolation leases, and fairness across principals remain proposal-shaped per Q9 and are tracked in docs/proposals/scheduler-evolution-proposal.md and docs/backlog/scheduler-evolution.md.

Adjacent risk-register entries observed by monitoring but owned elsewhere include R4 (Resource accounting fragmentation, source of the ResourceLedger metrics substrate), R8 (Networking lives inside the kernel TCB, gating exporter-service placement), and R11 (Pre-auth and post-auth share a shell process, gating who may receive scoped LogReader / SystemStatus / AuditLog readers).
