Proposal: System Performance Benchmarks
How capOS should benchmark system performance against other operating systems without producing misleading numbers, rewarding special-case optimizations, or treating speed as a substitute for correct capability behavior.
Problem
capOS already has smoke tests, QEMU boot proofs, ring-tap debugging, and a
measure feature for focused cycle measurements. Those are necessary, but they
do not answer the product-level question: can capOS remain effective on common
workloads while preserving its capability model?
Generic OS benchmark suites are useful but dangerous in this project. Most assume POSIX process, file, pipe, socket, and shell semantics. capOS should not fake broad ambient Unix authority just to run a familiar benchmark. It also should not compare a capability-native path against Linux, FreeBSD, or a microkernel by publishing a single blended score that hides unsupported semantics, incorrect outputs, or different isolation boundaries.
The benchmark system needs to produce three kinds of evidence:
- Primitive cost: capability calls, IPC, scheduling, park waits, VM changes, process creation, memory copy, and later device I/O.
- Common workload adequacy: database, compression, build, network, storage, shell/session, service graph, and runtime workloads that users recognize.
- Correctness under load: workload outputs, service boundaries, capability denial paths, and data integrity must remain correct while performance is measured.
Current State
Implemented measurement and comparison hooks:
- `make run-measure` builds a separate measurement kernel feature and boots `system-measure.cue`.
- `kernel/src/measure.rs` records benchmark-only dispatch counters and cycle segments for ring processing, SQE validation, cap lookup, Cap'n Proto encode/decode, method body dispatch, CQE posting, and waiter wake checks.
- The measurement manifest grants `ring-nop` a measurement-only `NullCap` and `ParkBench` capability through `ProcessSpawner`.
- `demos/ring-nop` measures `CAP_OP_NOP`, empty and small `NullCap` calls, and compact-versus-generic park-shaped operations.
- `demos/thread-lifecycle` measures private `ParkSpace` failed wait, empty wake, wait-to-block, wake-to-runnable, and wake-to-resume paths.
- `make run-smoke`, `make run-spawn`, `make run-net`, and focused service smokes provide correctness and user-visible behavior proofs, but they do not yet emit structured performance results.
That is enough for local dispatch decisions. It is not enough for comparing capOS with Linux, FreeBSD, seL4-based systems, Genode scenarios, or other OS baselines on common workloads.
Design Principles
- Correctness gates first. A benchmark result is publishable only when the workload’s output verifier passes and capOS-specific authority checks still hold.
- No semantic laundering. Unsupported POSIX features are reported as unsupported or not applicable, not silently emulated through broad authority.
- Benchmark artifacts are not normal metrics. Always-on monitoring may expose low-cost counters. Benchmark logs, raw samples, host configuration, and per-run outputs are retained as explicit benchmark artifacts.
- Compare like mechanisms where possible. Compare capOS capability IPC to Linux pipes, Unix domain sockets, io_uring, or futexes only when the semantic differences are declared in the result.
- Use common suites as references, not design masters. lmbench, UnixBench, fio, iperf3, SQLite speedtest, Phoronix/OpenBenchmarking profiles, and SPEC CPU are valuable precedent. capOS should adopt their methodology where it fits and reject assumptions that would distort capOS.
- Publish raw context. Results include kernel commit, manifest, QEMU command, CPU model, host OS, compiler, build flags, feature flags, warmup, run count, and raw logs.
- Separate hosted and native comparisons. Early capOS runs in QEMU. Compare against Linux/FreeBSD guests under the same QEMU/KVM envelope, and separately against native host OS runs when the question is absolute hardware performance.
- Regression gates are narrower than claims. CI gates should catch local regressions in stable paths. Public OS comparisons need controlled machines, repeated runs, and manual review.
- Security posture is part of the result. A fast result that requires a broader cap bundle, disabled validation, payload tracing, or a special kernel build must be labeled as such.
- No single score. capOS should publish a matrix of workload results and ratios, not an aggregate score that implies all workloads matter equally.
Benchmark Tiers
Tier 0: Existing Correctness Smokes
Tier 0 is not a performance suite. It is the mandatory correctness floor:
- default boot/login/shell smoke;
- focused spawn, shell, terminal, credential, login, chat, adventure, revocable-read, memory-object, ringtap, networking, and measurement smokes;
- host tests for config, ring Loom, capos-lib, mkmanifest, generated code, and runtime surface checks.
No performance result should be retained when the relevant Tier 0 proof fails.
Tier 1: capOS-Native Primitive Benchmarks
These benchmarks measure the cost of capOS mechanisms directly:
| Area | Initial measurements | Correctness condition |
|---|---|---|
| Ring transport | CAP_OP_NOP, empty NullCap, small payload NullCap, CQE post | expected CQE result, no overflow, bounded dropped count |
| Cap dispatch | cap lookup, generation rejection, revoked cap rejection, invalid method | correct CAP_ERR_* or CapException |
| IPC | endpoint CALL/RECV/RETURN round trip, direct handoff, transfer copy/move | reply payload and transferred-cap identity match oracle |
| Park/threading | failed wait, timeout, wake-one, wake-many, wake-to-resume | waiter count and join status match oracle |
| Scheduler | context switch latency, timer wake latency, direct IPC handoff latency | no runnable-thread loss or unexpected starvation |
| Process lifecycle | spawn, ELF load, wait, failed spawn rejection | child output and exit code match manifest oracle |
| VM/memory | map/protect/unmap, MemoryObject map, frame allocation/free | data visibility, W^X, quota, and cleanup checks pass |
| Terminal/session | readLine/write latency and throughput under foreground ownership | echo/cancellation/stale-input checks pass |
These are capOS results first. Linux or FreeBSD baselines can use matching
native mechanisms, but the report must describe the mapping. For example, a
capOS endpoint IPC round trip can be compared with Linux pipe, Unix-domain
socket, eventfd, or futex ping-pong results, but none is a perfect semantic
match.
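As an illustration of that mapping caveat, a minimal sketch of one Linux-side baseline, a Unix-domain-socket ping-pong, is shown below. The transport choice, one-byte payload, warmup count, and iteration count are illustrative assumptions for a reference-guest harness, not a fixed definition, and the result is a Linux IPC number rather than a semantic equivalent of a capOS endpoint round trip.

```rust
// Hypothetical Linux-guest baseline: Unix-domain-socket ping-pong round trips.
// This measures one Linux IPC mechanism; it is *not* semantically identical to
// a capOS endpoint CALL/RECV/RETURN, and a published result must say so.
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let (mut client, mut server) = UnixStream::pair()?;
    let iterations = 100_000usize; // illustrative; real runs pin warmup and count

    // Echo peer: read one byte, write it back, until the stream closes.
    let echo = std::thread::spawn(move || {
        let mut byte = [0u8; 1];
        while server.read_exact(&mut byte).is_ok() {
            if server.write_all(&byte).is_err() {
                break;
            }
        }
    });

    // Warmup round trips before timing (caches, scheduler placement).
    let mut byte = [0u8; 1];
    for _ in 0..1_000 {
        client.write_all(&byte)?;
        client.read_exact(&mut byte)?;
    }

    let start = Instant::now();
    for _ in 0..iterations {
        client.write_all(&byte)?;
        client.read_exact(&mut byte)?;
    }
    let elapsed = start.elapsed();

    drop(client); // close the stream so the echo thread sees EOF and exits
    echo.join().expect("echo thread panicked");

    println!(
        "uds ping-pong: {} round trips, {:.1} ns/rt (mean)",
        iterations,
        elapsed.as_nanos() as f64 / iterations as f64
    );
    Ok(())
}
```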
Tier 2: Translated OS Microbenchmarks
lmbench and UnixBench are useful because they isolate OS primitives such as system-call overhead, process creation, context switching, pipes, networking, and filesystem reads. They are also Unix-shaped.
capOS should implement a capos-osbench harness that translates the benchmark
intent into capability-native operations:
- `fork`/`exec`/`wait` intent becomes `ProcessSpawner.spawn` plus `ProcessHandle.wait`.
- pipe throughput/context switching becomes Endpoint or a future byte-stream or socket capability round trip, labeled by transport.
- `getpid` syscall overhead becomes a minimal kernel fact cap or `CAP_OP_NOP`, labeled as “capOS ring entry” rather than “POSIX syscall”.
- file reread and mmap benchmarks remain unsupported until Store/Namespace and file-backed mappings exist.
- networking tests map to `TcpSocket`/`TcpListener` once the Telnet and socket capability work lands.
The translated suite must emit not_applicable for missing capability
subsystems instead of adding compatibility shims that change the OS being
measured.
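A minimal sketch of how the harness could keep that honesty property is shown below; the `OsBenchIntent` and `TranslatedRun` types, the subsystem names, and the mappings are hypothetical illustrations of the intended shape, not an existing capos-osbench API.

```rust
// Hypothetical capos-osbench shape: each translated intent either maps to a
// capability-native operation or is reported as not applicable. Names are
// illustrative, not an existing API.
#[derive(Debug)]
enum OsBenchIntent {
    ProcessCreate, // fork/exec/wait intent -> ProcessSpawner.spawn + ProcessHandle.wait
    PipePingPong,  // pipe/context-switch intent -> endpoint or byte-stream round trip
    NullRingEntry, // getpid-style overhead -> CAP_OP_NOP, labeled "capOS ring entry"
    FileReread,    // requires Store/Namespace; unsupported today
    MmapReread,    // requires file-backed mappings; unsupported today
    TcpRoundTrip,  // requires TcpSocket/TcpListener
}

#[derive(Debug)]
enum TranslatedRun {
    /// The intent maps onto an existing capOS mechanism; `transport` labels it.
    Supported { transport: &'static str },
    /// The subsystem does not exist yet; no shim, no number, just the reason.
    NotApplicable { missing: &'static str },
}

fn translate(intent: &OsBenchIntent) -> TranslatedRun {
    use OsBenchIntent::*;
    match intent {
        ProcessCreate => TranslatedRun::Supported { transport: "ProcessSpawner.spawn" },
        PipePingPong => TranslatedRun::Supported { transport: "endpoint round trip" },
        NullRingEntry => TranslatedRun::Supported { transport: "CAP_OP_NOP ring entry" },
        FileReread => TranslatedRun::NotApplicable { missing: "Store/Namespace" },
        MmapReread => TranslatedRun::NotApplicable { missing: "file-backed mappings" },
        TcpRoundTrip => TranslatedRun::NotApplicable { missing: "TcpSocket/TcpListener" },
    }
}

fn main() {
    use OsBenchIntent::*;
    for intent in [ProcessCreate, PipePingPong, NullRingEntry, FileReread, MmapReread, TcpRoundTrip] {
        println!("{:?} -> {:?}", intent, translate(&intent));
    }
}
```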
Tier 3: Portable Common Workloads
These benchmarks answer whether capOS is useful on recognizable work:
| Workload | Candidate benchmark | capOS prerequisite | Result verifier |
|---|---|---|---|
| SQLite database | SQLite speedtest1, optionally via a Phoronix profile on reference OSes | C runtime or native port, Store/Namespace or RAM-backed DB | SQLite exit status, optional SQL result checksum |
| OLTP database | TPC-C/TPC-E-inspired profile, not an official TPC result until disclosure and durability rules are met | durable Store/block I/O, SQL/database stack, transaction integrity, terminal/client driver model | committed transaction counts, invariant checks, ACID/error-injection proof |
| Decision-support database | TPC-H/TPC-DS-inspired profile at declared scale factors, not an official TPC result until rules are met | SQL/query engine, bulk data load, durable or explicitly memory-backed storage, query result verifier | query answer hashes, load status, scale factor, refresh/query stream status |
| Key-value serving | YCSB-style read/update/scan/insert mixes | Store/Namespace, KV service, stable client driver | operation counts, latency distribution, value/hash verifier |
| Storage engine | RocksDB/LevelDB db_bench-style fill/read/overwrite/seek profiles | file/store semantics, fsync/sync policy, storage engine port | key/value integrity, database reopen, configured write durability |
| Compression | xz, zstd, or small native compressor corpus | C/Rust userspace runtime and file/store access | compressed output hash and decompression hash |
| Build/developer workload | small Rust/C package build, later IX package build | process spawning, Store/Namespace, toolchain support | output artifact hash and build log status |
| Network throughput | iperf3-equivalent TCP stream and request/response latency | TcpSocket, network harness | byte count, JSON/structured summary, peer checksum |
| Storage I/O | fio-equivalent sequential/random read/write, verify mode | block device, Store/Namespace, direct I/O policy | fio-style verify/checksum result |
| File service | SPECstorage-inspired workload profile | network filesystem or capOS file-service equivalent, durable storage, client load generation | throughput, response time, data integrity |
| Java/server runtime | SPECjbb 2015 or Renaissance-inspired profiles | JVM or Java compatibility profile, timers, threads, networking/storage as needed | benchmark verifier and SLA/throughput summary |
| HTTP service | wrk-style request load against a capOS HTTP service | TCP, HTTP service, stable response corpus | response checksum/status mix, latency distribution, error rate |
| Cloud services | CloudSuite-inspired data caching/serving/search/web profiles | multi-service graph, storage/network/runtime support | workload-specific answer checks and service SLOs |
| Microservices | DeathStarBench/TailBench-inspired tail-latency profiles | service graph, network or local RPC, load generator, tracing/status caps | request correctness, p95/p99 latency, no unauthorized cap exposure |
| ML storage | MLPerf Storage-inspired data feeding profile | high-throughput storage path, dataset loader, accelerator or simulated training reader | records/images delivered, latency/throughput, data checksum |
| ML inference/training | MLPerf-inspired inference/training profile | model runtime, accelerator/GPU capability or CPU baseline, dataset and accuracy harness | accuracy/quality target plus throughput or time-to-train |
| Shell/session | boot-to-shell, Telnet shell, command launch latency | current shell plus terminal/socket path | transcript oracle and authority denial checks |
| Service graph | chat/adventure/resident service load | shared-service demos | scripted transcript and service identity checks |
| Runtime/library | Go/Lua/Wasm micro and app kernels | relevant runtime proposal milestones | language-level test suite or checksum oracle |
Early capOS should start with RAM-backed variants where storage is not ready, but those results must be labeled as memory-backed. A RAM-backed database result is not comparable to a Linux disk-backed SQLite result.
Industry benchmark families belong later than SQLite speedtest and simple compression/build profiles. TPC-C/TPC-E and TPC-H/TPC-DS are database-system references with strict workload, disclosure, pricing, and correctness expectations. SPEC, MLPerf, CloudSuite, TailBench, and DeathStarBench bring similar setup and disclosure obligations in their domains. capOS can use inspired profiles to exercise the same workload classes before it can make official or directly comparable claims, but reports must label them as such and state which upstream rules are not yet satisfied.
Tier 4: User-Story Benchmarks
User-story benchmarks measure complete workflows that a person, operator, or service owner would recognize. They are intentionally broader than a single primitive or portable benchmark profile, and they should be described by the user outcome they prove rather than by the current demo implementation.
Initial user stories:
| Story | Example capOS proof | Result verifier |
|---|---|---|
| Start a local session | boot to an interactive shell or terminal prompt | transcript reaches ready prompt with expected cap bundle |
| Authenticate and receive authority | anonymous session upgrades to an operator/session profile | wrong credential denied, right credential grants exact profile |
| Run a delegated task | launch a child process with a narrow cap bundle | child output, exit code, and denied extra authority match oracle |
| Use a remote terminal | host-local TCP terminal reaches the same shell/session model | connect, authenticate, run command, clean disconnect |
| Use a resident service | client talks to a long-running service through scoped authority | request/reply transcript and service-visible identity match oracle |
| Serve a network request | network-facing service handles requests while local work continues | response checksum, latency, and no unauthorized cap exposure |
| Complete a developer workflow | build or transform an artifact from declared inputs | output hash, logs, and resource profile match declared policy |
| Recover from expected failure | service fault, rejected grant, timeout, or restart path | failure is bounded, audited, and visible through status |
User-story results report latency distribution, success rate, resource usage, and authority outcome. They are the closest evidence for “effective on common workloads,” but they are not substitutes for primitive measurements when a regression appears.
Reference Operating Systems
Initial comparisons should use these environments:
| Reference | Why include it | Caveat |
|---|---|---|
| Linux guest under same QEMU/KVM flags | Stable baseline with broad benchmark support | Linux has mature drivers, filesystems, VM, scheduler, and libc |
| FreeBSD guest under same QEMU/KVM flags | Second mature Unix-like baseline, useful for POSIX-independent signal | Not every benchmark profile has equal FreeBSD support |
| Linux native host | Shows absolute host hardware ceiling | Not directly comparable to capOS-in-QEMU latency |
| seL4 or Genode reports/scenarios | Prior art for capability/microkernel IPC and service decomposition | Often not the same hardware, workload, or application stack |
The default published table should show capOS versus Linux guest first. Native host and external microkernel data belong in separate context columns, not the primary ratio.
Correctness Model
Every benchmark definition carries:
- expected input corpus hash;
- command or manifest used to run the workload;
- output verifier;
- allowed nondeterminism, such as timestamps or generated IDs;
- capOS authority profile;
- unsupported-feature policy;
- result parser version.
A result is invalid when:
- the output verifier fails;
- QEMU exits abnormally;
- the kernel panics or reports an unexpected fault;
- the benchmark had to grant broader authority than its declared profile;
- host logs show dropped records that invalidate the measurement;
- the run used a special fast path not available in the declared configuration;
- the reference OS result used a materially different workload size or dataset.
Correctness should be stored alongside the performance value. A fast failed run is not a slow successful run; it is no result.
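A minimal host-side sketch of that gate, assuming the run evidence has already been collected; the field names and reason strings are illustrative.

```rust
// Hypothetical host-side validity gate: a measurement is publishable only when
// every correctness condition holds. Field names are illustrative.
struct RunEvidence {
    verifier_passed: bool,
    qemu_exit_ok: bool,
    kernel_panic_or_fault: bool,
    authority_escalated: bool,        // run needed caps beyond its declared profile
    dropped_records: bool,            // host logs show drops that invalidate timing
    undeclared_fast_path: bool,       // special path not in the declared configuration
    workload_matches_reference: bool, // same dataset/size as the reference OS run
}

fn invalid_reasons(e: &RunEvidence) -> Vec<&'static str> {
    let mut reasons = Vec::new();
    if !e.verifier_passed { reasons.push("output verifier failed"); }
    if !e.qemu_exit_ok { reasons.push("QEMU exited abnormally"); }
    if e.kernel_panic_or_fault { reasons.push("kernel panic or unexpected fault"); }
    if e.authority_escalated { reasons.push("broader authority than declared profile"); }
    if e.dropped_records { reasons.push("dropped records invalidate the measurement"); }
    if e.undeclared_fast_path { reasons.push("undeclared special fast path"); }
    if !e.workload_matches_reference { reasons.push("reference workload size/dataset differs"); }
    reasons
}

/// A fast failed run is not a slow successful run; it is no result.
fn publishable(e: &RunEvidence) -> bool {
    invalid_reasons(e).is_empty()
}

fn main() {
    let evidence = RunEvidence {
        verifier_passed: true,
        qemu_exit_ok: true,
        kernel_panic_or_fault: false,
        authority_escalated: false,
        dropped_records: false,
        undeclared_fast_path: false,
        workload_matches_reference: true,
    };
    println!("publishable: {}", publishable(&evidence));
}
```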
Measurement Method
Controlled runs should use:
- fixed capOS commit, reference OS image hash, benchmark source hash, compiler version, and toolchain flags;
- fixed QEMU version, machine type, CPU model, memory size, SMP count, KVM/TCG mode, disk image type, and network backend;
- warmup runs for workloads with caches, JITs, connection setup, or first-use allocation;
- at least 5 measured runs for primitive and user-story benchmarks, more when coefficient of variation is high;
- median, min, max, standard deviation, and p95/p99 for latency where sample count supports it (see the statistics sketch after this list);
- raw logs retained for the benchmark artifact;
- no performance claim from one isolated run unless explicitly labeled as a smoke measurement.
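A minimal sketch of the summary statistics and the re-run trigger referenced in the list above; nearest-rank percentiles and a coefficient-of-variation threshold are assumptions here, not fixed policy.

```rust
// Hypothetical summary-statistics helper for repeated benchmark runs.
// Nearest-rank percentiles and the CV threshold are illustrative choices.
fn summarize(samples: &[f64]) -> Option<(f64, f64, f64, f64, f64)> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("NaN sample"));

    let n = sorted.len();
    let min = sorted[0];
    let max = sorted[n - 1];
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    };
    // Nearest-rank p95; only meaningful when the sample count supports it.
    let p95 = sorted[((0.95 * n as f64).ceil() as usize).saturating_sub(1)];

    let mean = sorted.iter().sum::<f64>() / n as f64;
    let variance = sorted.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n as f64;
    let stddev = variance.sqrt();

    Some((median, min, max, stddev, p95))
}

fn needs_more_runs(samples: &[f64], max_cv: f64) -> bool {
    // Re-run when the coefficient of variation exceeds the declared threshold.
    let mean = samples.iter().sum::<f64>() / samples.len() as f64;
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / samples.len() as f64;
    mean > 0.0 && variance.sqrt() / mean > max_cv
}

fn main() {
    let samples = [412.0, 405.0, 431.0, 409.0, 498.0];
    println!("(median, min, max, stddev, p95) = {:?}", summarize(&samples));
    println!("needs more runs: {}", needs_more_runs(&samples, 0.05));
}
```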
Cycle-counter measurements remain inside cfg(feature = "measure") and are
used for relative path decisions. Wall-clock user-story and workload
comparisons use host-side timestamps around QEMU transcripts or in-guest
monotonic timers when the timer contract is adequate.
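A minimal sketch of host-side wall-clock timing around a QEMU transcript, assuming the run is driven by an existing make target and checked against a ready marker; the target, marker string, and output handling are placeholders rather than the real harness.

```rust
// Hypothetical host-side wall-clock measurement of a boot-to-prompt user story.
// The make target, ready marker, and output handling are illustrative placeholders.
use std::process::Command;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let start = Instant::now();

    // Drive the run exactly as a smoke test would; capture the transcript.
    let output = Command::new("make")
        .arg("run-smoke") // placeholder target; a real harness pins the manifest
        .output()?;
    let elapsed = start.elapsed();

    let transcript = String::from_utf8_lossy(&output.stdout);
    let reached_prompt = transcript.contains("login:"); // placeholder ready marker

    // Correctness gates the timing: a fast run without the marker is no result.
    if output.status.success() && reached_prompt {
        println!("boot-to-prompt wall clock: {} ms", elapsed.as_millis());
    } else {
        eprintln!("run invalid: status={:?}, marker={}", output.status, reached_prompt);
    }
    Ok(())
}
```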
Result Schema
The benchmark harness should emit a structured artifact, not a free-form log:
enum BenchmarkStatus {
passed @0;
failed @1;
unsupported @2;
invalid @3;
}
struct BenchmarkResult {
runId @0 :Text;
benchmarkName @1 :Text;
tier @2 :UInt16;
status @3 :BenchmarkStatus;
correctnessId @4 :Text;
configHash @5 :Data;
artifactHash @6 :Data;
notes @7 :Text;
result :union {
measurement @8 :MeasurementSummary;
failure @9 :RunFailure;
unsupported @10 :RunFailure;
invalid @11 :RunFailure;
}
}
struct MeasurementSummary {
unit @0 :Text;
lowerIsBetter @1 :Bool;
median @2 :Float64;
p95 @3 :Float64;
samples @4 :List(Float64);
}
struct RunFailure {
reason @0 :Text;
detail @1 :Text;
}
This schema is conceptual. It should not be added to `schema/capos.capnp` until
a concrete benchmark-runner service exists; before that, host scripts can emit
JSON with the same shape. The important property is that measurement values
exist only in the passed/publishable branch; failed, unsupported, and invalid
runs carry reasons instead of zero-valued scalar defaults.
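A minimal host-side sketch of that JSON shape, assuming the serde and serde_json crates; the field names mirror the conceptual schema, hashes are carried as hex text for JSON, and the tagged outcome keeps measurement values out of failed, unsupported, and invalid runs.

```rust
// Hypothetical host-side mirror of the conceptual schema, emitted as JSON by
// CI scripts until a benchmark-runner service exists. Requires serde/serde_json.
use serde::Serialize;

#[derive(Serialize)]
#[serde(rename_all = "snake_case", tag = "status", content = "result")]
enum BenchmarkOutcome {
    Passed(MeasurementSummary),
    Failed(RunFailure),
    Unsupported(RunFailure),
    Invalid(RunFailure),
}

#[derive(Serialize)]
struct MeasurementSummary {
    unit: String,
    lower_is_better: bool,
    median: f64,
    p95: f64,
    samples: Vec<f64>,
}

#[derive(Serialize)]
struct RunFailure {
    reason: String,
    detail: String,
}

#[derive(Serialize)]
struct BenchmarkResult {
    run_id: String,
    benchmark_name: String,
    tier: u16,
    correctness_id: String,
    config_hash: String,   // hex text rather than raw Data, for JSON artifacts
    artifact_hash: String,
    notes: String,
    #[serde(flatten)]
    outcome: BenchmarkOutcome,
}

fn main() {
    // A failed run carries a reason instead of zero-valued measurement fields.
    let result = BenchmarkResult {
        run_id: "placeholder-run-id".into(),
        benchmark_name: "ring.cap_op_nop".into(),
        tier: 1,
        correctness_id: "ring-nop-smoke-v1".into(),
        config_hash: "deadbeef".into(),
        artifact_hash: "cafef00d".into(),
        notes: String::new(),
        outcome: BenchmarkOutcome::Failed(RunFailure {
            reason: "output verifier failed".into(),
            detail: "unexpected CQE result".into(),
        }),
    };
    println!("{}", serde_json::to_string_pretty(&result).unwrap());
}
```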
Integration With System Monitoring
System Monitoring should expose operational state; the benchmark system should store explicit run artifacts. The overlap is narrow:
- benchmark runs may read scoped `MetricsReader`, `SystemStatus`, `RingStats`, `SchedStats`, and later device stats before and after a run;
- benchmark summaries may be imported into a metrics service as low-cardinality gauges such as `benchmark.last_median_ms`, keyed by benchmark name and profile, after validation;
- raw samples, transcripts, QEMU logs, host environment, and correctness evidence belong in a `BenchmarkStore` or CI artifact store, not in always-on metrics;
- starting a privileged benchmark profile is an auditable event because it may require measurement-only caps, debug taps, or broad status readers;
- benchmark readers should receive scoped read-only caps, not global monitoring roots.
The existing system-monitoring-proposal.md boundary remains correct:
cycle-counter instrumentation stays behind measure, while cheap counters can
later graduate into narrow stats caps.
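A minimal sketch of that narrow bridge; the `MetricsSink` trait and label names are hypothetical, and the only load-bearing point is that unvalidated summaries never reach monitoring.

```rust
// Hypothetical bridge: only validated benchmark summaries become low-cardinality
// gauges; raw samples and transcripts stay in the benchmark artifact store.
trait MetricsSink {
    fn set_gauge(&mut self, name: &str, labels: &[(&str, &str)], value: f64);
}

struct ValidatedSummary<'a> {
    benchmark: &'a str,
    profile: &'a str,
    median_ms: f64,
    verifier_passed: bool,
}

fn import_summary(sink: &mut dyn MetricsSink, s: &ValidatedSummary) -> Result<(), &'static str> {
    // Import is gated on validation; an unverified run never reaches metrics.
    if !s.verifier_passed {
        return Err("summary not validated; keep it in the benchmark store only");
    }
    sink.set_gauge(
        "benchmark.last_median_ms",
        &[("benchmark", s.benchmark), ("profile", s.profile)],
        s.median_ms,
    );
    Ok(())
}

struct StdoutSink;
impl MetricsSink for StdoutSink {
    fn set_gauge(&mut self, name: &str, labels: &[(&str, &str)], value: f64) {
        println!("{name}{labels:?} = {value}");
    }
}

fn main() {
    let mut sink = StdoutSink;
    let summary = ValidatedSummary {
        benchmark: "ring.cap_op_nop",
        profile: "measure",
        median_ms: 0.5,
        verifier_passed: true,
    };
    import_summary(&mut sink, &summary).unwrap();
}
```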
External Grounding
Relevant local design grounding:
- `docs/build-run-test.md`
- `docs/status.md`
- `docs/proposals/system-monitoring-proposal.md`
- `docs/architecture/capability-ring.md`
- `docs/architecture/park.md`
- `docs/architecture/scheduling.md`
- `docs/research/sel4.md`
- `docs/research/zircon.md`
- `docs/research/genode.md`
- `docs/research/out-of-kernel-scheduling.md`
External sources checked:
- USENIX lmbench paper page: https://www.usenix.org/conference/usenix-1996-annual-technical-conference/lmbench-portable-tools-performance-analysis
- fio documentation: https://fio.readthedocs.io/en/master/fio_doc.html
- iperf3 documentation: https://software.es.net/iperf/
- SPEC CPU 2017 overview and run rules: https://www.spec.org/osg/cpu2017/ and https://www.spec.org/cpu2017/Docs/runrules.html
- Byte UnixBench repository: https://github.com/kdlucas/byte-unixbench
- SQLite testing documentation and OpenBenchmarking SQLite speedtest profile: https://www.sqlite.org/testing.html and https://openbenchmarking.org/test/pts/sqlite-speedtest
- TPC benchmark overview, TPC-C, TPC-H, and TPC-DS descriptions: https://www.tpc.org/information/benchmarks5.asp, https://www.tpc.org/tpcc/default5.asp, https://www.tpc.org/tpch/default5.asp, and https://www.tpc.org/tpcds/
- YCSB and storage-engine benchmark references: https://hse-project.github.io/apps/ycsb/, https://github.com/facebook/rocksdb/wiki/Benchmarking-tools, and https://github.com/google/leveldb
- SPECjbb 2015, Renaissance, and HTTP service benchmark references: https://www.spec.org/jbb2015/, https://renaissance.dev/, and https://github.com/wg/wrk
- Cloud/service benchmark references: https://github.com/parsa-epfl/cloudsuite, https://github.com/delimitrou/DeathStarBench, and https://tailbench.csail.mit.edu/
- Storage and ML benchmark references: https://www.spec.org/storage2020/, https://mlcommons.org/working-groups/benchmarks/storage/, https://mlcommons.org/benchmarks/training/, and https://docs.mlcommons.org/inference/index_gh/
- OpenBenchmarking test-suite/profile descriptions: https://openbenchmarking.org/suites/ and https://openbenchmarking.org/tests
The relevant lessons are straightforward:
- lmbench isolates OS primitives from larger application behavior and was explicitly used to compare system implementations.
- fio and iperf3 provide flexible, parameterized I/O and network workload models with machine-readable output and verification options.
- SPEC CPU’s run rules show why disclosure, correct output, and configuration control matter when publishing comparative results.
- UnixBench is useful as a historical system benchmark, but its own workload descriptions reveal Unix assumptions that capOS must translate carefully.
- SQLite speedtest is a recognizable application workload with broad public baseline data, but database benchmarking must distinguish RAM-backed and storage-backed results.
- TPC-C/TPC-E and TPC-H/TPC-DS are the right industry references for later OLTP and decision-support database claims, but capOS should treat early runs as TPC-inspired unless it can satisfy the relevant TPC rules and disclosure requirements.
- YCSB and `db_bench` are useful earlier data-system pressure tests because they can exercise key-value, read/write mix, and storage-engine behavior before capOS has a full SQL system.
- SPECjbb and Renaissance become relevant only when a Java profile exists; until then they are runtime targets, not near-term OS benchmarks.
- CloudSuite, DeathStarBench, and TailBench are good references for cloud, microservice, and tail-latency user stories, but they require a mature service graph, load generation, and workload-specific correctness checks.
- SPECstorage and MLPerf Storage are later storage references once capOS has durable storage and enough client/load infrastructure to avoid misleading fio-only claims.
- MLPerf inference/training is relevant only after model runtimes and accelerator or CPU-baseline execution are credible, and any result must carry the benchmark’s accuracy or quality target rather than only throughput.
- OpenBenchmarking/Phoronix-style test profiles are useful precedent for packaging benchmark definitions separately from result storage.
Implementation Plan
- Structured parser for current `run-measure`. Add a host parser that converts existing `measure:` and demo output lines into JSON artifacts with config hash, raw log path, and verifier status (a parser sketch follows this list).
- Primitive benchmark manifest set. Split ring, park, IPC, process, VM, and scheduler benchmarks into focused manifests so each can be repeated independently without running unrelated demos.
- Reference guest harness. Add Linux guest scripts that run equivalent primitive tests under the same QEMU/KVM settings. Keep these scripts outside the capOS boot image.
- Translated OS microbench suite. Implement `capos-osbench` for the subset of lmbench/UnixBench intents that capOS can represent honestly. Emit unsupported results for missing Store, file, mmap, and socket primitives until those subsystems exist.
- Common workload pilots. Start with workloads that can be made deterministic early: compression, SQLite speedtest against RAM-backed storage once Store exists, shell/session latency, and remote-terminal user-story latency after the current milestone.
- Network and storage workloads. Add iperf3/fio-equivalent profiles only after socket and block/storage capabilities exist. Use verification modes for write workloads.
- Benchmark store and monitoring bridge. Add a `BenchmarkStore` service or CI artifact convention. Import only validated summary values into monitoring metrics, and audit privileged benchmark starts.
- Regression gates. Add narrow CI thresholds for stable primitive paths. Use review-only warnings for noisy or hardware-dependent workloads until enough history exists.
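A minimal sketch of the first step, referenced from the list above. It assumes `measure:` lines of the form `measure: <name> <cycles>`; the real output format may differ, so this sketches the artifact structure rather than the final parser rules.

```rust
// Hypothetical parser for `measure:` lines in a run-measure transcript. The
// assumed line shape is "measure: <name> <cycles>"; the real format may differ.
use std::collections::BTreeMap;

fn parse_measure_lines(transcript: &str) -> BTreeMap<String, Vec<u64>> {
    let mut samples: BTreeMap<String, Vec<u64>> = BTreeMap::new();
    for line in transcript.lines() {
        let Some(rest) = line.trim().strip_prefix("measure:") else { continue };
        let mut fields = rest.split_whitespace();
        let (Some(name), Some(value)) = (fields.next(), fields.next()) else { continue };
        if let Ok(cycles) = value.parse::<u64>() {
            samples.entry(name.to_string()).or_default().push(cycles);
        }
    }
    samples
}

fn main() {
    // Placeholder transcript; a real run would read the QEMU log file instead.
    let transcript = "measure: ring.cap_op_nop 412\nmeasure: ring.cap_op_nop 405\n";
    for (name, cycles) in parse_measure_lines(transcript) {
        // Emit one JSON-ish record per counter; config hash and raw log path
        // would be attached by the surrounding harness.
        println!("{{\"name\":\"{}\",\"samples\":{:?}}}", name, cycles);
    }
}
```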
Reporting Format
Published reports should include:
- executive table with benchmark, status, unit, capOS median, Linux guest median, ratio, and notes;
- separate sections for primitive, common workload, and user-story results;
- correctness summary with failed/unsupported/invalid runs;
- configuration appendix with hashes and QEMU commands;
- raw artifact links;
- explicit warning for benchmark-only builds, debug tap runs, or special caps.
Do not publish a capOS “system score.” The useful output is a workload matrix with enough context to explain the result.
Non-Goals
- No POSIX compatibility layer purely to run Unix benchmarks.
- No public comparison that treats unsupported workloads as zero performance.
- No single aggregate score.
- No benchmark-only fast paths in normal dispatch builds.
- No always-on cycle-counter tracing.
- No network result publication before the network path has correctness and authority proofs.
- No storage result publication before write verification and crash/error semantics are defined.
Open Questions
- Which Linux primitive baselines should be first-class: pipe, Unix socket, futex, eventfd, io_uring, or all of them?
- Should the benchmark store be a capOS service, a host CI artifact convention, or both?
- What variance threshold should turn a benchmark from a CI gate into a review-only signal?
- How should reference OS images be pinned and distributed without bloating the repository?
- What is the earliest honest SQLite storage profile: RAM-only, MemoryObject backed, Store-backed, or block-backed?
- Should benchmark definitions be modeled as manifest fragments, host-side YAML/JSON, or capOS service objects?