Proposal: System Performance Benchmarks
How capOS should benchmark system performance against other operating systems without producing misleading numbers, rewarding special-case optimizations, or treating speed as a substitute for correct capability behavior.
Problem
capOS already has smoke tests, QEMU boot proofs, ring-tap debugging, and a
measure feature for focused cycle measurements. Those are necessary, but they
do not answer the product-level question: can capOS remain effective on common
workloads while preserving its capability model?
Generic OS benchmark suites are useful but dangerous in this project. Most assume POSIX process, file, pipe, socket, and shell semantics. capOS should not fake broad ambient Unix authority just to run a familiar benchmark. It also should not compare a capability-native path against Linux, FreeBSD, or a microkernel by publishing a single blended score that hides unsupported semantics, incorrect outputs, or different isolation boundaries.
The benchmark system needs to produce three kinds of evidence:
- Primitive cost: capability calls, IPC, scheduling, park waits, VM changes, process creation, memory copy, and later device I/O.
- Common workload adequacy: database, compression, build, network, storage, shell/session, service graph, and runtime workloads that users recognize.
- Correctness under load: workload outputs, service boundaries, capability denial paths, and data integrity must remain correct while performance is measured.
Current State
Implemented measurement and comparison hooks:
- `make run-measure` builds a separate measurement kernel feature and boots `system-measure.cue`.
- `kernel/src/measure.rs` records benchmark-only dispatch counters and cycle segments for ring processing, SQE validation, cap lookup, Cap'n Proto encode/decode, method body dispatch, CQE posting, and waiter wake checks.
- The measurement manifest grants `ring-nop` a measurement-only `NullCap` and `ParkBench` capability through `ProcessSpawner`.
- `demos/ring-nop` measures `CAP_OP_NOP`, empty and small `NullCap` calls, and compact-versus-generic park-shaped operations.
- `demos/thread-lifecycle` measures private `ParkSpace` failed wait, empty wake, wait-to-block, wake-to-runnable, and wake-to-resume paths.
- `make run-smoke`, `make run-spawn`, `make run-net`, and focused service smokes provide correctness and user-visible behavior proofs, but they do not yet emit structured performance results.
That is enough for local dispatch decisions. It is not enough for comparing capOS with Linux, FreeBSD, seL4-based systems, Genode scenarios, or other OS baselines on common workloads.
Design Principles
- Correctness gates first. A benchmark result is publishable only when the workload’s output verifier passes and capOS-specific authority checks still hold.
- No semantic laundering. Unsupported POSIX features are reported as unsupported or not applicable, not silently emulated through broad authority.
- Benchmark artifacts are not normal metrics. Always-on monitoring may expose low-cost counters. Benchmark logs, raw samples, host configuration, and per-run outputs are retained as explicit benchmark artifacts.
- Compare like mechanisms where possible. Compare capOS capability IPC to Linux pipes, Unix domain sockets, io_uring, or futexes only when the semantic differences are declared in the result.
- Use common suites as references, not design masters. lmbench, UnixBench, fio, iperf3, SQLite speedtest, Phoronix/OpenBenchmarking profiles, and SPEC CPU are valuable precedent. capOS should adopt their methodology where it fits and reject assumptions that would distort capOS.
- Publish raw context. Results include kernel commit, manifest, QEMU command, CPU model, host OS, compiler, build flags, feature flags, warmup, run count, and raw logs.
- Separate hosted and native comparisons. Early capOS runs in QEMU. Compare against Linux/FreeBSD guests under the same QEMU/KVM envelope, and separately against native host OS runs when the question is absolute hardware performance.
- Regression gates are narrower than claims. CI gates should catch local regressions in stable paths. Public OS comparisons need controlled machines, repeated runs, and manual review.
- Security posture is part of the result. A fast result that requires a broader cap bundle, disabled validation, payload tracing, or a special kernel build must be labeled as such.
- No single score. capOS should publish a matrix of workload results and ratios, not an aggregate score that implies all workloads matter equally.
Benchmark Tiers
Tier 0: Existing Correctness Smokes
Tier 0 is not a performance suite. It is the mandatory correctness floor:
- default boot/login/shell smoke;
- focused spawn, shell, terminal, credential, login, chat, adventure, revocable-read, memory-object, ringtap, networking, and measurement smokes;
- host tests for config, ring Loom, capos-lib, mkmanifest, generated code, and runtime surface checks.
No performance result should be retained when the relevant Tier 0 proof fails.
Tier 1: capOS-Native Primitive Benchmarks
These benchmarks measure the cost of capOS mechanisms directly:
| Area | Initial measurements | Correctness condition |
|---|---|---|
| Ring transport | CAP_OP_NOP, empty NullCap, small payload NullCap, CQE post | expected CQE result, no overflow, bounded dropped count |
| Cap dispatch | cap lookup, generation rejection, revoked cap rejection, invalid method | correct CAP_ERR_* or CapException |
| IPC | endpoint CALL/RECV/RETURN round trip, direct handoff, transfer copy/move | reply payload and transferred-cap identity match oracle |
| Park/threading | failed wait, timeout, wake-one, wake-many, wake-to-resume | waiter count and join status match oracle |
| Scheduler | context switch latency, timer wake latency, direct IPC handoff latency | no runnable-thread loss or unexpected starvation |
| Process lifecycle | spawn, ELF load, wait, failed spawn rejection | child output and exit code match manifest oracle |
| VM/memory | map/protect/unmap, MemoryObject map, frame allocation/free | data visibility, W^X, quota, and cleanup checks pass |
| Terminal/session | readLine/write latency and throughput under foreground ownership | echo/cancellation/stale-input checks pass |
These are capOS results first. Linux or FreeBSD baselines can use matching
native mechanisms, but the report must describe the mapping. For example, a
capOS endpoint IPC round trip can be compared with Linux pipe, Unix-domain
socket, eventfd, or futex ping-pong results, but none is a perfect semantic
match.
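As an illustration of that mapping caveat, a minimal sketch of one Linux-side baseline, a Unix-domain-socket ping-pong, is shown below. The transport choice, one-byte payload, warmup count, and iteration count are illustrative assumptions for a reference-guest harness, not a fixed definition, and the result is a Linux IPC number rather than a semantic equivalent of a capOS endpoint round trip.

```rust
// Hypothetical Linux-guest baseline: Unix-domain-socket ping-pong round trips.
// This measures one Linux IPC mechanism; it is *not* semantically identical to
// a capOS endpoint CALL/RECV/RETURN, and a published result must say so.
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let (mut client, mut server) = UnixStream::pair()?;
    let iterations = 100_000usize; // illustrative; real runs pin warmup and count

    // Echo peer: read one byte, write it back, until the stream closes.
    let echo = std::thread::spawn(move || {
        let mut byte = [0u8; 1];
        while server.read_exact(&mut byte).is_ok() {
            if server.write_all(&byte).is_err() {
                break;
            }
        }
    });

    // Warmup round trips before timing (caches, scheduler placement).
    let mut byte = [0u8; 1];
    for _ in 0..1_000 {
        client.write_all(&byte)?;
        client.read_exact(&mut byte)?;
    }

    let start = Instant::now();
    for _ in 0..iterations {
        client.write_all(&byte)?;
        client.read_exact(&mut byte)?;
    }
    let elapsed = start.elapsed();

    drop(client); // close the stream so the echo thread sees EOF and exits
    echo.join().expect("echo thread panicked");

    println!(
        "uds ping-pong: {} round trips, {:.1} ns/rt (mean)",
        iterations,
        elapsed.as_nanos() as f64 / iterations as f64
    );
    Ok(())
}
```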
Tier 2: Translated OS Microbenchmarks
lmbench and UnixBench are useful because they isolate OS primitives such as system-call overhead, process creation, context switching, pipes, networking, and filesystem reads. They are also Unix-shaped.
capOS should implement a capos-osbench harness that translates the benchmark
intent into capability-native operations:
- `fork`/`exec`/`wait` intent becomes `ProcessSpawner.spawn` plus `ProcessHandle.wait`.
- pipe throughput/context switching becomes Endpoint or a future byte-stream or socket capability round trip, labeled by transport.
- `getpid` syscall overhead becomes a minimal kernel fact cap or `CAP_OP_NOP`, labeled as “capOS ring entry” rather than “POSIX syscall”.
- file reread and mmap benchmarks remain unsupported until Store/Namespace and file-backed mappings exist.
- networking tests map to `TcpSocket`/`TcpListener` once the Telnet and socket capability work lands.
The translated suite must emit not_applicable for missing capability
subsystems instead of adding compatibility shims that change the OS being
measured.
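A minimal sketch of how the harness could keep that honesty property is shown below; the `OsBenchIntent` and `TranslatedRun` types, the subsystem names, and the mappings are hypothetical illustrations of the intended shape, not an existing capos-osbench API.

```rust
// Hypothetical capos-osbench shape: each translated intent either maps to a
// capability-native operation or is reported as not applicable. Names are
// illustrative, not an existing API.
#[derive(Debug)]
enum OsBenchIntent {
    ProcessCreate, // fork/exec/wait intent -> ProcessSpawner.spawn + ProcessHandle.wait
    PipePingPong,  // pipe/context-switch intent -> endpoint or byte-stream round trip
    NullRingEntry, // getpid-style overhead -> CAP_OP_NOP, labeled "capOS ring entry"
    FileReread,    // requires Store/Namespace; unsupported today
    MmapReread,    // requires file-backed mappings; unsupported today
    TcpRoundTrip,  // requires TcpSocket/TcpListener
}

#[derive(Debug)]
enum TranslatedRun {
    /// The intent maps onto an existing capOS mechanism; `transport` labels it.
    Supported { transport: &'static str },
    /// The subsystem does not exist yet; no shim, no number, just the reason.
    NotApplicable { missing: &'static str },
}

fn translate(intent: &OsBenchIntent) -> TranslatedRun {
    use OsBenchIntent::*;
    match intent {
        ProcessCreate => TranslatedRun::Supported { transport: "ProcessSpawner.spawn" },
        PipePingPong => TranslatedRun::Supported { transport: "endpoint round trip" },
        NullRingEntry => TranslatedRun::Supported { transport: "CAP_OP_NOP ring entry" },
        FileReread => TranslatedRun::NotApplicable { missing: "Store/Namespace" },
        MmapReread => TranslatedRun::NotApplicable { missing: "file-backed mappings" },
        TcpRoundTrip => TranslatedRun::NotApplicable { missing: "TcpSocket/TcpListener" },
    }
}

fn main() {
    use OsBenchIntent::*;
    for intent in [ProcessCreate, PipePingPong, NullRingEntry, FileReread, MmapReread, TcpRoundTrip] {
        println!("{:?} -> {:?}", intent, translate(&intent));
    }
}
```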
Tier 3: Portable Common Workloads
These benchmarks answer whether capOS is useful on recognizable work:
| Workload | Candidate benchmark | capOS prerequisite | Result verifier |
|---|---|---|---|
| SQLite database | SQLite speedtest1, optionally via a Phoronix profile on reference OSes | C runtime or native port, Store/Namespace or RAM-backed DB | SQLite exit status, optional SQL result checksum |
| OLTP database | TPC-C/TPC-E-inspired profile, not an official TPC result until disclosure and durability rules are met | durable Store/block I/O, SQL/database stack, transaction integrity, terminal/client driver model | committed transaction counts, invariant checks, ACID/error-injection proof |
| Decision-support database | TPC-H/TPC-DS-inspired profile at declared scale factors, not an official TPC result until rules are met | SQL/query engine, bulk data load, durable or explicitly memory-backed storage, query result verifier | query answer hashes, load status, scale factor, refresh/query stream status |
| Key-value serving | YCSB-style read/update/scan/insert mixes | Store/Namespace, KV service, stable client driver | operation counts, latency distribution, value/hash verifier |
| Storage engine | RocksDB/LevelDB db_bench-style fill/read/overwrite/seek profiles | file/store semantics, fsync/sync policy, storage engine port | key/value integrity, database reopen, configured write durability |
| Compression | xz, zstd, or small native compressor corpus | C/Rust userspace runtime and file/store access | compressed output hash and decompression hash |
| Build/developer workload | small Rust/C package build, later IX package build | process spawning, Store/Namespace, toolchain support | output artifact hash and build log status |
| Network throughput | iperf3-equivalent TCP stream and request/response latency | TcpSocket, network harness | byte count, JSON/structured summary, peer checksum |
| Storage I/O | fio-equivalent sequential/random read/write, verify mode | block device, Store/Namespace, direct I/O policy | fio-style verify/checksum result |
| File service | SPECstorage-inspired workload profile | network filesystem or capOS file-service equivalent, durable storage, client load generation | throughput, response time, data integrity |
| Java/server runtime | SPECjbb 2015 or Renaissance-inspired profiles | JVM or Java compatibility profile, timers, threads, networking/storage as needed | benchmark verifier and SLA/throughput summary |
| HTTP service | wrk-style request load against a capOS HTTP service | TCP, HTTP service, stable response corpus | response checksum/status mix, latency distribution, error rate |
| Cloud services | CloudSuite-inspired data caching/serving/search/web profiles | multi-service graph, storage/network/runtime support | workload-specific answer checks and service SLOs |
| Microservices | DeathStarBench/TailBench-inspired tail-latency profiles | service graph, network or local RPC, load generator, tracing/status caps | request correctness, p95/p99 latency, no unauthorized cap exposure |
| ML storage | MLPerf Storage-inspired data feeding profile | high-throughput storage path, dataset loader, accelerator or simulated training reader | records/images delivered, latency/throughput, data checksum |
| ML inference/training | MLPerf-inspired inference/training profile | model runtime, accelerator/GPU capability or CPU baseline, dataset and accuracy harness | accuracy/quality target plus throughput or time-to-train |
| Shell/session | boot-to-shell, Telnet shell, command launch latency | current shell plus terminal/socket path | transcript oracle and authority denial checks |
| Service graph | chat/adventure/resident service load | shared-service demos | scripted transcript and service identity checks |
| Runtime/library | Go/Lua/Wasm micro and app kernels | relevant runtime proposal milestones | language-level test suite or checksum oracle |
Early capOS should start with RAM-backed variants where storage is not ready, but those results must be labeled as memory-backed. A RAM-backed database result is not comparable to a Linux disk-backed SQLite result.
Industry benchmark families belong later than SQLite speedtest and simple compression/build profiles. TPC-C/TPC-E and TPC-H/TPC-DS are database-system references with strict workload, disclosure, pricing, and correctness expectations. SPEC, MLPerf, CloudSuite, TailBench, and DeathStarBench bring similar setup and disclosure obligations in their domains. capOS can use inspired profiles to exercise the same workload classes before it can make official or directly comparable claims, but reports must label them as such and state which upstream rules are not yet satisfied.
Tier 4: User-Story Benchmarks
User-story benchmarks measure complete workflows that a person, operator, or service owner would recognize. They are intentionally broader than a single primitive or portable benchmark profile, and they should be described by the user outcome they prove rather than by the current demo implementation.
Initial user stories:
| Story | Example capOS proof | Result verifier |
|---|---|---|
| Start a local session | boot to an interactive shell or terminal prompt | transcript reaches ready prompt with expected cap bundle |
| Authenticate and receive authority | anonymous session upgrades to an operator/session profile | wrong credential denied, right credential grants exact profile |
| Run a delegated task | launch a child process with a narrow cap bundle | child output, exit code, and denied extra authority match oracle |
| Use a remote terminal | host-local TCP terminal reaches the same shell/session model | connect, authenticate, run command, clean disconnect |
| Use a resident service | client talks to a long-running service through scoped authority | request/reply transcript and service-visible identity match oracle |
| Serve a network request | network-facing service handles requests while local work continues | response checksum, latency, and no unauthorized cap exposure |
| Complete a developer workflow | build or transform an artifact from declared inputs | output hash, logs, and resource profile match declared policy |
| Recover from expected failure | service fault, rejected grant, timeout, or restart path | failure is bounded, audited, and visible through status |
User-story results report latency distribution, success rate, resource usage, and authority outcome. They are the closest evidence for “effective on common workloads,” but they are not substitutes for primitive measurements when a regression appears.
Reference Operating Systems
Initial comparisons should use these environments:
| Reference | Why include it | Caveat |
|---|---|---|
| Linux guest under same QEMU/KVM flags | Stable baseline with broad benchmark support | Linux has mature drivers, filesystems, VM, scheduler, and libc |
| FreeBSD guest under same QEMU/KVM flags | Second mature Unix-like baseline, useful for POSIX-independent signal | Not every benchmark profile has equal FreeBSD support |
| Linux native host | Shows absolute host hardware ceiling | Not directly comparable to capOS-in-QEMU latency |
| seL4 or Genode reports/scenarios | Prior art for capability/microkernel IPC and service decomposition | Often not the same hardware, workload, or application stack |
The default published table should show capOS versus Linux guest first. Native host and external microkernel data belong in separate context columns, not the primary ratio.
Correctness Model
Every benchmark definition carries:
- expected input corpus hash;
- command or manifest used to run the workload;
- output verifier;
- allowed nondeterminism, such as timestamps or generated IDs;
- capOS authority profile;
- unsupported-feature policy;
- result parser version.
A result is invalid when:
- the output verifier fails;
- QEMU exits abnormally;
- the kernel panics or reports an unexpected fault;
- the benchmark had to grant broader authority than its declared profile;
- host logs show dropped records that invalidate the measurement;
- the run used a special fast path not available in the declared configuration;
- the reference OS result used a materially different workload size or dataset.
Correctness should be stored alongside the performance value. A fast failed run is not a slow successful run; it is no result.
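A minimal host-side sketch of that gate, assuming the run evidence has already been collected; the field names and reason strings are illustrative.

```rust
// Hypothetical host-side validity gate: a measurement is publishable only when
// every correctness condition holds. Field names are illustrative.
struct RunEvidence {
    verifier_passed: bool,
    qemu_exit_ok: bool,
    kernel_panic_or_fault: bool,
    authority_escalated: bool,        // run needed caps beyond its declared profile
    dropped_records: bool,            // host logs show drops that invalidate timing
    undeclared_fast_path: bool,       // special path not in the declared configuration
    workload_matches_reference: bool, // same dataset/size as the reference OS run
}

fn invalid_reasons(e: &RunEvidence) -> Vec<&'static str> {
    let mut reasons = Vec::new();
    if !e.verifier_passed { reasons.push("output verifier failed"); }
    if !e.qemu_exit_ok { reasons.push("QEMU exited abnormally"); }
    if e.kernel_panic_or_fault { reasons.push("kernel panic or unexpected fault"); }
    if e.authority_escalated { reasons.push("broader authority than declared profile"); }
    if e.dropped_records { reasons.push("dropped records invalidate the measurement"); }
    if e.undeclared_fast_path { reasons.push("undeclared special fast path"); }
    if !e.workload_matches_reference { reasons.push("reference workload size/dataset differs"); }
    reasons
}

/// A fast failed run is not a slow successful run; it is no result.
fn publishable(e: &RunEvidence) -> bool {
    invalid_reasons(e).is_empty()
}

fn main() {
    let evidence = RunEvidence {
        verifier_passed: true,
        qemu_exit_ok: true,
        kernel_panic_or_fault: false,
        authority_escalated: false,
        dropped_records: false,
        undeclared_fast_path: false,
        workload_matches_reference: true,
    };
    println!("publishable: {}", publishable(&evidence));
}
```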
Measurement Method
Controlled runs should use:
- fixed capOS commit, reference OS image hash, benchmark source hash, compiler version, and toolchain flags;
- fixed QEMU version, machine type, CPU model, memory size, SMP count, KVM/TCG mode, disk image type, and network backend;
- warmup runs for workloads with caches, JITs, connection setup, or first-use allocation;
- at least 5 measured runs for primitive and user-story benchmarks, more when coefficient of variation is high;
- median, min, max, standard deviation, and p95/p99 for latency where sample count supports it (see the statistics sketch after this list);
- raw logs retained for the benchmark artifact;
- no performance claim from one isolated run unless explicitly labeled as a smoke measurement.
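A minimal sketch of the summary statistics and the re-run trigger referenced in the list above; nearest-rank percentiles and a coefficient-of-variation threshold are assumptions here, not fixed policy.

```rust
// Hypothetical summary-statistics helper for repeated benchmark runs.
// Nearest-rank percentiles and the CV threshold are illustrative choices.
fn summarize(samples: &[f64]) -> Option<(f64, f64, f64, f64, f64)> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("NaN sample"));

    let n = sorted.len();
    let min = sorted[0];
    let max = sorted[n - 1];
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    };
    // Nearest-rank p95; only meaningful when the sample count supports it.
    let p95 = sorted[((0.95 * n as f64).ceil() as usize).saturating_sub(1)];

    let mean = sorted.iter().sum::<f64>() / n as f64;
    let variance = sorted.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n as f64;
    let stddev = variance.sqrt();

    Some((median, min, max, stddev, p95))
}

fn needs_more_runs(samples: &[f64], max_cv: f64) -> bool {
    // Re-run when the coefficient of variation exceeds the declared threshold.
    let mean = samples.iter().sum::<f64>() / samples.len() as f64;
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / samples.len() as f64;
    mean > 0.0 && variance.sqrt() / mean > max_cv
}

fn main() {
    let samples = [412.0, 405.0, 431.0, 409.0, 498.0];
    println!("(median, min, max, stddev, p95) = {:?}", summarize(&samples));
    println!("needs more runs: {}", needs_more_runs(&samples, 0.05));
}
```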
Cycle-counter measurements remain inside cfg(feature = "measure") and are
used for relative path decisions. Wall-clock user-story and workload
comparisons use host-side timestamps around QEMU transcripts or in-guest
monotonic timers when the timer contract is adequate.
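A minimal sketch of host-side wall-clock timing around a QEMU transcript, assuming the run is driven by an existing make target and checked against a ready marker; the target, marker string, and output handling are placeholders rather than the real harness.

```rust
// Hypothetical host-side wall-clock measurement of a boot-to-prompt user story.
// The make target, ready marker, and output handling are illustrative placeholders.
use std::process::Command;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let start = Instant::now();

    // Drive the run exactly as a smoke test would; capture the transcript.
    let output = Command::new("make")
        .arg("run-smoke") // placeholder target; a real harness pins the manifest
        .output()?;
    let elapsed = start.elapsed();

    let transcript = String::from_utf8_lossy(&output.stdout);
    let reached_prompt = transcript.contains("login:"); // placeholder ready marker

    // Correctness gates the timing: a fast run without the marker is no result.
    if output.status.success() && reached_prompt {
        println!("boot-to-prompt wall clock: {} ms", elapsed.as_millis());
    } else {
        eprintln!("run invalid: status={:?}, marker={}", output.status, reached_prompt);
    }
    Ok(())
}
```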
Result Schema
The benchmark harness should emit a structured artifact, not a free-form log:
enum BenchmarkStatus {
passed @0;
failed @1;
unsupported @2;
invalid @3;
}
struct BenchmarkResult {
runId @0 :Text;
benchmarkName @1 :Text;
tier @2 :UInt16;
status @3 :BenchmarkStatus;
correctnessId @4 :Text;
configHash @5 :Data;
artifactHash @6 :Data;
notes @7 :Text;
result :union {
measurement @8 :MeasurementSummary;
failure @9 :RunFailure;
unsupported @10 :RunFailure;
invalid @11 :RunFailure;
}
}
struct MeasurementSummary {
unit @0 :Text;
lowerIsBetter @1 :Bool;
median @2 :Float64;
p95 @3 :Float64;
samples @4 :List(Float64);
}
struct RunFailure {
reason @0 :Text;
detail @1 :Text;
}
This schema is conceptual. It should not be added to `schema/capos.capnp` until
a concrete benchmark-runner service exists; before that, host scripts can emit
JSON with the same shape. The important property is that measurement values
exist only in the passed/publishable branch; failed, unsupported, and invalid
runs carry reasons instead of zero-valued scalar defaults.
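A minimal host-side sketch of that JSON shape, assuming the serde and serde_json crates; the field names mirror the conceptual schema, hashes are carried as hex text for JSON, and the tagged outcome keeps measurement values out of failed, unsupported, and invalid runs.

```rust
// Hypothetical host-side mirror of the conceptual schema, emitted as JSON by
// CI scripts until a benchmark-runner service exists. Requires serde/serde_json.
use serde::Serialize;

#[derive(Serialize)]
#[serde(rename_all = "snake_case", tag = "status", content = "result")]
enum BenchmarkOutcome {
    Passed(MeasurementSummary),
    Failed(RunFailure),
    Unsupported(RunFailure),
    Invalid(RunFailure),
}

#[derive(Serialize)]
struct MeasurementSummary {
    unit: String,
    lower_is_better: bool,
    median: f64,
    p95: f64,
    samples: Vec<f64>,
}

#[derive(Serialize)]
struct RunFailure {
    reason: String,
    detail: String,
}

#[derive(Serialize)]
struct BenchmarkResult {
    run_id: String,
    benchmark_name: String,
    tier: u16,
    correctness_id: String,
    config_hash: String,   // hex text rather than raw Data, for JSON artifacts
    artifact_hash: String,
    notes: String,
    #[serde(flatten)]
    outcome: BenchmarkOutcome,
}

fn main() {
    // A failed run carries a reason instead of zero-valued measurement fields.
    let result = BenchmarkResult {
        run_id: "placeholder-run-id".into(),
        benchmark_name: "ring.cap_op_nop".into(),
        tier: 1,
        correctness_id: "ring-nop-smoke-v1".into(),
        config_hash: "deadbeef".into(),
        artifact_hash: "cafef00d".into(),
        notes: String::new(),
        outcome: BenchmarkOutcome::Failed(RunFailure {
            reason: "output verifier failed".into(),
            detail: "unexpected CQE result".into(),
        }),
    };
    println!("{}", serde_json::to_string_pretty(&result).unwrap());
}
```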
Integration With System Monitoring
System Monitoring should expose operational state; the benchmark system should store explicit run artifacts. The overlap is narrow:
- benchmark runs may read scoped `MetricsReader`, `SystemStatus`, `RingStats`, `SchedStats`, and later device stats before and after a run;
- benchmark summaries may be imported into a metrics service as low-cardinality gauges such as `benchmark.last_median_ms`, keyed by benchmark name and profile, after validation;
- raw samples, transcripts, QEMU logs, host environment, and correctness evidence belong in a `BenchmarkStore` or CI artifact store, not in always-on metrics;
- starting a privileged benchmark profile is an auditable event because it may require measurement-only caps, debug taps, or broad status readers;
- benchmark readers should receive scoped read-only caps, not global monitoring roots.
The existing system-monitoring-proposal.md boundary remains correct:
cycle-counter instrumentation stays behind measure, while cheap counters can
later graduate into narrow stats caps.
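A minimal sketch of that narrow bridge; the `MetricsSink` trait and label names are hypothetical, and the only load-bearing point is that unvalidated summaries never reach monitoring.

```rust
// Hypothetical bridge: only validated benchmark summaries become low-cardinality
// gauges; raw samples and transcripts stay in the benchmark artifact store.
trait MetricsSink {
    fn set_gauge(&mut self, name: &str, labels: &[(&str, &str)], value: f64);
}

struct ValidatedSummary<'a> {
    benchmark: &'a str,
    profile: &'a str,
    median_ms: f64,
    verifier_passed: bool,
}

fn import_summary(sink: &mut dyn MetricsSink, s: &ValidatedSummary) -> Result<(), &'static str> {
    // Import is gated on validation; an unverified run never reaches metrics.
    if !s.verifier_passed {
        return Err("summary not validated; keep it in the benchmark store only");
    }
    sink.set_gauge(
        "benchmark.last_median_ms",
        &[("benchmark", s.benchmark), ("profile", s.profile)],
        s.median_ms,
    );
    Ok(())
}

struct StdoutSink;
impl MetricsSink for StdoutSink {
    fn set_gauge(&mut self, name: &str, labels: &[(&str, &str)], value: f64) {
        println!("{name}{labels:?} = {value}");
    }
}

fn main() {
    let mut sink = StdoutSink;
    let summary = ValidatedSummary {
        benchmark: "ring.cap_op_nop",
        profile: "measure",
        median_ms: 0.5,
        verifier_passed: true,
    };
    import_summary(&mut sink, &summary).unwrap();
}
```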
External Grounding
Relevant local design grounding:
- `docs/build-run-test.md`
- `docs/status.md`
- `docs/proposals/system-monitoring-proposal.md`
- `docs/architecture/capability-ring.md`
- `docs/architecture/park.md`
- `docs/architecture/scheduling.md`
- `docs/research/sel4.md`
- `docs/research/zircon.md`
- `docs/research/genode.md`
- `docs/research/out-of-kernel-scheduling.md`
External sources checked:
- USENIX lmbench paper page: https://www.usenix.org/conference/usenix-1996-annual-technical-conference/lmbench-portable-tools-performance-analysis
- fio documentation: https://fio.readthedocs.io/en/master/fio_doc.html
- iperf3 documentation: https://software.es.net/iperf/
- SPEC CPU 2017 overview and run rules: https://www.spec.org/osg/cpu2017/ and https://www.spec.org/cpu2017/Docs/runrules.html
- Byte UnixBench repository: https://github.com/kdlucas/byte-unixbench
- SQLite testing documentation and OpenBenchmarking SQLite speedtest profile: https://www.sqlite.org/testing.html and https://openbenchmarking.org/test/pts/sqlite-speedtest
- TPC benchmark overview, TPC-C, TPC-H, and TPC-DS descriptions: https://www.tpc.org/information/benchmarks5.asp, https://www.tpc.org/tpcc/default5.asp, https://www.tpc.org/tpch/default5.asp, and https://www.tpc.org/tpcds/
- YCSB and storage-engine benchmark references: https://hse-project.github.io/apps/ycsb/, https://github.com/facebook/rocksdb/wiki/Benchmarking-tools, and https://github.com/google/leveldb
- SPECjbb 2015, Renaissance, and HTTP service benchmark references: https://www.spec.org/jbb2015/, https://renaissance.dev/, and https://github.com/wg/wrk
- Cloud/service benchmark references: https://github.com/parsa-epfl/cloudsuite, https://github.com/delimitrou/DeathStarBench, and https://tailbench.csail.mit.edu/
- Storage and ML benchmark references: https://www.spec.org/storage2020/, https://mlcommons.org/working-groups/benchmarks/storage/, https://mlcommons.org/benchmarks/training/, and https://docs.mlcommons.org/inference/index_gh/
- OpenBenchmarking test-suite/profile descriptions: https://openbenchmarking.org/suites/ and https://openbenchmarking.org/tests
The relevant lessons are straightforward:
- lmbench isolates OS primitives from larger application behavior and was explicitly used to compare system implementations.
- fio and iperf3 provide flexible, parameterized I/O and network workload models with machine-readable output and verification options.
- SPEC CPU’s run rules show why disclosure, correct output, and configuration control matter when publishing comparative results.
- UnixBench is useful as a historical system benchmark, but its own workload descriptions reveal Unix assumptions that capOS must translate carefully.
- SQLite speedtest is a recognizable application workload with broad public baseline data, but database benchmarking must distinguish RAM-backed and storage-backed results.
- TPC-C/TPC-E and TPC-H/TPC-DS are the right industry references for later OLTP and decision-support database claims, but capOS should treat early runs as TPC-inspired unless it can satisfy the relevant TPC rules and disclosure requirements.
- YCSB and `db_bench` are useful earlier data-system pressure tests because they can exercise key-value, read/write mix, and storage-engine behavior before capOS has a full SQL system.
- SPECjbb and Renaissance become relevant only when a Java profile exists; until then they are runtime targets, not near-term OS benchmarks.
- CloudSuite, DeathStarBench, and TailBench are good references for cloud, microservice, and tail-latency user stories, but they require a mature service graph, load generation, and workload-specific correctness checks.
- SPECstorage and MLPerf Storage are later storage references once capOS has durable storage and enough client/load infrastructure to avoid misleading fio-only claims.
- MLPerf inference/training is relevant only after model runtimes and accelerator or CPU-baseline execution are credible, and any result must carry the benchmark’s accuracy or quality target rather than only throughput.
- OpenBenchmarking/Phoronix-style test profiles are useful precedent for packaging benchmark definitions separately from result storage.
Implementation Plan
- Structured parser for current `run-measure`. Add a host parser that converts existing `measure:` and demo output lines into JSON artifacts with config hash, raw log path, and verifier status (a parser sketch follows this list).
- Primitive benchmark manifest set. Split ring, park, IPC, process, VM, and scheduler benchmarks into focused manifests so each can be repeated independently without running unrelated demos.
- Reference guest harness. Add Linux guest scripts that run equivalent primitive tests under the same QEMU/KVM settings. Keep these scripts outside the capOS boot image.
- Translated OS microbench suite. Implement `capos-osbench` for the subset of lmbench/UnixBench intents that capOS can represent honestly. Emit unsupported results for missing Store, file, mmap, and socket primitives until those subsystems exist.
- Common workload pilots. Start with workloads that can be made deterministic early: compression, SQLite speedtest against RAM-backed storage once Store exists, shell/session latency, and remote-terminal user-story latency after the current milestone.
- Network and storage workloads. Add iperf3/fio-equivalent profiles only after socket and block/storage capabilities exist. Use verification modes for write workloads.
- Benchmark store and monitoring bridge. Add a `BenchmarkStore` service or CI artifact convention. Import only validated summary values into monitoring metrics, and audit privileged benchmark starts.
- Regression gates. Add narrow CI thresholds for stable primitive paths. Use review-only warnings for noisy or hardware-dependent workloads until enough history exists.
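A minimal sketch of the first step, referenced from the list above. It assumes `measure:` lines of the form `measure: <name> <cycles>`; the real output format may differ, so this sketches the artifact structure rather than the final parser rules.

```rust
// Hypothetical parser for `measure:` lines in a run-measure transcript. The
// assumed line shape is "measure: <name> <cycles>"; the real format may differ.
use std::collections::BTreeMap;

fn parse_measure_lines(transcript: &str) -> BTreeMap<String, Vec<u64>> {
    let mut samples: BTreeMap<String, Vec<u64>> = BTreeMap::new();
    for line in transcript.lines() {
        let Some(rest) = line.trim().strip_prefix("measure:") else { continue };
        let mut fields = rest.split_whitespace();
        let (Some(name), Some(value)) = (fields.next(), fields.next()) else { continue };
        if let Ok(cycles) = value.parse::<u64>() {
            samples.entry(name.to_string()).or_default().push(cycles);
        }
    }
    samples
}

fn main() {
    // Placeholder transcript; a real run would read the QEMU log file instead.
    let transcript = "measure: ring.cap_op_nop 412\nmeasure: ring.cap_op_nop 405\n";
    for (name, cycles) in parse_measure_lines(transcript) {
        // Emit one JSON-ish record per counter; config hash and raw log path
        // would be attached by the surrounding harness.
        println!("{{\"name\":\"{}\",\"samples\":{:?}}}", name, cycles);
    }
}
```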
Reporting Format
Published reports should include:
- executive table with benchmark, status, unit, capOS median, Linux guest median, ratio, and notes;
- separate sections for primitive, common workload, and user-story results;
- correctness summary with failed/unsupported/invalid runs;
- configuration appendix with hashes and QEMU commands;
- raw artifact links;
- explicit warning for benchmark-only builds, debug tap runs, or special caps.
Do not publish a capOS “system score.” The useful output is a workload matrix with enough context to explain the result.
Non-Goals
- No POSIX compatibility layer purely to run Unix benchmarks.
- No public comparison that treats unsupported workloads as zero performance.
- No single aggregate score.
- No benchmark-only fast paths in normal dispatch builds.
- No always-on cycle-counter tracing.
- No network result publication before the network path has correctness and authority proofs.
- No storage result publication before write verification and crash/error semantics are defined.
Open Questions
- Which Linux primitive baselines should be first-class: pipe, Unix socket, futex, eventfd, io_uring, or all of them?
- Should the benchmark store be a capOS service, a host CI artifact convention, or both?
- What variance threshold should turn a benchmark from a CI gate into a review-only signal?
- How should reference OS images be pinned and distributed without bloating the repository?
- What is the earliest honest SQLite storage profile: RAM-only, MemoryObject backed, Store-backed, or block-backed?
- Should benchmark definitions be modeled as manifest fragments, host-side YAML/JSON, or capOS service objects?