Benchmarks
capOS benchmark rows are evidence records. Each row should say what workload ran, what was verified, how time was measured, what machine envelope was used, and where the raw artifacts were stored. A faster row whose verifier did not complete is not a performance result.
The broader benchmark model is in System Performance Benchmarks. Future parallel-pattern coverage is in HPC Parallel Processing Patterns.
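The evidence-record rule above can be sketched as a small data structure. This is only an illustration: the `BenchmarkRow` type, its field names, the `target/example-*` paths, and the second row's median are hypothetical and are not the harness's actual schema or recorded data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRow:
    """Hypothetical evidence record; field names are illustrative only."""
    workload: str          # what ran
    verifier_passed: bool  # did the verifier complete and accept the result?
    timing_method: str     # how time was measured
    machine_envelope: str  # host, pinning, vCPU count
    artifact_dir: str      # where the raw artifacts were stored
    median: float          # median in the harness's own unit


def performance_rows(rows):
    """Keep only rows whose verifier completed; the rest are not results."""
    return [r for r in rows if r.verifier_passed]


rows = [
    BenchmarkRow("run-smp-process-scale", True, "scaled user-mode cycles",
                 "smp2, QEMU pinned to 0,1", "target/example-a/", 875.0),
    # Made-up faster row with an incomplete verifier: it must be discarded.
    BenchmarkRow("run-smp-process-scale", False, "scaled user-mode cycles",
                 "smp2, QEMU pinned to 0,1", "target/example-b/", 600.0),
]
accepted = performance_rows(rows)
print([r.artifact_dir for r in accepted])
```

The faster row is dropped because its verifier did not complete, regardless of its median.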
Current CPU Workloads
capOS currently has two CPU-scaling workloads:
| Workload | Target | Timed region | Verifier | Primary use |
|---|---|---|---|---|
| run-smp-process-scale | Independent worker processes | worker compute only, after setup and before result reporting | aggregate prime count and checksum | Exercises multiple process-owned rings running CPU work on more than one scheduler CPU. |
| run-thread-scale | Sibling threads in one process | checksum work window, separate from spawn/join/shutdown totals | deterministic root checksum and metadata checks | Measures same-process thread scheduling, per-thread rings, and scheduler overhead. |
Both workloads keep serial and harness artifacts under target/. The capOS
rows below were collected under QEMU/KVM. The matching Linux rows use the same
workload shape where possible, but units differ by harness and should not be
compared directly across systems. Compare speedup ratios within a row.
Process-Scale SMP
make run-smp-process-scale boots a focused manifest, runs independent worker
processes, and times the CPU-bound worker window. Each worker owns its own
process ring. The timed section avoids syscalls and serial output; the
coordinator verifies the aggregate result after workers finish.
The current workload counts primes over 2..3_000_000 using balanced
contiguous splits. capOS reports a worker-side user-mode cycle counter shifted
right by 20 bits. Linux reports guest clock_gettime nanoseconds.
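A minimal sketch of the workload shape described above, under stated assumptions: "balanced" is taken to mean contiguous ranges whose sizes differ by at most one, the checksum is assumed to be the sum of the primes found, and all function names are hypothetical. The real worker code and checksum definition may differ.

```python
def balanced_splits(lo, hi, workers):
    """Split the inclusive range [lo, hi] into contiguous chunks whose
    sizes differ by at most one (assumed meaning of 'balanced')."""
    total = hi - lo + 1
    base, extra = divmod(total, workers)
    splits, start = [], lo
    for i in range(workers):
        size = base + (1 if i < extra else 0)
        splits.append((start, start + size - 1))
        start += size
    return splits


def is_prime(n):
    """Trial division; enough for a small illustration."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def worker(lo, hi):
    """Count primes in [lo, hi]; the checksum here is an assumed sum of primes."""
    count = checksum = 0
    for n in range(lo, hi + 1):
        if is_prime(n):
            count += 1
            checksum += n
    return count, checksum


def aggregate(lo, hi, workers):
    """Coordinator-side verification input: combine per-worker results."""
    results = [worker(a, b) for a, b in balanced_splits(lo, hi, workers)]
    return sum(c for c, _ in results), sum(s for _, s in results)


def scaled_cycles(raw_cycles):
    """Cycle count shifted right by 20 bits, as in the capOS rows."""
    return raw_cycles >> 20


print(balanced_splits(2, 3_000_000, 4))  # the documented range, four workers
print(aggregate(2, 100, 4))              # small demo range: (25, 1060)
```

The demo aggregates over 2..100 only to keep the example fast; the split call shows how the documented 2..3_000_000 range would divide across four workers under the equal-size assumption.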
Controlled benchmark-VM reruns were recorded on GCE n2-highcpu-8 at capOS
commit 0d89a91b (2026-04-30 11:09 UTC) with nested QEMU/KVM on Ubuntu
6.17.0-1012-gcp, QEMU 8.2.2, Rust nightly 1.97.0-nightly
(c935696dd 2026-04-29), and host logical CPUs 0,1,2,3 mapped to distinct
physical cores with SMT siblings 4,5,6,7.
| System | smp1 median | smp2 median | smp4 median | 1-to-2 speedup | 1-to-4 speedup |
|---|---|---|---|---|---|
| capOS | 1,639 scaled cycles | 875 scaled cycles | 1,111 scaled cycles | 1.873x | 1.475x |
| Linux | 1,275,187,210 ns | 659,218,025 ns | 337,877,986 ns | 1.934x | 3.774x |
The capOS 4-vCPU row improved over the 1-vCPU row but was slower than the
2-vCPU row. Linux continued improving through 4 vCPUs under the same pinning
and workload. Raw capOS artifacts are under
target/smp-process-scale/pinned-20260430T1113Z/; raw Linux artifacts are
under target/linux-smp-process-scale/pinned-20260430T1118Z/.
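The speedup columns can be reproduced from the medians in the table. The sketch below divides each system's smp1 median by its own smpN median, so the unit cancels within a system; the `speedup` helper is a hypothetical name and is not part of the harness.

```python
def speedup(baseline_median, scaled_median):
    """Ratio of the 1-CPU median to the N-CPU median, in the same unit."""
    return baseline_median / scaled_median


# Medians copied from the pinned process-scale table.
medians = {
    "capOS": {1: 1639, 2: 875, 4: 1111},                           # scaled cycles
    "Linux": {1: 1_275_187_210, 2: 659_218_025, 4: 337_877_986},   # nanoseconds
}

for system, row in medians.items():
    for n in (2, 4):
        print(f"{system} 1-to-{n}: {speedup(row[1], row[n]):.3f}x")
```

Because each ratio stays inside one system's unit, the capOS and Linux speedups are comparable even though the raw medians are not.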
SMT Run
The same harness can run an eight-logical-CPU case on the benchmark VM. That
machine exposes four physical cores and eight SMT threads, so the smp8-smt
row is an SMT measurement on a 4-core host.
The SMT run was recorded at commit 7c15dd47
(2026-04-30 11:45 UTC) with QEMU pinned to logical CPUs
0,1,2,3,4,5,6,7.
| System | smp1 median | smp2 median | smp4 median | smp8-smt median |
|---|---|---|---|---|
| capOS | 1,500 scaled cycles | 787 scaled cycles | 1,052 scaled cycles | 1,595 scaled cycles |
| Linux | 1,274,507,854 ns | 647,611,418 ns | 337,479,795 ns | 198,903,231 ns |
| System | 1-to-2 speedup | 1-to-4 speedup | 1-to-8 speedup |
|---|---|---|---|
| capOS | 1.906x | 1.426x | 0.940x |
| Linux | 1.968x | 3.777x | 6.408x |
Raw capOS SMT artifacts are under target/smp-process-scale/smt8-20260430T1148Z/.
Raw Linux SMT artifacts are under
target/linux-smp-process-scale/smt8-20260430T1151Z/.
In-Process Thread Scaling
make run-thread-scale runs sibling threads inside one process. Child threads
use per-thread rings. The workload computes fixed-size checksum blocks; the
default shape is a blocking parent join, 262,144 blocks (16 MiB), and
work_rounds=64.
The harness records both a work-window time and a total time. The work window brackets the checksum computation. Total time includes thread startup, synchronization, shutdown, and join overhead. For scheduler analysis, both numbers matter: work speedup shows CPU placement and dispatch during the syscall-free section, while total speedup shows the cost of the surrounding thread lifecycle.
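The difference between the two windows can be illustrated with a host-side sketch. It is not the capOS workload: the checksum is a placeholder, `run_thread_scale` is a hypothetical name, the sizes are tiny, and CPython threads share the interpreter lock, so this shows only where the timestamps sit and not real scaling.

```python
import threading
import time


def run_thread_scale(workers, total_blocks=4096, work_rounds=4):
    """Return (work_ns, total_ns, per-thread checksums) for one run."""
    barrier = threading.Barrier(workers)
    starts = [0] * workers
    ends = [0] * workers
    sums = [0] * workers
    per = total_blocks // workers  # assumes total_blocks divides evenly

    def body(i):
        barrier.wait()                      # every worker is spawned and ready
        starts[i] = time.perf_counter_ns()  # work window opens
        acc = 0
        for _ in range(work_rounds):
            for block in range(i * per, (i + 1) * per):
                acc = (acc * 31 + block) & 0xFFFFFFFF  # placeholder checksum
        sums[i] = acc
        ends[i] = time.perf_counter_ns()    # work window closes

    total_start = time.perf_counter_ns()    # total includes spawn and join
    threads = [threading.Thread(target=body, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    total_end = time.perf_counter_ns()

    work_ns = max(ends) - min(starts)       # first start to last finish
    return work_ns, total_end - total_start, sums


work_ns, total_ns, sums = run_thread_scale(2)
print(f"work={work_ns} ns total={total_ns} ns overhead={total_ns - work_ns} ns")
```

The gap between the total and the work window is the lifecycle cost (startup, synchronization, join) that the total speedup column exposes.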
Rows from the old 1 MiB workload with a spinning parent are kept for history only: the matching Linux pthread baseline also stayed flat at four workers, which indicates that the workload shape itself limited scaling. The current rows use the repaired 16 MiB blocking-parent shape unless noted.
Recorded evidence:
| System / mode | Placement | Runs | 1-to-2 work | 1-to-2 total | 1-to-4 work | 1-to-4 total | Notes |
|---|---|---|---|---|---|---|---|
| Linux pthread baseline (benchmark VM, 2026-05-10 19:46 UTC) | physical-core logical CPUs 0,1,2,3 | 5 | 1.996x | 1.995x | 3.974x | 3.850x | Same checksum workload and pin set as the 2026-05-10 capOS row. |
| capOS (Phase D WFQ, benchmark VM, 2026-05-10 19:32 UTC) | physical-core logical CPUs 0,1,2,3 | 5 | 1.809x | 1.774x | 3.088x | 2.700x | Per-thread weights/latency classes, per-CPU WFQ queues, bounded steal path. |
| Linux pthread baseline (benchmark VM, 2026-05-02 21:34 UTC) | physical-core logical CPUs 0,1,2,3 | 5 | 1.988x | 1.987x | 3.963x | 3.858x | Same repaired workload before Phase D. |
| capOS (single global queue, benchmark VM, 2026-05-02 21:35 UTC) | physical-core logical CPUs 0,1,2,3 | 5 | 1.883x | 1.787x | 1.566x | 1.538x | Shows the four-worker cost of the single global runnable queue. |
| Linux pthread baseline (2026-05-01 report) | physical-core logical CPUs | 5 | 1.991x | 1.990x | 3.958x | 3.834x | Repaired-shape baseline recorded in docs/changelog.md; target artifact directory is not named in the source record. |
| capOS (pre-collapse placement, 2026-05-01 report) | physical-core logical CPUs | 5 | 1.828x | 1.687x | 3.029x | 2.386x | Commit 136b72de; per-CPU placement model later replaced by the queue-collapse cleanup; target artifact directory is not named in the source record. |
| capOS, switch logs suppressed (pre-collapse, 2026-05-01 report) | physical-core logical CPUs | 5 | 1.913x | 1.636x | 3.272x | 2.303x | Same commit and model with scheduler switch logs suppressed; target artifact directory is not named in the source record. |
| capOS (post-collapse, single global queue, 2026-05-02 10:42 UTC) | physical-core logical CPUs 0,1,2,3 on the benchmark VM | 3 | 1.890x | 1.792x | 1.504x | 1.436x | Queue-collapse row recorded in docs/backlog/scheduler-evolution.md; target artifact directory is not named in the source record. |
The 2026-05-10 Phase D WFQ row uses the same repaired shape as the 2026-05-02
pair: blocking parent join, 262,144 blocks, work_rounds=64, five runs,
KVM-backed QEMU pinned to physical-core logical CPUs 0,1,2,3, and a matching
Linux pthread baseline on the same pin set. Raw capOS artifacts are under
target/thread-scale/20260510T193200Z/; raw Linux artifacts are under
target/linux-thread-scale/20260510T194600Z/.
The 2026-05-02 capOS/Linux pair used main commit 374f8556; raw capOS
artifacts are under target/thread-scale/20260502T213544Z/, and raw Linux
artifacts are under target/linux-thread-scale/20260502T213445Z/.
The 2026-05-10 Phase D WFQ row improved the four-worker work speedup from
1.566x to 3.088x and the four-worker total speedup from 1.538x to 2.700x
compared with the 2026-05-02 single-global-queue row. Linux on the same host
and pin set recorded 3.974x work and 3.850x total at four workers. The
remaining difference is the scheduler/runtime optimization target for later work.
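The size of the improvement and of the remaining gap can be read off the recorded four-worker speedups. This is a worked-arithmetic sketch using only values from the table; the dictionary names are illustrative.

```python
# Four-worker speedups copied from the recorded rows.
capos_wfq = {"work": 3.088, "total": 2.700}     # Phase D WFQ, 2026-05-10
capos_global = {"work": 1.566, "total": 1.538}  # single global queue, 2026-05-02
linux = {"work": 3.974, "total": 3.850}         # Linux pthread, 2026-05-10

for window in ("work", "total"):
    gain = capos_wfq[window] / capos_global[window]   # improvement over the old row
    of_linux = capos_wfq[window] / linux[window]      # share of the Linux speedup
    print(f"{window}: {gain:.2f}x over single queue, "
          f"{of_linux:.0%} of the Linux speedup")
```

On these numbers the work-window speedup roughly doubled, and the total window trails Linux by more than the work window does, which points at the thread lifecycle as well as the compute phase.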
Guest-side attribution is available with
CAPOS_THREAD_SCALE_GUEST_MEASURE=1. It emits aggregate and per-phase
measurements for spawn_ready, work, shutdown, and final_total,
including scheduler choice, lock, timer, TLB, serial, shared-kernel-lock,
network-poll, thread-placement, and sampled user-PC buckets. Host-side QEMU
profiling is available with CAPOS_THREAD_SCALE_PROFILE=1.
Interpreting CPU Counts
CPU-count rows are meaningful only with a recorded topology:
- Physical-core rows require enough physical cores for the vCPU count.
- SMT rows should say they are SMT rows and list the logical CPU set.
- Pinning QEMU with taskset is useful, but it is not CPU isolation by itself. Stronger runs should record isolcpus/nohz_full/rcu_nocbs, cpuset, or systemd affinity policy when used.
- Pinning QEMU to fewer host logical CPUs than guest vCPUs measures oversubscription behavior, not core scaling.
- Current QEMU/KVM results should stay separate from future direct cloud or bare-metal runs.
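The topology rules in this list can be expressed as a small classifier. This is a sketch under assumptions: `classify_row` and the `core_of` map are hypothetical, and the map assumes logical CPU N+4 is the SMT sibling of logical CPU N on the benchmark VM, which should be confirmed from the recorded host topology.

```python
def classify_row(guest_vcpus, pinned_cpus, core_of):
    """Label a CPU-count row from its pin set and a logical-CPU-to-core map."""
    pinned = list(pinned_cpus)
    if len(pinned) < guest_vcpus:
        return "oversubscribed"   # fewer host logical CPUs than guest vCPUs
    cores = {core_of[cpu] for cpu in pinned}
    if len(cores) < len(pinned):
        return "smt"              # at least two pinned CPUs share a physical core
    return "physical-core"        # every pinned CPU is on its own core


# Assumed sibling layout: logical CPUs 0-3 on cores 0-3, siblings 4-7.
core_of = {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3}

print(classify_row(4, [0, 1, 2, 3], core_of))  # physical-core
print(classify_row(8, range(8), core_of))      # smt
print(classify_row(4, [0, 1], core_of))        # oversubscribed
```

The classifier only labels the row; it says nothing about isolation, which still depends on the host policy recorded alongside the run.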
The current capOS benchmark tables cover physical-core rows up to four cores and one eight-logical-CPU SMT row on a 4-core/8-thread VM. They do not yet measure 16-core or 32-core systems.
Next CPU-Scaling Work
The next CPU-scaling milestone should be designed around direct hardware or a dedicated perf runner rather than nested QEMU as the primary evidence source. The benchmark suite needs:
- hardware discovery records for socket/core/SMT topology, APIC mode, timer source, frequency policy, memory size, and firmware/device model;
- workload rows at 1, 2, 4, 8, 16, and 32 workers where the machine has enough physical cores, plus separately labeled SMT rows;
- at least one static map/reduce checksum workload, one uneven dynamic-task workload, one barrier-heavy phase loop, and one IPC/service-bound workload;
- work-window and total-time reporting for every workload;
- matching Linux native baselines on the same hardware where a comparable workload exists;
- scheduler/runtime counters for queue depth, migrations, steals, reschedule IPIs, TLB shootdowns, timer ticks, lock wait/hold time, blocked time, and runnable but not running time;
- raw artifacts with source commit, toolchain, kernel config, host topology, run count, warmup policy, and verifier output.
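The worker-count requirement above can be sketched as a row planner that emits physical-core rows only where the machine has enough cores and labels anything beyond that as SMT. `plan_worker_rows` is a hypothetical helper, not part of the current harness.

```python
def plan_worker_rows(physical_cores, logical_cpus,
                     candidates=(1, 2, 4, 8, 16, 32)):
    """Return (workers, label) pairs the machine can legitimately support."""
    rows = []
    for n in candidates:
        if n <= physical_cores:
            rows.append((n, "physical-core"))
        elif n <= logical_cpus:
            rows.append((n, "smt"))  # must be reported as a separate SMT row
        # Larger counts would be oversubscription and are skipped.
    return rows


print(plan_worker_rows(4, 8))    # the current 4-core/8-thread benchmark VM
print(plan_worker_rows(32, 64))  # a hypothetical 32-core machine
```

For the current benchmark VM this yields the 1, 2, and 4 physical-core rows plus a labeled 8-way SMT row, matching the tables recorded so far.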
QEMU should remain useful for boot and regression coverage, but it should not be the primary source for a 16/32-core SMP scalability milestone.
Commands
Run the capOS process-scale workload:
make run-smp-process-scale
Run the process-scale workload with QEMU pinned to selected host CPUs:
CAPOS_SMP_SCALE_QEMU_TASKSET_CPUS=0,1 make run-smp-process-scale
Run the process-scale SMT row on a host with at least eight logical CPUs:
CAPOS_SMP_SCALE_INCLUDE_SMT=1 \
CAPOS_SMP_SCALE_QEMU_TASKSET_CPUS=0,1,2,3,4,5,6,7 \
make run-smp-process-scale
Run the thread-scale workload:
CAPOS_THREAD_SCALE_RUNS=5 \
CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
make run-thread-scale
Run the larger-workload Amdahl row:
CAPOS_THREAD_SCALE_RUNS=5 \
CAPOS_THREAD_SCALE_TOTAL_BLOCKS=1048576 \
CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
make run-thread-scale
Run a one-sample host-side QEMU profiling pass:
CAPOS_THREAD_SCALE_PROFILE=1 \
CAPOS_THREAD_SCALE_RUNS=1 \
CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
make run-thread-scale
Run a one-sample guest-side measurement pass:
CAPOS_THREAD_SCALE_GUEST_MEASURE=1 \
CAPOS_THREAD_SCALE_RUNS=1 \
CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
make run-thread-scale
Run only the host summary parser against an existing results.csv without
booting QEMU:
CAPOS_THREAD_SCALE_SUMMARY_ONLY=1 \
CAPOS_THREAD_SCALE_SUMMARY_CSV=<results.csv> \
CAPOS_THREAD_SCALE_SUMMARY_KVM_EVIDENCE=1 \
CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
CAPOS_THREAD_SCALE_TOTAL_BLOCKS=262144 \
CAPOS_THREAD_SCALE_PARENT_WAIT=join \
CAPOS_THREAD_SCALE_WORK_ROUNDS=64 \
tools/qemu-thread-scale-harness.sh
Run the native Linux pthread baseline for the thread-scale checksum workload:
LINUX_THREAD_SCALE_TASKSET_CPUS=0,1,2,3 \
make run-linux-thread-scale-baseline
Run the Linux process-scale comparison:
LINUX_SMP_SCALE_KERNEL=target/linux-smp-process-scale/kernel/vmlinuz \
tools/linux-smp-process-scale-baseline.sh
On hosts where /boot/vmlinuz is not readable by the current user, copy a
kernel image into ignored target/ storage first through the host’s normal
administrative path, then pass it as LINUX_SMP_SCALE_KERNEL. The script does
not invoke sudo itself.