Benchmarks

capOS benchmark rows are evidence records. Each row should say what workload ran, what was verified, how time was measured, what machine envelope was used, and where the raw artifacts were stored. A faster row whose verifier did not complete is not a performance result.
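One way to make that rule mechanical is to treat each row as a record and refuse to report it unless the verifier completed. The sketch below is illustrative only: `BenchmarkRow`, its field names, and `is_performance_result` are hypothetical and are not part of the capOS harness. The example values are taken from the process-scale section of this page.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BenchmarkRow:
    """Hypothetical evidence record; field names are illustrative."""
    workload: str           # what ran
    verifier: str           # what was checked
    verifier_passed: bool   # whether the verifier completed successfully
    timing_method: str      # how time was measured
    machine_envelope: str   # host, hypervisor, and pinning
    artifact_dir: str       # where the raw artifacts are stored

    def is_performance_result(self) -> bool:
        # Every descriptive field must be filled in, and the verifier
        # must have completed; speed alone never qualifies a row.
        described = all(
            text.strip()
            for text in (
                self.workload,
                self.verifier,
                self.timing_method,
                self.machine_envelope,
                self.artifact_dir,
            )
        )
        return described and self.verifier_passed


row = BenchmarkRow(
    workload="test-smp-process-scale",
    verifier="aggregate prime count and checksum",
    verifier_passed=True,
    timing_method="worker-side user-mode cycle counter >> 20",
    machine_envelope="nested QEMU/KVM, host logical CPUs 0,1,2,3",
    artifact_dir="target/smp-process-scale/pinned-20260430T1113Z/",
)
print(row.is_performance_result())
print(replace(row, verifier_passed=False).is_performance_result())
```

The second call reports False: the same row with an incomplete verifier is rejected regardless of how fast it ran.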

The broader benchmark model is in System Performance Benchmarks. Future parallel-pattern coverage is in HPC Parallel Processing Patterns.

Current CPU Workloads

capOS currently has two CPU-scaling workloads:

Workload	Target	Timed region	Verifier	Primary use
`test-smp-process-scale`	Independent worker processes	worker compute only, after setup and before result reporting	aggregate prime count and checksum	Exercises multiple process-owned rings running CPU work on more than one scheduler CPU.
`test-thread-scale`	Sibling threads in one process	checksum work window, separate from spawn/join/shutdown totals	deterministic root checksum and metadata checks	Measures same-process thread scheduling, per-thread rings, and scheduler overhead.

Both workloads keep serial and harness artifacts under target/. The capOS rows below were collected under QEMU/KVM. The matching Linux rows use the same workload shape where possible, but the time units differ by harness, so absolute values should not be compared across systems. Compare the speedup ratios instead, each of which is computed within a single system's row.

Process-Scale SMP

make test-smp-process-scale boots a focused manifest, runs independent worker processes, and times the CPU-bound worker window. Each worker owns its own process ring. The timed section avoids syscalls and serial output; the coordinator verifies the aggregate result after workers finish.

The current workload counts primes over 2..3_000_000 using balanced contiguous splits. capOS reports a worker-side user-mode cycle counter shifted right by 20 bits. Linux reports guest clock_gettime nanoseconds.
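A minimal sketch of that workload shape follows. It assumes that "balanced contiguous splits" means near-equal element counts per worker and that the range is half-open; neither detail is confirmed by this page. `balanced_splits`, `count_primes`, and `scale_cycles` are hypothetical names, the trial-division counter stands in for whatever the real worker runs, and a small limit replaces 3_000_000 so the example finishes quickly.

```python
def balanced_splits(start, stop, workers):
    """Split [start, stop) into contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(stop - start, workers)
    chunks, lo = [], start
    for index in range(workers):
        hi = lo + base + (1 if index < extra else 0)
        chunks.append((lo, hi))
        lo = hi
    return chunks


def count_primes(lo, hi):
    """Count primes in [lo, hi) by trial division (simple, not fast)."""
    count = 0
    for n in range(max(lo, 2), hi):
        if n < 4:
            count += 1          # 2 and 3 are prime
            continue
        if n % 2 == 0:
            continue
        d = 3
        while d * d <= n and n % d:
            d += 2
        if d * d > n:           # no odd divisor up to sqrt(n) was found
            count += 1
    return count


def scale_cycles(raw_cycles):
    """Apply the reporting shift described above: drop the low 20 bits."""
    return raw_cycles >> 20


LIMIT = 10_000                  # stand-in for the real 3_000_000 bound
chunks = balanced_splits(2, LIMIT, 4)
per_worker = [count_primes(lo, hi) for lo, hi in chunks]
print(chunks)
print(per_worker, sum(per_worker))   # the coordinator verifies the aggregate
```

Because of the 20-bit shift, one reported unit covers 1,048,576 raw cycles, so differences smaller than that are not visible in the capOS columns.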

Controlled benchmark-VM reruns were recorded on GCE n2-highcpu-8 at capOS commit 0d89a91b (2026-04-30 11:09 UTC) with nested QEMU/KVM on Ubuntu 6.17.0-1012-gcp, QEMU 8.2.2, Rust nightly 1.97.0-nightly (c935696dd 2026-04-29), and host logical CPUs 0,1,2,3 mapped to distinct physical cores with SMT siblings 4,5,6,7.

System	smp1 median	smp2 median	smp4 median	1-to-2 speedup	1-to-4 speedup
capOS	1,639 scaled cycles	875 scaled cycles	1,111 scaled cycles	1.873x	1.475x
Linux	1,275,187,210 ns	659,218,025 ns	337,877,986 ns	1.934x	3.774x

The capOS 4-vCPU row improved over the 1-vCPU row but was slower than the 2-vCPU row. Linux continued improving through 4 vCPUs under the same pinning and workload. Raw capOS artifacts are under target/smp-process-scale/pinned-20260430T1113Z/; raw Linux artifacts are under target/linux-smp-process-scale/pinned-20260430T1118Z/.
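The speedup columns can be reproduced from the medians in the table above. The sketch below only divides the 1-vCPU median by the N-vCPU median within each system; the `speedup` helper is a hypothetical name, and no cross-system ratio is computed because the units differ.

```python
def speedup(baseline_median, scaled_median):
    """Speedup of a multi-vCPU row relative to the 1-vCPU row of the same system."""
    return baseline_median / scaled_median


# Medians copied from the table above, keyed by vCPU count.
medians = {
    "capOS (scaled cycles)": {1: 1_639, 2: 875, 4: 1_111},
    "Linux (ns)": {1: 1_275_187_210, 2: 659_218_025, 4: 337_877_986},
}

for system, row in medians.items():
    for vcpus in (2, 4):
        print(f"{system} 1-to-{vcpus}: {speedup(row[1], row[vcpus]):.3f}x")
```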

SMT Run

The same harness can run an eight-logical-CPU case on the benchmark VM. That machine exposes four physical cores and eight SMT threads, so the smp8-smt row is an SMT measurement on a 4-core host.

The SMT run was recorded at commit 7c15dd47 (2026-04-30 11:45 UTC) with QEMU pinned to logical CPUs 0,1,2,3,4,5,6,7.

System	smp1 median	smp2 median	smp4 median	smp8-smt median
capOS	1,500 scaled cycles	787 scaled cycles	1,052 scaled cycles	1,595 scaled cycles
Linux	1,274,507,854 ns	647,611,418 ns	337,479,795 ns	198,903,231 ns

System	1-to-2 speedup	1-to-4 speedup	1-to-8 speedup
capOS	1.906x	1.426x	0.940x
Linux	1.968x	3.777x	6.408x

Raw capOS SMT artifacts are under target/smp-process-scale/smt8-20260430T1148Z/. Raw Linux SMT artifacts are under target/linux-smp-process-scale/smt8-20260430T1151Z/.

In-Process Thread Scaling

make test-thread-scale runs sibling threads inside one process. Child threads use per-thread rings. The workload computes fixed-size checksum blocks; the default shape is a blocking parent join, 262,144 blocks (16 MiB), and work_rounds=64.
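The block size is not stated directly, but it follows from the two figures above if 16 MiB is the total size of the checksummed data. The arithmetic below is a sketch under that assumption; it also assumes the block size stays the same when CAPOS_THREAD_SCALE_TOTAL_BLOCKS is raised to 1048576 for the larger-workload row in the Commands section.

```python
MIB = 1 << 20

default_blocks = 262_144
default_bytes = 16 * MIB

# Implied bytes per block for the default shape.
block_bytes = default_bytes // default_blocks

# Implied data size for the larger-workload row, assuming the same block size.
larger_blocks = 1_048_576
larger_mib = larger_blocks * block_bytes // MIB

print(block_bytes, larger_mib)
```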

The harness records both a work-window time and a total time. The work window brackets the checksum computation. Total time includes thread startup, synchronization, shutdown, and join overhead. For scheduler analysis, both numbers matter: work speedup shows CPU placement and dispatch during the syscall-free section, while total speedup shows the cost of the surrounding thread lifecycle.
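The two speedups can be derived from the same per-run records, as in the sketch below. It assumes each run yields a (work, total) pair and that speedups are taken from per-configuration medians; the page does not confirm how the harness aggregates runs. `speedups` is a hypothetical helper, and the sample times are made-up numbers in arbitrary units, not measurements.

```python
from statistics import median


def speedups(runs_by_workers):
    """Map worker count -> (work speedup, total speedup) against the 1-worker runs.

    runs_by_workers: {worker_count: [(work_time, total_time), ...]}
    """
    medians = {
        workers: (
            median(work for work, _ in runs),
            median(total for _, total in runs),
        )
        for workers, runs in runs_by_workers.items()
    }
    base_work, base_total = medians[1]
    return {
        workers: (base_work / work, base_total / total)
        for workers, (work, total) in medians.items()
        if workers != 1
    }


# Made-up sample values for illustration only.
sample = {
    1: [(1000, 1100), (1010, 1110), (990, 1090)],
    2: [(500, 600), (505, 605), (495, 595)],
}
print(speedups(sample))
```

In the made-up sample the work window halves exactly while the total does not, because the fixed lifecycle cost (total minus work) stays at 100 units in both configurations.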

The old 1 MiB workload with a spinning parent is kept for historical reference only: the matching Linux pthread baseline also stayed flat at four workers, which indicates that the workload shape rather than the capOS scheduler limited scaling. The current rows use the repaired 16 MiB blocking-parent shape unless noted.

Recorded evidence:

System / mode	Placement	Runs	1-to-2 work	1-to-2 total	1-to-4 work	1-to-4 total	Notes
Linux pthread baseline (benchmark VM, 2026-05-10 19:46 UTC)	physical-core logical CPUs `0,1,2,3`	5	1.996x	1.995x	3.974x	3.850x	Same checksum workload and pin set as the 2026-05-10 capOS row.
capOS (Phase D WFQ, benchmark VM, 2026-05-10 19:32 UTC)	physical-core logical CPUs `0,1,2,3`	5	1.809x	1.774x	3.088x	2.700x	Per-thread weights/latency classes, per-CPU WFQ queues, bounded steal path.
Linux pthread baseline (benchmark VM, 2026-05-02 21:34 UTC)	physical-core logical CPUs `0,1,2,3`	5	1.988x	1.987x	3.963x	3.858x	Same repaired workload before Phase D.
capOS (single global queue, benchmark VM, 2026-05-02 21:35 UTC)	physical-core logical CPUs `0,1,2,3`	5	1.883x	1.787x	1.566x	1.538x	Shows the four-worker cost of the single global runnable queue.
Linux pthread baseline (2026-05-01 report)	physical-core logical CPUs	5	1.991x	1.990x	3.958x	3.834x	Repaired-shape baseline recorded in `docs/changelog.md`; target artifact directory is not named in the source record.
capOS (pre-collapse placement, 2026-05-01 report)	physical-core logical CPUs	5	1.828x	1.687x	3.029x	2.386x	Commit `136b72de`; per-CPU placement model later replaced by the queue-collapse cleanup; target artifact directory is not named in the source record.
capOS, switch logs suppressed (pre-collapse, 2026-05-01 report)	physical-core logical CPUs	5	1.913x	1.636x	3.272x	2.303x	Same commit and model with scheduler switch logs suppressed; target artifact directory is not named in the source record.
capOS (post-collapse, single global queue, 2026-05-02 10:42 UTC)	physical-core logical CPUs `0,1,2,3` on the benchmark VM	3	1.890x	1.792x	1.504x	1.436x	Queue-collapse row recorded in `docs/backlog/scheduler-evolution.md`; target artifact directory is not named in the source record.

The 2026-05-10 Phase D WFQ row uses the same repaired shape as the 2026-05-02 pair: blocking parent join, 262,144 blocks, work_rounds=64, five runs, KVM-backed QEMU pinned to physical-core logical CPUs 0,1,2,3, and a matching Linux pthread baseline on the same pin set. Raw capOS artifacts are under target/thread-scale/20260510T193200Z/; raw Linux artifacts are under target/linux-thread-scale/20260510T194600Z/.

The 2026-05-02 capOS/Linux pair used main commit 374f8556; raw capOS artifacts are under target/thread-scale/20260502T213544Z/, and raw Linux artifacts are under target/linux-thread-scale/20260502T213445Z/.

The Phase D WFQ row improved the four-worker work window from 1.566x to 3.088x and the four-worker total window from 1.538x to 2.700x compared with the 2026-05-02 21:35 UTC single-global-queue row. Linux on the same host and pin set recorded 3.974x work and 3.850x total at four workers. The remaining gap to Linux is the scheduler/runtime optimization target for later work.
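The size of that remaining gap can be expressed as parallel efficiency (speedup divided by worker count) and as the ratio of the capOS speedup to the Linux speedup. The sketch below uses only the four-worker figures from the 2026-05-10 rows; the dictionary layout and names are illustrative.

```python
WORKERS = 4

# Four-worker speedups from the 2026-05-10 rows in the table above.
four_worker = {
    "capOS Phase D WFQ": {"work": 3.088, "total": 2.700},
    "Linux pthread": {"work": 3.974, "total": 3.850},
}

# Parallel efficiency: fraction of ideal linear scaling achieved.
efficiency = {
    system: {window: value / WORKERS for window, value in row.items()}
    for system, row in four_worker.items()
}

# capOS speedup as a fraction of the Linux speedup for the same window.
relative = {
    window: four_worker["capOS Phase D WFQ"][window] / four_worker["Linux pthread"][window]
    for window in ("work", "total")
}

for system, row in efficiency.items():
    print(system, {window: round(value, 3) for window, value in row.items()})
print("capOS relative to Linux", {w: round(v, 3) for w, v in relative.items()})
```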

Guest-side attribution is available with CAPOS_THREAD_SCALE_GUEST_MEASURE=1. It emits aggregate and per-phase measurements for spawn_ready, work, shutdown, and final_total, including scheduler choice, lock, timer, TLB, serial, shared-kernel-lock, network-poll, thread-placement, and sampled user-PC buckets. Host-side QEMU profiling is available with CAPOS_THREAD_SCALE_PROFILE=1.

Direct IPC Handoff Timing

The 2026-07-27 direct-IPC run measures a CALL-to-receiver workload interval for a synchronous two-thread endpoint exchange and, independently, the scheduler interval inside direct handoffs. The parent performs eight untimed warmup exchanges followed by 64 measured exchanges. It reads the guest TSC immediately before CALL submission; the receiver reads the same TSC immediately after endpoint_recv_wait returns and sends that timestamp back. The resulting user-side interval includes CALL publication, kernel endpoint dispatch, receiver scheduling, context restore, and receiver resume, but stops before RETURN scheduling. It is a mixed-path workload metric because the output does not identify whether each receiver wake used the direct target or the ordinary queue.
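The measurement loop described above can be sketched as follows. This is a model of the bookkeeping only, not the capOS test code: `exchange` is a hypothetical callable that performs one CALL and returns the caller's pre-CALL timestamp together with the timestamp the receiver sent back, and `fake_exchange` is a stand-in that fabricates counter values so the sketch runs on its own.

```python
from itertools import count

WARMUP_EXCHANGES = 8
MEASURED_EXCHANGES = 64


def mean_call_to_receiver(exchange):
    """Run the warmup and measured exchanges; return the mean interval in cycles."""
    for _ in range(WARMUP_EXCHANGES):
        exchange()                      # warmup: performed but not timed
    intervals = []
    for _ in range(MEASURED_EXCHANGES):
        call_tsc, receiver_tsc = exchange()
        # Interval ends when the receiver resumes, before RETURN scheduling.
        intervals.append(receiver_tsc - call_tsc)
    return sum(intervals) / len(intervals)


_ticks = count(start=1_000, step=10_000)


def fake_exchange():
    """Stand-in for one endpoint exchange with a fixed 2,500-cycle latency."""
    call_tsc = next(_ticks)
    return call_tsc, call_tsc + 2_500


print(mean_call_to_receiver(fake_exchange))
```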

The measure-only kernel counter starts when ready_thread_for_direct_ipc publishes the unblocked receiver as WakePolicy::DirectTarget and stops when choose_next_locked selects it. It therefore isolates ready-to-selection latency. The ordinary direct-handoff debug line is suppressed only in the measure build so serial output between selection and context restore does not inflate the user-side interval.

Five foreground runs of CAPOS_MEASURE_QEMU_TASKSET_CPUS=0 make run-measure were recorded from 2026-07-27 02:27–02:29 UTC. QEMU 8.2.2 ran one qemu64 guest vCPU with KVM enabled and was pinned to host logical CPU 0. The host was an 8-logical-CPU, 4-core Intel Xeon 2.80 GHz VM, so this is nested QEMU/KVM evidence rather than a bare-metal latency claim.

Measure	Runs	Median	Range within the 96-second collection
CALL-to-receiver mean per run	5	7,924,721 cycles	7,757,445–8,331,819 cycles
Kernel ready-to-selection mean per run	5	396,515 cycles	387,503–414,989 cycles
Individual kernel ready-to-selection interval	357 handoffs	—	378,337–534,133 cycles

The boot-global kernel counter recorded 71 or 72 direct selections per run while the workload issued 72 endpoint calls. The counter is not tagged by workload round, and any sample with a TSC underflow is discarded, so the record cannot identify which calls used the direct target. The CALL-to-receiver row therefore includes all 64 measured workload samples; only the kernel ready-to-selection rows describe observed direct selections. RETURN is deliberately outside both reported intervals: CAP_OP_RETURN wakes the caller through wake_cap_waiter_if_satisfied, which uses ordinary WFQ enqueue rather than direct_ipc_target. An end-to-end CALL/RETURN timer therefore includes queued caller wakeup and can quantize at the 10 ms TICK_NS; it is not a direct-handoff latency measure.
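The underflow rule and the comparison against the tick can be sketched as below. Two things here are assumptions rather than recorded facts: the helper names are hypothetical, and the cycle-to-time conversion uses the host's nominal 2.80 GHz clock as the guest TSC rate, which this page does not establish and which nested virtualization may invalidate. The event pairs passed to `ready_to_selection` are invented to exercise the discard path.

```python
ASSUMED_TSC_HZ = 2.8e9      # assumption: nominal host clock, not a measured guest TSC rate
TICK_NS = 10_000_000        # the 10 ms tick quoted above


def ready_to_selection(events):
    """Return selected - ready for each (ready_tsc, selected_tsc) pair.

    Pairs where the end stamp precedes the start stamp would underflow
    and are dropped rather than reported.
    """
    return [selected - ready for ready, selected in events if selected >= ready]


def cycles_to_ns(cycles, tsc_hz=ASSUMED_TSC_HZ):
    """Convert a cycle count to nanoseconds under the assumed TSC rate."""
    return cycles * 1e9 / tsc_hz


# Invented stamps: the middle pair underflows and is discarded.
kept = ready_to_selection([(100, 500), (900, 850), (1_000, 1_400)])
print(kept)

# Medians from the table above, converted under the assumed rate.
for label, cycles in (
    ("CALL-to-receiver median", 7_924_721),
    ("ready-to-selection median", 396_515),
):
    ns = cycles_to_ns(cycles)
    print(f"{label}: {ns / 1e6:.3f} ms, {ns / TICK_NS:.3f} of one tick")
```

Under that assumed rate both medians are shorter than one 10 ms tick, which is consistent with the warning that a timer including queued RETURN wakeup would be dominated by tick quantization rather than handoff cost.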

Host-CPU pinning is not isolation, guest TSC behavior is mediated by nested virtualization, and the measure image still contains other debug/serial instrumentation. The displayed ranges describe only the five samples collected inside this 96-second window; they are not expected to contain later runs even on the same host. These results establish a current-state QEMU baseline, not a stable distribution or a hardware, production, or cross-machine bound. Raw serial and terminal logs, command templates, normalized log hashes, and host topology are committed in the 2026-07-27 direct-IPC evidence bundle. The matching scheduler risk and bound analysis is in Direct IPC timing and priority inversion.

Interpreting CPU Counts

CPU-count rows are meaningful only with a recorded topology:

Physical-core rows require enough physical cores for the vCPU count.
SMT rows should say they are SMT rows and list the logical CPU set.
Pinning QEMU with taskset is useful, but it is not CPU isolation by itself. Stronger runs should record isolcpus/nohz_full/rcu_nocbs, cpuset, or systemd affinity policy when used.
Pinning QEMU to fewer host logical CPUs than guest vCPUs measures oversubscription behavior, not core scaling.
Current QEMU/KVM results should stay separate from future direct cloud or bare-metal runs.
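The rules above can be applied as a rough labeling step before a row is recorded. In the sketch below, `classify_row` is a hypothetical helper, and the `CORE_OF` map assumes that logical CPU N and N+4 are SMT siblings on the benchmark VM; the page states that 4,5,6,7 are the siblings of 0,1,2,3 but does not list the pairing. The check does not detect whether pinning is backed by real isolation.

```python
# Assumed logical CPU -> physical core map for the 4-core/8-thread VM.
CORE_OF = {0: 0, 1: 1, 2: 2, 3: 3, 4: 0, 5: 1, 6: 2, 7: 3}


def classify_row(guest_vcpus, pinned_cpus, core_of=CORE_OF):
    """Label a row by what its pin set can actually measure."""
    pinned = set(pinned_cpus)
    if len(pinned) < guest_vcpus:
        # Fewer host logical CPUs than vCPUs: oversubscription, not scaling.
        return "oversubscribed"
    cores = {core_of[cpu] for cpu in pinned}
    if len(cores) < guest_vcpus:
        # Enough logical CPUs, but some vCPUs must share a physical core.
        return "smt"
    return "physical-core"


print(classify_row(4, [0, 1, 2, 3]))
print(classify_row(8, range(8)))
print(classify_row(4, [0, 1]))
```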

The current capOS benchmark tables cover rows up to four physical cores, plus an eight-logical-CPU SMT row, all on a 4-core/8-thread VM. They do not yet measure 16-core or 32-core systems.

Next CPU-Scaling Work

The next CPU-scaling milestone should be designed around direct hardware or a dedicated perf runner rather than nested QEMU as the primary evidence source. The benchmark suite needs:

hardware discovery records for socket/core/SMT topology, APIC mode, timer source, frequency policy, memory size, and firmware/device model;
workload rows at 1, 2, 4, 8, 16, and 32 workers where the machine has enough physical cores, plus separately labeled SMT rows;
at least one static map/reduce checksum workload, one uneven dynamic-task workload, one barrier-heavy phase loop, and one IPC/service-bound workload;
work-window and total-time reporting for every workload;
matching Linux native baselines on the same hardware where a comparable workload exists;
scheduler/runtime counters for queue depth, migrations, steals, reschedule IPIs, TLB shootdowns, timer ticks, lock wait/hold time, blocked time, and runnable-but-not-running time;
raw artifacts with source commit, toolchain, kernel config, host topology, run count, warmup policy, and verifier output.
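The worker-count requirement in the list above can be sketched as a small planning step that emits a physical-core row only when the machine has enough cores and labels anything beyond that as SMT. `plan_rows` and the label strings are hypothetical; the real suite may select rows differently.

```python
WORKER_COUNTS = (1, 2, 4, 8, 16, 32)


def plan_rows(physical_cores, logical_cpus):
    """Return (workers, label) rows that the given topology can support."""
    rows = []
    for workers in WORKER_COUNTS:
        if workers <= physical_cores:
            rows.append((workers, "physical-core"))
        elif workers <= logical_cpus:
            rows.append((workers, "smt"))   # reported separately, never mixed in
        # Larger counts would oversubscribe the host and are skipped.
    return rows


# The current 4-core/8-thread benchmark VM.
print(plan_rows(physical_cores=4, logical_cpus=8))
```

For the current benchmark VM this yields physical-core rows at 1, 2, and 4 workers and one SMT row at 8, which matches the rows recorded on this page.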

QEMU should remain useful for boot and regression coverage, but it should not be the primary source for a 16/32-core SMP scalability milestone.

Commands

Run the capOS process-scale workload:

make test-smp-process-scale

Run the process-scale workload with QEMU pinned to selected host CPUs:

CAPOS_SMP_SCALE_QEMU_TASKSET_CPUS=0,1 make test-smp-process-scale

Run the process-scale SMT row on a host with at least eight logical CPUs:

CAPOS_SMP_SCALE_INCLUDE_SMT=1 \
  CAPOS_SMP_SCALE_QEMU_TASKSET_CPUS=0,1,2,3,4,5,6,7 \
  make test-smp-process-scale

Run the thread-scale workload:

CAPOS_THREAD_SCALE_RUNS=5 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  make test-thread-scale

Run the larger-workload Amdahl row:

CAPOS_THREAD_SCALE_RUNS=5 \
  CAPOS_THREAD_SCALE_TOTAL_BLOCKS=1048576 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  make test-thread-scale

Run a one-sample host-side QEMU profiling pass:

CAPOS_THREAD_SCALE_PROFILE=1 \
  CAPOS_THREAD_SCALE_RUNS=1 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  make test-thread-scale

Run a one-sample guest-side measurement pass:

CAPOS_THREAD_SCALE_GUEST_MEASURE=1 \
  CAPOS_THREAD_SCALE_RUNS=1 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  make test-thread-scale

Run only the host summary parser against an existing results.csv without booting QEMU:

CAPOS_THREAD_SCALE_SUMMARY_ONLY=1 \
  CAPOS_THREAD_SCALE_SUMMARY_CSV=<results.csv> \
  CAPOS_THREAD_SCALE_SUMMARY_KVM_EVIDENCE=1 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  CAPOS_THREAD_SCALE_TOTAL_BLOCKS=262144 \
  CAPOS_THREAD_SCALE_PARENT_WAIT=join \
  CAPOS_THREAD_SCALE_WORK_ROUNDS=64 \
  tools/qemu-thread-scale-harness.sh

Run the native Linux pthread baseline for the thread-scale checksum workload:

LINUX_THREAD_SCALE_TASKSET_CPUS=0,1,2,3 \
  make test-linux-thread-scale-baseline

Run the Linux process-scale comparison:

LINUX_SMP_SCALE_KERNEL=target/linux-smp-process-scale/kernel/vmlinuz \
  tools/linux-smp-process-scale-baseline.sh

On hosts where /boot/vmlinuz is not readable by the current user, copy a kernel image into ignored target/ storage first through the host’s normal administrative path, then pass it as LINUX_SMP_SCALE_KERNEL. The script does not invoke sudo itself.

