Proposal: HPC Parallel Processing Patterns
capOS should grow from focused SMP/threading speedup proofs into a correctness-gated suite of generic parallel processing patterns. The suite should cover the single-node and multi-node algorithm shapes commonly used in HPC without pretending that capOS already supports MPI, POSIX files, shared memory libraries, or cluster networking.
This proposal extends System Performance Benchmarks. It does not replace the selected In-Process Threading Scalability milestone. The immediate milestone remains proving same-process sibling thread speedup on the existing fixed-size checksum workload.
Design Grounding
Local grounding:
- Benchmarks
- System Performance Benchmarks
- SMP Phase C
- SMP
- Ring v2 For Full SMP
- Scheduler Evolution
- In-Process Threading
- HPC Parallel Patterns
External grounding summarized in the research note covers Berkeley dwarfs, NAS Parallel Benchmarks, HPL/LINPACK, HPCG, Graph500, MPI collectives, and OpenMP loop/task/reduction constructs.
Current capOS Benchmark Analysis
Current CPU-scaling evidence is useful but narrow:
- make run-smp-process-scale proves independent worker processes can improve from one to two CPUs under QEMU/KVM. It is accepted for the completed Multi-Process SMP Concurrency milestone. Its prime-counting workload is static partition plus final verification, and the 4-vCPU/SMT rows remain diagnostic rather than accepted scaling claims.
- make run-thread-scale uses a fixed-size checksum workload with per-thread rings and guest phase counters. It remains accepted=false: two-thread work is still roughly flat against one thread, total/shutdown cost grows with worker count, and SMT8 rows are informational on the 4-core/8-thread benchmark host.
- Native Linux pthread baselines show the same checksum shape can scale in some cases, but also expose coordinator/oversubscription sensitivity. Larger workloads help separate Amdahl effects from thread lifecycle overhead (a small worked model follows at the end of this section).
- Guest measurement now covers scheduler, serial, scheduler-lock, timer, TLB, and user-PC attribution, but the workload still represents only static partitioned CPU work.
So the current suite covers one pattern well: independent fixed chunks with a final checksum/reduction. It does not yet cover dynamic task scheduling, barriers, prefix/scan, all-to-all movement, stencils, sparse/dense kernels, graph frontiers, pipelines, or multi-node communication.
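A toy model, separate from any capOS measurement, illustrates the last point above: when every run pays a fixed per-run lifecycle/shutdown cost on top of perfectly divisible work, the measured two-worker speedup approaches the ideal only as the work term grows. All numbers in the sketch are hypothetical.

```rust
// Toy model, not a capOS benchmark: speedup when each run pays a fixed
// lifecycle/shutdown cost on top of perfectly divisible work.
fn modeled_speedup(work_units: f64, workers: f64, fixed_overhead: f64) -> f64 {
    let t1 = work_units + fixed_overhead;
    let tp = work_units / workers + fixed_overhead;
    t1 / tp
}

fn main() {
    // Hypothetical: 50 "units" of fixed overhead per run, two workers.
    for work in [100.0, 1_000.0, 10_000.0] {
        println!(
            "work={:>7} two-worker speedup={:.2}",
            work,
            modeled_speedup(work, 2.0, 50.0)
        );
    }
}
```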
Goals
- Classify parallel benchmark coverage by algorithm pattern, not by a single score or one “HPC benchmark” label.
- Keep correctness and authority gates ahead of speed claims.
- Provide single-node pattern kernels before multi-node transport exists.
- Give scheduler, runtime, memory, IPC, storage, and networking work concrete future coverage targets.
- Allow Linux/FreeBSD/MPI/OpenMP comparisons only when the semantic mapping is declared.
Non-Goals
- Do not port MPI or OpenMP as a prerequisite for the first pattern kernels.
- Do not run full HPL, HPCG, NAS, or Graph500 before capOS has the required runtime, memory, file/store, and network substrate.
- Do not add POSIX compatibility or ambient filesystem authority only to run a familiar suite.
- Do not count SMT diagnostics as core-count scaling evidence.
- Do not add benchmark-only fast paths to normal kernel dispatch.
Pattern Coverage Matrix
Each pattern should have a capOS-native kernel, a result verifier, and a declared authority profile. Multi-node variants remain future work until network transport and distributed-capability authority are explicit.
| Pattern | Single-node kernel | Multi-node shape | Verification |
|---|---|---|---|
| Static map/reduce | split fixed-size byte/block ranges across threads or processes | scatter chunks, local compute, reduce root | deterministic root hash or numeric reduction |
| Dynamic task pool | variable-cost tasks in a bounded deque or queue | work requests between nodes or delegated task shards | all task ids completed once, result hash, cancellation proof |
| Barrier phase loop | repeated phase computation with a barrier between phases | barrier across ranks or services | phase count, no early phase observation |
| Prefix/scan | per-thread prefix over numeric blocks | distributed scan over rank partitions | prefix checksum and boundary carry checks |
| Stencil/halo | 1D/2D/3D grid update with neighbor halo buffers | halo exchange between rank partitions | final grid checksum and boundary oracle |
| Dense tiled compute | tiled matrix multiply or small LU-like update | 2D block-cyclic tile distribution | matrix checksum/residual bound |
| Sparse iterative compute | CSR-like sparse matrix-vector plus dot products | partitioned sparse rows with global reductions | residual/checksum and iteration count |
| FFT/transpose | staged local FFT-like butterflies plus matrix transpose | all-to-all transpose between ranks | output checksum against reference |
| Sort/partition | integer bucket partition plus local sort | sample/splitter exchange and all-to-all buckets | sortedness, permutation checksum |
| Graph frontier | BFS-like frontier over synthetic graph | distributed frontier exchange | parent tree/level validation |
| Pipeline/stream | bounded producer/stage/consumer service graph | service pipeline across nodes | ordered records, backpressure, no dropped records |
| Collective-only | barrier, broadcast, gather, scatter, reduce, allreduce | same operations over networked ranks | collective-specific oracle and timeout behavior |
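As a sketch of what a capOS-native kernel plus verifier could look like for the first row, the following splits fixed byte ranges across threads, combines the partial checksums in a deterministic order, and checks the root against a sequential run of the same partition before any timing is reported. The hashing and combining functions are placeholders, not the existing checksum workload, and the sketch uses std threads rather than capOS primitives.

```rust
use std::thread;

// Placeholder per-chunk checksum; a real kernel would reuse the checksum workload.
fn chunk_sum(chunk: &[u8]) -> u64 {
    chunk.iter().fold(0u64, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u64))
}

// Deterministic, order-sensitive combine so the root value is reproducible.
fn combine(parts: &[u64]) -> u64 {
    parts.iter().fold(0u64, |acc, &p| acc.rotate_left(7) ^ p)
}

// Parallel form: one thread per fixed-size chunk, joined in chunk order.
fn static_reduce_par(data: &[u8], workers: usize) -> u64 {
    let chunk = (data.len() + workers - 1) / workers;
    let parts: Vec<u64> = thread::scope(|s| {
        data.chunks(chunk)
            .map(|c| s.spawn(move || chunk_sum(c)))
            .collect::<Vec<_>>()
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    });
    combine(&parts)
}

// Reference form: same partition computed sequentially, used by the verifier.
fn static_reduce_seq(data: &[u8], workers: usize) -> u64 {
    let chunk = (data.len() + workers - 1) / workers;
    combine(&data.chunks(chunk).map(chunk_sum).collect::<Vec<u64>>())
}

fn main() {
    let data: Vec<u8> = (0..1_000_000u32).map(|i| (i % 251) as u8).collect();
    let workers = 4;
    let root = static_reduce_par(&data, workers);
    // Fail closed: no speed summary is valid unless the root value matches.
    assert_eq!(root, static_reduce_seq(&data, workers), "verification failed");
    println!("pattern=static_reduce workers={workers} verified=true root={root:#x}");
}
```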
Proposed Stages
Stage 0: Preserve Current Milestone Semantics
Do not broaden make run-thread-scale while it is the selected milestone gate.
The current target should remain fixed-size checksum work with work/total
speedup thresholds and raw KVM evidence. Pattern expansion begins after the
same-process scaling milestone closes or as separate diagnostic targets that
cannot be mistaken for the milestone.
Stage 1: Single-Node Pattern Kernels
Add a parallel-patterns demo crate and host harness that can run small
single-node kernels both inside one process and across multiple worker processes:
- static_reduce: successor to the checksum workload, reusable as the sanity baseline.
- dynamic_pool: uneven task sizes to force runtime scheduling and fairness.
- barrier_loop: repeated phases to expose barrier and wakeup overhead.
- scan: prefix computation to exercise ordered fan-in/fan-out.
- stencil_2d: shared-buffer or private-buffer halo copies inside one node.
Each kernel prints compact structured lines with pattern, workers,
cpus, input_class, verified, work, total, and relevant counters.
Host harness summaries must keep raw logs under target/parallel-patterns/.
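A minimal sketch of the dynamic_pool shape and its result line, using std threads and a shared atomic counter in place of capOS primitives; field names beyond those listed above are illustrative.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::Instant;

// Uneven, deterministic task costs force runtime load balancing.
fn task_cost(id: usize) -> u64 {
    (1 + (id % 7) as u64) * 1_000
}

// Placeholder work: a deterministic spin whose length follows the task cost.
fn run_task(id: usize) -> u64 {
    let mut acc = id as u64;
    for _ in 0..task_cost(id) {
        acc = acc.wrapping_mul(6364136223846793005).wrapping_add(1);
    }
    acc
}

fn main() {
    let tasks = 1_000usize;
    let workers = 4usize;
    let cpus = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let next = AtomicUsize::new(0);
    let start = Instant::now();

    // Workers pull task ids from a shared counter, so assignment is dynamic;
    // XOR makes the combined hash independent of which worker ran which task.
    let partials: Vec<(u64, usize)> = thread::scope(|s| {
        (0..workers)
            .map(|_| {
                s.spawn(|| {
                    let (mut hash, mut done) = (0u64, 0usize);
                    loop {
                        let id = next.fetch_add(1, Ordering::Relaxed);
                        if id >= tasks {
                            break;
                        }
                        hash ^= run_task(id).rotate_left((id % 64) as u32);
                        done += 1;
                    }
                    (hash, done)
                })
            })
            .collect::<Vec<_>>()
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    });
    let elapsed = start.elapsed().as_micros();

    // Verification: the counter hands out each id once; the completed count and
    // a sequential reference hash confirm nothing was dropped or corrupted.
    let completed: usize = partials.iter().map(|p| p.1).sum();
    let result = partials.iter().fold(0u64, |acc, p| acc ^ p.0);
    let reference =
        (0..tasks).fold(0u64, |acc, id| acc ^ run_task(id).rotate_left((id % 64) as u32));
    let verified = completed == tasks && result == reference;
    // A real kernel reports separate work and total windows; this sketch has one.
    println!(
        "pattern=dynamic_pool workers={workers} cpus={cpus} input_class=small \
         verified={verified} work_us={elapsed} total_us={elapsed} result={result:#x}"
    );
}
```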
Stage 2: Memory And IPC Intensive Kernels
After MemoryObject/shared-buffer and IPC paths mature, add:
- sparse_spmv: CSR-style row partition with deterministic matrix generator;
- graph_bfs: synthetic graph frontier with visited-set validation;
- sort_bucket: bucket partition, prefix counts, local sort, and merge verification;
- pipeline_stream: bounded service stages with backpressure telemetry.
These kernels should run both thread-local and process/service forms so capOS can distinguish scheduler overhead from IPC, cap-table, shared-buffer, and service-boundary costs.
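For sparse_spmv in particular, a minimal sketch of the intended shape: a deterministic banded generator so any worker, process, or future node can rebuild its rows, a CSR row-range kernel that is easy to partition, and a cheap oracle on the result. The matrix values and sizes are placeholders.

```rust
// Minimal CSR sparse matrix; the deterministic generator stands in for a real one.
struct Csr {
    row_ptr: Vec<usize>,
    col_idx: Vec<usize>,
    values: Vec<f64>,
}

// Banded test matrix: 4.0 on the diagonal, -1.0 within the band, regenerable anywhere.
fn generate(n: usize, band: usize) -> Csr {
    let (mut row_ptr, mut col_idx, mut values) = (vec![0usize], Vec::new(), Vec::new());
    for i in 0..n {
        for j in i.saturating_sub(band)..(i + band + 1).min(n) {
            col_idx.push(j);
            values.push(if i == j { 4.0 } else { -1.0 });
        }
        row_ptr.push(col_idx.len());
    }
    Csr { row_ptr, col_idx, values }
}

// y gets the products for a contiguous row range, so rows can be split across
// workers that write disjoint output slices.
fn spmv_rows(m: &Csr, x: &[f64], rows: std::ops::Range<usize>, y: &mut [f64]) {
    for (out, i) in y.iter_mut().zip(rows) {
        let mut acc = 0.0;
        for k in m.row_ptr[i]..m.row_ptr[i + 1] {
            acc += m.values[k] * x[m.col_idx[k]];
        }
        *out = acc;
    }
}

fn main() {
    let n = 8;
    let m = generate(n, 1);
    let x = vec![1.0; n];
    let mut y = vec![0.0; n];
    spmv_rows(&m, &x, 0..n, &mut y);
    // Oracle: an interior band row times the all-ones vector is 4 - 1 - 1 = 2.
    assert!((y[n / 2] - 2.0).abs() < 1e-12, "verification failed");
    println!("y = {y:?}");
}
```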
Stage 3: Capability-Native Collectives
Introduce a small collective service or library abstraction before pretending to support MPI. The first operations are:
- barrier;
- broadcast;
- scatter/gather;
- reduce/allreduce;
- scan;
- all-to-all for fixed-size blocks.
Collectives are benchmark subjects and future runtime building blocks, not ambient cluster authority. A caller receives only the communicator/session cap for its benchmark group. Membership, timeout, cancellation, and stale-session behavior are part of the verifier.
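A hedged sketch of what such an abstraction could look like; none of these types or names exist in capOS today, and they are illustrative only.

```rust
use std::time::Duration;

// Hypothetical: a communicator/session capability scoped to one benchmark group.
pub struct GroupCap {
    pub session_id: u64,
    pub rank: u32,
    pub size: u32,
}

// Timeouts, cancellation, stale sessions, and member failure are part of the
// contract the verifier exercises, not exceptional harness behavior.
#[derive(Debug)]
pub enum CollectiveError {
    Timeout,
    Cancelled,
    StaleSession,
    MemberFailed(u32),
}

// Callers hold only the GroupCap for their group, never ambient transport authority.
pub trait Collectives {
    fn barrier(&self, group: &GroupCap, timeout: Duration) -> Result<(), CollectiveError>;
    fn broadcast(
        &self,
        group: &GroupCap,
        root: u32,
        buf: &mut [u8],
        timeout: Duration,
    ) -> Result<(), CollectiveError>;
    // Only the root observes the reduced value; everyone else gets None.
    fn reduce_sum_u64(
        &self,
        group: &GroupCap,
        root: u32,
        local: u64,
        timeout: Duration,
    ) -> Result<Option<u64>, CollectiveError>;
    fn allreduce_sum_u64(
        &self,
        group: &GroupCap,
        local: u64,
        timeout: Duration,
    ) -> Result<u64, CollectiveError>;
}
```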
Stage 4: Multi-Node Harness
After capOS has a network-capability path suitable for services, add a multi-node harness that can start N capOS guests or capOS plus Linux reference guests under a controlled topology. The first target is not full MPI; it is a capOS-native rank/session model:
- rank membership is represented by explicit capabilities;
- transport authority is scoped to the benchmark group;
- result collection includes per-node raw logs and topology metadata;
- failed, slow, or stale ranks produce controlled errors instead of hanging the harness indefinitely.
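One possible harness-side shape for recording such a run, with illustrative names only; rank and transport capabilities live on the capOS side and appear here just as identifiers and statuses.

```rust
use std::time::Duration;

// Hypothetical harness record for one rank in a benchmark group.
pub struct RankRecord {
    pub rank: u32,
    pub node: String,         // guest identifier within the launched topology
    pub raw_log_path: String, // per-node raw log collected with the results
    pub status: RankStatus,
}

pub enum RankStatus {
    Completed { verified: bool },
    Failed { reason: String },
    TimedOut { after: Duration }, // slow or stale ranks become errors, not hangs
}

pub struct GroupRun {
    pub topology: String, // e.g. "2 capOS guests plus 1 Linux reference guest"
    pub ranks: Vec<RankRecord>,
}

impl GroupRun {
    // Fail closed: the run is acceptable only if every rank completed and verified.
    pub fn accepted(&self) -> bool {
        self.ranks
            .iter()
            .all(|r| matches!(r.status, RankStatus::Completed { verified: true }))
    }
}
```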
Only then should capOS attempt NAS-like, HPL-like, HPCG-like, or Graph500-like profiles with clearly labeled deviations from upstream rules.
Authority And Safety Rules
- A benchmark group cap grants participation, not ambient network or process authority.
- Distributed pattern kernels must separate control-plane capabilities from data-plane buffers or sockets.
- Every kernel must have bounded allocation, queue, and message sizes.
- Timeouts and cancellation are correctness paths, not harness afterthoughts.
- Result verification must fail closed before speed summaries are accepted.
- Measurement features may add counters, but normal dispatch must remain the code path being evaluated unless the result is labeled as a measure build.
Reporting Format
Pattern results should extend the existing benchmark artifact conventions:
- source commit, manifest, input class, worker/rank count, CPU count, run count;
- host, QEMU/KVM, pinning, SMT/core topology, and network topology;
- capOS authority profile and any benchmark-only feature flags;
- per-run raw logs and results.csv;
- median work and total windows, plus variance;
- verifier status and reason for any not_applicable or diagnostic result;
- comparison-system mapping, such as OpenMP taskloop, pthreads, MPI collectives, or native Linux process/thread equivalents.
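A sketch of the per-run record these bullets imply, as a plain data structure; the field names are illustrative, not an existing capOS schema.

```rust
// Hypothetical per-run record; the existing benchmark artifact conventions
// define the authoritative fields and file layout.
pub struct RunRecord {
    pub source_commit: String,
    pub manifest: String,
    pub input_class: String,
    pub workers: u32,
    pub ranks: u32,
    pub cpus: u32,
    pub run_index: u32,
    pub verified: bool,
    pub status: RunStatus,
    pub work_us: Option<u64>,  // median work window, absent for failed runs
    pub total_us: Option<u64>, // median total window, absent for failed runs
    pub comparison_mapping: Option<String>, // e.g. OpenMP taskloop, pthreads, MPI collective
}

// A result is either accepted, explicitly diagnostic, or not applicable,
// always with a recorded reason for the latter two.
pub enum RunStatus {
    Accepted,
    Diagnostic { reason: String },
    NotApplicable { reason: String },
}
```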
Near-Term Recommendation
Do not start with HPL, HPCG, or Graph500 ports. Start with a small
capOS-native parallel-patterns harness after the current thread-scale
milestone closes. The first five kernels should be static_reduce,
dynamic_pool, barrier_loop, scan, and stencil_2d. That set broadens
coverage from “static independent chunks” to synchronization, irregular
scheduling, ordered reductions, and neighbor exchange while staying within
single-node capOS mechanisms.
When networking and storage mature, extend the same pattern definitions to multi-node and data-intensive variants rather than creating a parallel, unrelated benchmark suite.