Proposal: HPC Parallel Processing Patterns
capOS should grow from focused SMP/threading speedup proofs into a correctness-gated suite of generic parallel processing patterns. The suite should cover the single-node and multi-node algorithm shapes commonly used in HPC without pretending that capOS already supports MPI, POSIX files, shared memory libraries, or cluster networking.
This proposal extends System Performance Benchmarks. It does not replace the selected In-Process Threading Scalability milestone. The immediate milestone remains proving same-process sibling thread speedup on the existing fixed-size checksum workload.
Design Grounding
Local grounding:
- Benchmarks
- System Performance Benchmarks
- SMP Phase C
- SMP
- Ring v2 For Full SMP
- Scheduler Evolution
- In-Process Threading
- HPC Parallel Patterns
External grounding summarized in the research note covers Berkeley dwarfs, NAS Parallel Benchmarks, HPL/LINPACK, HPCG, Graph500, MPI collectives, and OpenMP loop/task/reduction constructs.
Current capOS Benchmark Analysis
Current CPU-scaling evidence is useful but narrow:
- make run-smp-process-scale proves independent worker processes can improve from one to two CPUs under QEMU/KVM. It is accepted for the completed Multi-Process SMP Concurrency milestone. Its prime-counting workload is static partition plus final verification, and the 4-vCPU/SMT rows remain diagnostic rather than accepted scaling claims.
- make run-thread-scale uses a fixed-size checksum workload with per-thread rings and guest phase counters. It remains accepted=false: two-thread work is still roughly flat against one thread, total/shutdown cost grows with worker count, and SMT8 rows are informational on the 4-core/8-thread benchmark host.
- Native Linux pthread baselines show the same checksum shape can scale in some cases, but also expose coordinator/oversubscription sensitivity. Larger workloads help separate Amdahl effects from thread lifecycle overhead (a small worked model follows at the end of this section).
- Guest measurement now covers scheduler, serial, scheduler-lock, timer, TLB, and user-PC attribution, but the workload still represents only static partitioned CPU work.
So the current suite covers one pattern well: independent fixed chunks with a final checksum/reduction. It does not yet cover dynamic task scheduling, barriers, prefix/scan, all-to-all movement, stencils, sparse/dense kernels, graph frontiers, pipelines, or multi-node communication.
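A toy model, separate from any capOS measurement, illustrates the last point above: when every run pays a fixed per-run lifecycle/shutdown cost on top of perfectly divisible work, the measured two-worker speedup approaches the ideal only as the work term grows. All numbers in the sketch are hypothetical.

```rust
// Toy model, not a capOS benchmark: speedup when each run pays a fixed
// lifecycle/shutdown cost on top of perfectly divisible work.
fn modeled_speedup(work_units: f64, workers: f64, fixed_overhead: f64) -> f64 {
    let t1 = work_units + fixed_overhead;
    let tp = work_units / workers + fixed_overhead;
    t1 / tp
}

fn main() {
    // Hypothetical: 50 "units" of fixed overhead per run, two workers.
    for work in [100.0, 1_000.0, 10_000.0] {
        println!(
            "work={:>7} two-worker speedup={:.2}",
            work,
            modeled_speedup(work, 2.0, 50.0)
        );
    }
}
```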
Goals
- Classify parallel benchmark coverage by algorithm pattern, not by a single score or one “HPC benchmark” label.
- Keep correctness and authority gates ahead of speed claims.
- Provide single-node pattern kernels before multi-node transport exists.
- Give scheduler, runtime, memory, IPC, storage, and networking work concrete future coverage targets.
- Allow Linux/FreeBSD/MPI/OpenMP comparisons only when the semantic mapping is declared.
Non-Goals
- Do not port MPI or OpenMP as a prerequisite for the first pattern kernels.
- Do not run full HPL, HPCG, NAS, or Graph500 before capOS has the required runtime, memory, file/store, and network substrate.
- Do not add POSIX compatibility or ambient filesystem authority only to run a familiar suite.
- Do not count SMT diagnostics as core-count scaling evidence.
- Do not add benchmark-only fast paths to normal kernel dispatch.
Pattern Coverage Matrix
Each pattern should have a capOS-native kernel, a result verifier, and a declared authority profile. Multi-node variants remain future work until network transport and distributed-capability authority are explicit.
| Pattern | Single-node kernel | Multi-node shape | Verification |
|---|---|---|---|
| Static map/reduce | split fixed-size byte/block ranges across threads or processes | scatter chunks, local compute, reduce root | deterministic root hash or numeric reduction |
| Dynamic task pool | variable-cost tasks in a bounded deque or queue | work requests between nodes or delegated task shards | all task ids completed once, result hash, cancellation proof |
| Barrier phase loop | repeated phase computation with a barrier between phases | barrier across ranks or services | phase count, no early phase observation |
| Prefix/scan | per-thread prefix over numeric blocks | distributed scan over rank partitions | prefix checksum and boundary carry checks |
| Stencil/halo | 1D/2D/3D grid update with neighbor halo buffers | halo exchange between rank partitions | final grid checksum and boundary oracle |
| Dense tiled compute | tiled matrix multiply or small LU-like update | 2D block-cyclic tile distribution | matrix checksum/residual bound |
| Sparse iterative compute | CSR-like sparse matrix-vector plus dot products | partitioned sparse rows with global reductions | residual/checksum and iteration count |
| FFT/transpose | staged local FFT-like butterflies plus matrix transpose | all-to-all transpose between ranks | output checksum against reference |
| Sort/partition | integer bucket partition plus local sort | sample/splitter exchange and all-to-all buckets | sortedness, permutation checksum |
| Graph frontier | BFS-like frontier over synthetic graph | distributed frontier exchange | parent tree/level validation |
| Pipeline/stream | bounded producer/stage/consumer service graph | service pipeline across nodes | ordered records, backpressure, no dropped records |
| Collective-only | barrier, broadcast, gather, scatter, reduce, allreduce | same operations over networked ranks | collective-specific oracle and timeout behavior |
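As a sketch of what a capOS-native kernel plus verifier could look like for the first row, the following splits fixed byte ranges across threads, combines the partial checksums in a deterministic order, and checks the root against a sequential run of the same partition before any timing is reported. The hashing and combining functions are placeholders, not the existing checksum workload, and the sketch uses std threads rather than capOS primitives.

```rust
use std::thread;

// Placeholder per-chunk checksum; a real kernel would reuse the checksum workload.
fn chunk_sum(chunk: &[u8]) -> u64 {
    chunk.iter().fold(0u64, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u64))
}

// Deterministic, order-sensitive combine so the root value is reproducible.
fn combine(parts: &[u64]) -> u64 {
    parts.iter().fold(0u64, |acc, &p| acc.rotate_left(7) ^ p)
}

// Parallel form: one thread per fixed-size chunk, joined in chunk order.
fn static_reduce_par(data: &[u8], workers: usize) -> u64 {
    let chunk = (data.len() + workers - 1) / workers;
    let parts: Vec<u64> = thread::scope(|s| {
        data.chunks(chunk)
            .map(|c| s.spawn(move || chunk_sum(c)))
            .collect::<Vec<_>>()
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    });
    combine(&parts)
}

// Reference form: same partition computed sequentially, used by the verifier.
fn static_reduce_seq(data: &[u8], workers: usize) -> u64 {
    let chunk = (data.len() + workers - 1) / workers;
    combine(&data.chunks(chunk).map(chunk_sum).collect::<Vec<u64>>())
}

fn main() {
    let data: Vec<u8> = (0..1_000_000u32).map(|i| (i % 251) as u8).collect();
    let workers = 4;
    let root = static_reduce_par(&data, workers);
    // Fail closed: no speed summary is valid unless the root value matches.
    assert_eq!(root, static_reduce_seq(&data, workers), "verification failed");
    println!("pattern=static_reduce workers={workers} verified=true root={root:#x}");
}
```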
Proposed Stages
Stage 0: Preserve Current Milestone Semantics
Do not broaden make run-thread-scale while it is the selected milestone gate.
The current target should remain fixed-size checksum work with work/total
speedup thresholds and raw KVM evidence. Pattern expansion begins after the
same-process scaling milestone closes or as separate diagnostic targets that
cannot be mistaken for the milestone.
Stage 1: Single-Node Pattern Kernels
Add a parallel-patterns demo crate and host harness that can run small
single-node kernels both inside one process and across multiple worker processes:
- static_reduce: successor to the checksum workload, reusable as the sanity baseline.
- dynamic_pool: uneven task sizes to force runtime scheduling and fairness.
- barrier_loop: repeated phases to expose barrier and wakeup overhead.
- scan: prefix computation to exercise ordered fan-in/fan-out.
- stencil_2d: shared-buffer or private-buffer halo copies inside one node.
Each kernel prints compact structured lines with pattern, workers,
cpus, input_class, verified, work, total, and relevant counters.
Host harness summaries must keep raw logs under target/parallel-patterns/.
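A minimal sketch of the dynamic_pool shape and its result line, using std threads and a shared atomic counter in place of capOS primitives; field names beyond those listed above are illustrative.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::Instant;

// Uneven, deterministic task costs force runtime load balancing.
fn task_cost(id: usize) -> u64 {
    (1 + (id % 7) as u64) * 1_000
}

// Placeholder work: a deterministic spin whose length follows the task cost.
fn run_task(id: usize) -> u64 {
    let mut acc = id as u64;
    for _ in 0..task_cost(id) {
        acc = acc.wrapping_mul(6364136223846793005).wrapping_add(1);
    }
    acc
}

fn main() {
    let tasks = 1_000usize;
    let workers = 4usize;
    let cpus = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let next = AtomicUsize::new(0);
    let start = Instant::now();

    // Workers pull task ids from a shared counter, so assignment is dynamic;
    // XOR makes the combined hash independent of which worker ran which task.
    let partials: Vec<(u64, usize)> = thread::scope(|s| {
        (0..workers)
            .map(|_| {
                s.spawn(|| {
                    let (mut hash, mut done) = (0u64, 0usize);
                    loop {
                        let id = next.fetch_add(1, Ordering::Relaxed);
                        if id >= tasks {
                            break;
                        }
                        hash ^= run_task(id).rotate_left((id % 64) as u32);
                        done += 1;
                    }
                    (hash, done)
                })
            })
            .collect::<Vec<_>>()
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    });
    let elapsed = start.elapsed().as_micros();

    // Verification: the counter hands out each id once; the completed count and
    // a sequential reference hash confirm nothing was dropped or corrupted.
    let completed: usize = partials.iter().map(|p| p.1).sum();
    let result = partials.iter().fold(0u64, |acc, p| acc ^ p.0);
    let reference =
        (0..tasks).fold(0u64, |acc, id| acc ^ run_task(id).rotate_left((id % 64) as u32));
    let verified = completed == tasks && result == reference;
    // A real kernel reports separate work and total windows; this sketch has one.
    println!(
        "pattern=dynamic_pool workers={workers} cpus={cpus} input_class=small \
         verified={verified} work_us={elapsed} total_us={elapsed} result={result:#x}"
    );
}
```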
Stage 2: Memory And IPC Intensive Kernels
After MemoryObject/shared-buffer and IPC paths mature, add:
- sparse_spmv: CSR-style row partition with deterministic matrix generator;
- graph_bfs: synthetic graph frontier with visited-set validation;
- sort_bucket: bucket partition, prefix counts, local sort, and merge verification;
- pipeline_stream: bounded service stages with backpressure telemetry.
These kernels should run both thread-local and process/service forms so capOS can distinguish scheduler overhead from IPC, cap-table, shared-buffer, and service-boundary costs.
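For sparse_spmv in particular, a minimal sketch of the intended shape: a deterministic banded generator so any worker, process, or future node can rebuild its rows, a CSR row-range kernel that is easy to partition, and a cheap oracle on the result. The matrix values and sizes are placeholders.

```rust
// Minimal CSR sparse matrix; the deterministic generator stands in for a real one.
struct Csr {
    row_ptr: Vec<usize>,
    col_idx: Vec<usize>,
    values: Vec<f64>,
}

// Banded test matrix: 4.0 on the diagonal, -1.0 within the band, regenerable anywhere.
fn generate(n: usize, band: usize) -> Csr {
    let (mut row_ptr, mut col_idx, mut values) = (vec![0usize], Vec::new(), Vec::new());
    for i in 0..n {
        for j in i.saturating_sub(band)..(i + band + 1).min(n) {
            col_idx.push(j);
            values.push(if i == j { 4.0 } else { -1.0 });
        }
        row_ptr.push(col_idx.len());
    }
    Csr { row_ptr, col_idx, values }
}

// y gets the products for a contiguous row range, so rows can be split across
// workers that write disjoint output slices.
fn spmv_rows(m: &Csr, x: &[f64], rows: std::ops::Range<usize>, y: &mut [f64]) {
    for (out, i) in y.iter_mut().zip(rows) {
        let mut acc = 0.0;
        for k in m.row_ptr[i]..m.row_ptr[i + 1] {
            acc += m.values[k] * x[m.col_idx[k]];
        }
        *out = acc;
    }
}

fn main() {
    let n = 8;
    let m = generate(n, 1);
    let x = vec![1.0; n];
    let mut y = vec![0.0; n];
    spmv_rows(&m, &x, 0..n, &mut y);
    // Oracle: an interior band row times the all-ones vector is 4 - 1 - 1 = 2.
    assert!((y[n / 2] - 2.0).abs() < 1e-12, "verification failed");
    println!("y = {y:?}");
}
```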
Stage 3: Capability-Native Collectives
Introduce a small collective service or library abstraction before pretending to support MPI. The first operations are:
- barrier;
- broadcast;
- scatter/gather;
- reduce/allreduce;
- scan;
- all-to-all for fixed-size blocks.
Collectives are benchmark subjects and future runtime building blocks, not ambient cluster authority. A caller receives only the communicator/session cap for its benchmark group. Membership, timeout, cancellation, and stale-session behavior are part of the verifier.
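A hedged sketch of what such an abstraction could look like; none of these types or names exist in capOS today, and they are illustrative only.

```rust
use std::time::Duration;

// Hypothetical: a communicator/session capability scoped to one benchmark group.
pub struct GroupCap {
    pub session_id: u64,
    pub rank: u32,
    pub size: u32,
}

// Timeouts, cancellation, stale sessions, and member failure are part of the
// contract the verifier exercises, not exceptional harness behavior.
#[derive(Debug)]
pub enum CollectiveError {
    Timeout,
    Cancelled,
    StaleSession,
    MemberFailed(u32),
}

// Callers hold only the GroupCap for their group, never ambient transport authority.
pub trait Collectives {
    fn barrier(&self, group: &GroupCap, timeout: Duration) -> Result<(), CollectiveError>;
    fn broadcast(
        &self,
        group: &GroupCap,
        root: u32,
        buf: &mut [u8],
        timeout: Duration,
    ) -> Result<(), CollectiveError>;
    // Only the root observes the reduced value; everyone else gets None.
    fn reduce_sum_u64(
        &self,
        group: &GroupCap,
        root: u32,
        local: u64,
        timeout: Duration,
    ) -> Result<Option<u64>, CollectiveError>;
    fn allreduce_sum_u64(
        &self,
        group: &GroupCap,
        local: u64,
        timeout: Duration,
    ) -> Result<u64, CollectiveError>;
}
```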
Stage 4: Multi-Node Harness
After capOS has a network-capability path suitable for services, add a multi-node harness that can start N capOS guests or capOS plus Linux reference guests under a controlled topology. The first target is not full MPI; it is a capOS-native rank/session model:
- rank membership is represented by explicit capabilities;
- transport authority is scoped to the benchmark group;
- result collection includes per-node raw logs and topology metadata;
- failed, slow, or stale ranks produce controlled errors instead of hanging the harness indefinitely.
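One possible harness-side shape for recording such a run, with illustrative names only; rank and transport capabilities live on the capOS side and appear here just as identifiers and statuses.

```rust
use std::time::Duration;

// Hypothetical harness record for one rank in a benchmark group.
pub struct RankRecord {
    pub rank: u32,
    pub node: String,         // guest identifier within the launched topology
    pub raw_log_path: String, // per-node raw log collected with the results
    pub status: RankStatus,
}

pub enum RankStatus {
    Completed { verified: bool },
    Failed { reason: String },
    TimedOut { after: Duration }, // slow or stale ranks become errors, not hangs
}

pub struct GroupRun {
    pub topology: String, // e.g. "2 capOS guests plus 1 Linux reference guest"
    pub ranks: Vec<RankRecord>,
}

impl GroupRun {
    // Fail closed: the run is acceptable only if every rank completed and verified.
    pub fn accepted(&self) -> bool {
        self.ranks
            .iter()
            .all(|r| matches!(r.status, RankStatus::Completed { verified: true }))
    }
}
```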
Only then should capOS attempt NAS-like, HPL-like, HPCG-like, or Graph500-like profiles with clearly labeled deviations from upstream rules.
Authority And Safety Rules
- A benchmark group cap grants participation, not ambient network or process authority.
- Distributed pattern kernels must separate control-plane capabilities from data-plane buffers or sockets.
- Every kernel must have bounded allocation, queue, and message sizes.
- Timeouts and cancellation are correctness paths, not harness afterthoughts.
- Result verification must fail closed before speed summaries are accepted.
- Measurement features may add counters, but normal dispatch must remain the code path being evaluated unless the result is labeled as a measure build.
Reporting Format
Pattern results should extend the existing benchmark artifact conventions:
- source commit, manifest, input class, worker/rank count, CPU count, run count;
- host, QEMU/KVM, pinning, SMT/core topology, and network topology;
- capOS authority profile and any benchmark-only feature flags;
- per-run raw logs and results.csv;
- median work and total windows, plus variance;
- verifier status and reason for any not_applicable or diagnostic result;
- comparison-system mapping, such as OpenMP taskloop, pthreads, MPI collectives, or native Linux process/thread equivalents.
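A sketch of the per-run record these bullets imply, as a plain data structure; the field names are illustrative, not an existing capOS schema.

```rust
// Hypothetical per-run record; the existing benchmark artifact conventions
// define the authoritative fields and file layout.
pub struct RunRecord {
    pub source_commit: String,
    pub manifest: String,
    pub input_class: String,
    pub workers: u32,
    pub ranks: u32,
    pub cpus: u32,
    pub run_index: u32,
    pub verified: bool,
    pub status: RunStatus,
    pub work_us: Option<u64>,  // median work window, absent for failed runs
    pub total_us: Option<u64>, // median total window, absent for failed runs
    pub comparison_mapping: Option<String>, // e.g. OpenMP taskloop, pthreads, MPI collective
}

// A result is either accepted, explicitly diagnostic, or not applicable,
// always with a recorded reason for the latter two.
pub enum RunStatus {
    Accepted,
    Diagnostic { reason: String },
    NotApplicable { reason: String },
}
```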
Near-Term Recommendation
Do not start with HPL, HPCG, or Graph500 ports. Start with a small
capOS-native parallel-patterns harness after the current thread-scale
milestone closes. The first five kernels should be static_reduce,
dynamic_pool, barrier_loop, scan, and stencil_2d. That set broadens
coverage from “static independent chunks” to synchronization, irregular
scheduling, ordered reductions, and neighbor exchange while staying within
single-node capOS mechanisms.
When networking and storage mature, extend the same pattern definitions to multi-node and data-intensive variants rather than creating a parallel, unrelated benchmark suite.