Research: HPC Parallel Patterns
This note grounds the capOS proposal for generic parallel processing pattern coverage. It is not a request to port full HPC suites immediately. The point is to classify which algorithm shapes capOS benchmarks should eventually cover so future SMP, threading, runtime, storage, network, and multi-node claims do not rest only on embarrassingly parallel worker loops.
Source Set
- The Berkeley “View” report argues that parallel programming systems need multiple styles of parallelism, and uses the dwarf taxonomy to describe important computational kernels rather than one benchmark score: https://people.eecs.berkeley.edu/~krste/papers/BerkeleyView.pdf.
- NASA’s NAS Parallel Benchmarks cover EP, IS, CG, MG, FT, BT, SP, LU, UA, DC, and DT across MPI, OpenMP, serial, and hybrid variants: https://www.nas.nasa.gov/software/npb.html.
- TOP500 describes LINPACK/HPL as dense linear-system evidence and warns that one number cannot describe overall system performance: https://www.top500.org/project/linpack/.
- Netlib’s HPL page identifies HPL as a distributed-memory double-precision dense linear-system implementation: https://netlib.sandia.gov/benchmark/hpl/.
- HPCG complements HPL by exercising sparse matrix-vector multiplication, vector updates, global dot products, Gauss-Seidel smoothing, triangular solve, and multigrid-preconditioned conjugate gradient: https://www.hpcg-benchmark.org/.
- Graph500 covers graph construction, breadth-first search, and shortest-path kernels, with shared-memory, distributed-memory, and external-memory/cloud variants: https://graph500.org/?page_id=12.
- The MPI 4.1 collective-communication chapter names the standard multi-rank movement and computation patterns: barrier, broadcast, gather, scatter, allgather, all-to-all, allreduce/reduce, reduce-scatter, and scan: https://www.mpi-forum.org/docs/mpi-4.1/mpi41-report/node114.htm.
- OpenMP 5.2 covers node-local loop/task/SIMD/reduction mechanisms, including taskloop, which partitions loop iterations into explicit tasks, and reduction clauses for parallel recurrence calculations (a minimal sketch follows this list): https://www.openmp.org/spec-html/5.2/openmp.html, https://www.openmp.org/spec-html/5.2/openmpse74.html, and https://www.openmp.org/spec-html/5.2/openmpse27.html.
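For concreteness, a minimal C sketch of the two OpenMP mechanisms named above. The array size and grainsize are illustrative assumptions, not values from the spec:

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    // taskloop: one thread partitions the iterations into explicit
    // tasks, which the surrounding parallel team then executes.
    #pragma omp parallel
    #pragma omp single
    #pragma omp taskloop grainsize(10000)
    for (int i = 0; i < N; i++)
        a[i] = (double)i * 0.5;

    // reduction: each thread accumulates a private copy of sum,
    // which the runtime combines at the end of the loop.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}
```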
Consequences For capOS
The current capOS CPU-scaling benchmarks are necessary but narrow. They exercise static worker partitioning, final result verification, and a small amount of spawn/join or process-wait coordination. That covers one important HPC pattern: independent tasks with a final reduction (sketched after the list below). It does not cover:
- structured grids and stencil/halo exchange;
- dense tiled matrix work;
- sparse matrix and irregular memory access;
- FFT/transposes and global all-to-all style communication;
- graph frontier expansion and high-fanout irregular queues;
- task graphs with dependency scheduling and cancellation;
- collectives as first-class operations;
- multi-node communication and authority boundaries.
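For contrast, here is a minimal sketch of the one shape the current benchmarks do cover. POSIX threads stand in for whatever capOS primitive the real benchmark uses; that substitution is an assumption of this sketch, not a statement about the capOS API:

```c
#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4
#define N 1000000

static double data[N];
static double partial[NWORKERS];

// Each worker sums a statically assigned chunk: the independent map
// step, with no sharing between workers until the final reduction.
static void *worker(void *arg) {
    long id = (long)arg;
    long chunk = N / NWORKERS;
    long lo = id * chunk;
    long hi = (id == NWORKERS - 1) ? N : lo + chunk;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    pthread_t t[NWORKERS];
    for (long i = 0; i < N; i++) data[i] = 1.0;
    for (long w = 0; w < NWORKERS; w++)
        pthread_create(&t[w], NULL, worker, (void *)w);
    double sum = 0.0;
    for (long w = 0; w < NWORKERS; w++) {  // join, then final reduction
        pthread_join(t[w], NULL);
        sum += partial[w];
    }
    printf("sum = %f (expect %d)\n", sum, N);
    return 0;
}
```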
The benchmark plan should therefore treat “parallel processing” as a matrix of patterns rather than a single scaling demo. A useful capOS coverage target is:
| Pattern family | Source precedent | capOS evidence it should force |
|---|---|---|
| Static map/reduce | OpenMP loop/reduction, NAS EP | low-overhead thread/process creation, result aggregation, no hot-path syscalls |
| Dynamic task graph | OpenMP tasks, Berkeley composition point | work queues, cancellation, dependency fan-in/fan-out, scheduler fairness under uneven tasks |
| Stencil and halo exchange | NAS MG/BT/SP/LU | shared buffers, neighbor exchange, barriers, cache locality, future network transport |
| Dense tiled linear algebra | HPL/LINPACK | compute locality, tile scheduling, reductions, optional SIMD/library runtime support |
| Sparse iterative solver | HPCG, NAS CG | irregular memory access, sparse matrix-vector work, global dot-product reductions |
| FFT/transpose | NAS FT | all-to-all movement, temporary buffers, memory pressure, future multi-node transpose |
| Sort/partition | NAS IS | all-to-all buckets, prefix/scan, allocator and queue pressure |
| Graph frontier | Graph500 | irregular frontier queues, atomic-like visited updates, high fanout, load imbalance (sketched below) |
| Collective communication | MPI collectives | barrier, broadcast, scatter/gather, reduce/allreduce, all-to-all semantics (sketched below) |
| Pipeline/stream | Berkeley composition point, future service graphs | bounded queues, backpressure, stage-local authority, telemetry |
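For the collective-communication row, a minimal C allreduce against a stock MPI implementation; the mpicc/mpirun commands in the comment are conventional MPI tooling, not capOS mechanisms:

```c
#include <mpi.h>
#include <stdio.h>

// Build/run assumptions: an MPI implementation such as MPICH or
// Open MPI, e.g. `mpicc allreduce.c && mpirun -n 4 ./a.out`.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Every rank contributes a value; allreduce sums the values and
    // delivers the total to every rank (reduce + broadcast in one).
    long local = rank, total = 0;
    MPI_Allreduce(&local, &total, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d of %d: total = %ld\n", rank, size, total);
    MPI_Finalize();
    return 0;
}
```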
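For the graph-frontier row, a serial C sketch of the level-synchronous BFS shape; the CSR graph layout and the tiny example graph are assumptions for illustration. The parallel variant is where the listed pressures (atomic visited updates, load imbalance) actually appear:

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int nverts;
    const int *offsets;   // offsets[v]..offsets[v+1] index into edges
    const int *edges;     // neighbor lists, CSR layout
} Graph;

static void bfs(const Graph *g, int root, int *depth) {
    int *frontier = malloc(g->nverts * sizeof *frontier);
    int *next = malloc(g->nverts * sizeof *next);
    for (int v = 0; v < g->nverts; v++) depth[v] = -1;

    int flen = 1, level = 0;
    frontier[0] = root;
    depth[root] = 0;
    while (flen > 0) {                              // one BFS level
        int nlen = 0;
        level++;
        for (int i = 0; i < flen; i++) {            // expand frontier
            int v = frontier[i];
            for (int e = g->offsets[v]; e < g->offsets[v + 1]; e++) {
                int w = g->edges[e];
                if (depth[w] == -1) {               // visited check:
                    depth[w] = level;               // atomic when parallel
                    next[nlen++] = w;
                }
            }
        }
        int *tmp = frontier; frontier = next; next = tmp;
        flen = nlen;
    }
    free(frontier);
    free(next);
}

int main(void) {
    // Tiny fixed graph: undirected edges 0-1, 0-2, 1-3.
    const int offsets[] = {0, 2, 4, 5, 6};
    const int edges[]   = {1, 2, 0, 3, 0, 1};
    Graph g = {4, offsets, edges};
    int depth[4];
    bfs(&g, 0, depth);
    for (int v = 0; v < 4; v++)
        printf("depth[%d] = %d\n", v, depth[v]);
    return 0;
}
```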
The near-term capOS subset should stay CPU-only and single-node until the selected in-process threading milestone is closed. The first expansion should add pattern kernels that reuse existing userspace/runtime mechanisms, then let future networking and storage milestones add multi-node and data-intensive variants.
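A minimal sketch of one candidate first-expansion kernel under those constraints: a single-node 1-D Jacobi stencil, the smallest member of the structured-grid family in the table above. OpenMP stands in for the capOS runtime mechanism the real benchmark would reuse, and the sizes are illustrative:

```c
#include <stdio.h>
#include <string.h>

#define N 1000000
#define STEPS 100

static double cur[N], nxt[N];

int main(void) {
    cur[0] = 1.0;          // boundary values drive the diffusion
    cur[N - 1] = 1.0;
    for (int s = 0; s < STEPS; s++) {
        // Each point reads its two neighbors from the old array and
        // writes the new array: the stencil/halo access pattern. The
        // join at the end of the parallel loop orders the steps.
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            nxt[i] = 0.5 * (cur[i - 1] + cur[i + 1]);
        nxt[0] = cur[0];
        nxt[N - 1] = cur[N - 1];
        memcpy(cur, nxt, sizeof cur);
    }
    printf("cur[1] = %g\n", cur[1]);
    return 0;
}
```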