Research: HPC Parallel Patterns
This note grounds the capOS proposal for generic parallel processing pattern coverage. It is not a request to port full HPC suites immediately. The point is to classify which algorithm shapes capOS benchmarks should eventually cover so future SMP, threading, runtime, storage, network, and multi-node claims do not rest only on embarrassingly parallel worker loops.
Source Set
- The Berkeley “View” report argues that parallel programming systems need multiple styles of parallelism, and uses the dwarf taxonomy to describe important computational kernels rather than one benchmark score: https://people.eecs.berkeley.edu/~krste/papers/BerkeleyView.pdf.
- NASA’s NAS Parallel Benchmarks cover EP, IS, CG, MG, FT, BT, SP, LU, UA, DC, and DT across MPI, OpenMP, serial, and hybrid variants: https://www.nas.nasa.gov/software/npb.html.
- TOP500 describes LINPACK/HPL as dense linear-system evidence and warns that one number cannot describe overall system performance: https://www.top500.org/project/linpack/.
- Netlib’s HPL page identifies HPL as a distributed-memory double-precision dense linear-system implementation: https://netlib.sandia.gov/benchmark/hpl/.
- HPCG complements HPL by exercising sparse matrix-vector multiplication, vector updates, global dot products, Gauss-Seidel smoothing, triangular solve, and multigrid-preconditioned conjugate gradient: https://www.hpcg-benchmark.org/.
- Graph500 covers graph construction, breadth-first search, and shortest-path kernels, with shared-memory, distributed-memory, and external-memory/cloud variants: https://graph500.org/?page_id=12.
- The MPI 4.1 collective-communication chapter names the standard multi-rank movement and computation patterns: barrier, broadcast, gather, scatter, allgather, all-to-all, allreduce/reduce, reduce-scatter, and scan: https://www.mpi-forum.org/docs/mpi-4.1/mpi41-report/node114.htm.
- OpenMP 5.2 covers node-local loop/task/SIMD/reduction mechanisms, including taskloop, which partitions loop iterations into explicit tasks, and reduction clauses for parallel recurrence calculations (a minimal sketch follows this list): https://www.openmp.org/spec-html/5.2/openmp.html, https://www.openmp.org/spec-html/5.2/openmpse74.html, and https://www.openmp.org/spec-html/5.2/openmpse27.html.
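For concreteness, a minimal C sketch of the two OpenMP mechanisms named above. The array size and grainsize are illustrative assumptions, not values from the spec:

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    // taskloop: one thread partitions the iterations into explicit
    // tasks, which the surrounding parallel team then executes.
    #pragma omp parallel
    #pragma omp single
    #pragma omp taskloop grainsize(10000)
    for (int i = 0; i < N; i++)
        a[i] = (double)i * 0.5;

    // reduction: each thread accumulates a private copy of sum,
    // which the runtime combines at the end of the loop.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}
```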
Consequences For capOS
The current capOS CPU-scaling benchmarks are necessary but narrow. They exercise static worker partitioning, final result verification, and a small amount of spawn/join or process-wait coordination. That covers one important HPC pattern: independent tasks with a final reduction (sketched after the list below). It does not cover:
- structured grids and stencil/halo exchange;
- dense tiled matrix work;
- sparse matrix and irregular memory access;
- FFT/transposes and global all-to-all style communication;
- graph frontier expansion and high-fanout irregular queues;
- task graphs with dependency scheduling and cancellation;
- collectives as first-class operations;
- multi-node communication and authority boundaries.
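For contrast, here is a minimal sketch of the one shape the current benchmarks do cover. POSIX threads stand in for whatever capOS primitive the real benchmark uses; that substitution is an assumption of this sketch, not a statement about the capOS API:

```c
#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4
#define N 1000000

static double data[N];
static double partial[NWORKERS];

// Each worker sums a statically assigned chunk: the independent map
// step, with no sharing between workers until the final reduction.
static void *worker(void *arg) {
    long id = (long)arg;
    long chunk = N / NWORKERS;
    long lo = id * chunk;
    long hi = (id == NWORKERS - 1) ? N : lo + chunk;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    pthread_t t[NWORKERS];
    for (long i = 0; i < N; i++) data[i] = 1.0;
    for (long w = 0; w < NWORKERS; w++)
        pthread_create(&t[w], NULL, worker, (void *)w);
    double sum = 0.0;
    for (long w = 0; w < NWORKERS; w++) {  // join, then final reduction
        pthread_join(t[w], NULL);
        sum += partial[w];
    }
    printf("sum = %f (expect %d)\n", sum, N);
    return 0;
}
```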
The benchmark plan should therefore treat “parallel processing” as a matrix of patterns rather than a single scaling demo. A useful capOS coverage target is:
| Pattern family | Source precedent | capOS evidence it should force |
|---|---|---|
| Static map/reduce | OpenMP loop/reduction, NAS EP | low-overhead thread/process creation, result aggregation, no hot-path syscalls |
| Dynamic task graph | OpenMP tasks, Berkeley composition point | work queues, cancellation, dependency fan-in/fan-out, scheduler fairness under uneven tasks |
| Stencil and halo exchange | NAS MG/BT/SP/LU | shared buffers, neighbor exchange, barriers, cache locality, future network transport |
| Dense tiled linear algebra | HPL/LINPACK | compute locality, tile scheduling, reductions, optional SIMD/library runtime support |
| Sparse iterative solver | HPCG, NAS CG | irregular memory access, sparse matrix-vector work, global dot-product reductions |
| FFT/transpose | NAS FT | all-to-all movement, temporary buffers, memory pressure, future multi-node transpose |
| Sort/partition | NAS IS | all-to-all buckets, prefix/scan, allocator and queue pressure |
| Graph frontier | Graph500 | irregular frontier queues, atomic-like visited updates, high fanout, load imbalance (sketched below) |
| Collective communication | MPI collectives | barrier, broadcast, scatter/gather, reduce/allreduce, all-to-all semantics (sketched below) |
| Pipeline/stream | Berkeley composition point, future service graphs | bounded queues, backpressure, stage-local authority, telemetry |
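For the collective-communication row, a minimal C allreduce against a stock MPI implementation; the mpicc/mpirun commands in the comment are conventional MPI tooling, not capOS mechanisms:

```c
#include <mpi.h>
#include <stdio.h>

// Build/run assumptions: an MPI implementation such as MPICH or
// Open MPI, e.g. `mpicc allreduce.c && mpirun -n 4 ./a.out`.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Every rank contributes a value; allreduce sums the values and
    // delivers the total to every rank (reduce + broadcast in one).
    long local = rank, total = 0;
    MPI_Allreduce(&local, &total, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d of %d: total = %ld\n", rank, size, total);
    MPI_Finalize();
    return 0;
}
```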
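For the graph-frontier row, a serial C sketch of the level-synchronous BFS shape; the CSR graph layout and the tiny example graph are assumptions for illustration. The parallel variant is where the listed pressures (atomic visited updates, load imbalance) actually appear:

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int nverts;
    const int *offsets;   // offsets[v]..offsets[v+1] index into edges
    const int *edges;     // neighbor lists, CSR layout
} Graph;

static void bfs(const Graph *g, int root, int *depth) {
    int *frontier = malloc(g->nverts * sizeof *frontier);
    int *next = malloc(g->nverts * sizeof *next);
    for (int v = 0; v < g->nverts; v++) depth[v] = -1;

    int flen = 1, level = 0;
    frontier[0] = root;
    depth[root] = 0;
    while (flen > 0) {                              // one BFS level
        int nlen = 0;
        level++;
        for (int i = 0; i < flen; i++) {            // expand frontier
            int v = frontier[i];
            for (int e = g->offsets[v]; e < g->offsets[v + 1]; e++) {
                int w = g->edges[e];
                if (depth[w] == -1) {               // visited check:
                    depth[w] = level;               // atomic when parallel
                    next[nlen++] = w;
                }
            }
        }
        int *tmp = frontier; frontier = next; next = tmp;
        flen = nlen;
    }
    free(frontier);
    free(next);
}

int main(void) {
    // Tiny fixed graph: undirected edges 0-1, 0-2, 1-3.
    const int offsets[] = {0, 2, 4, 5, 6};
    const int edges[]   = {1, 2, 0, 3, 0, 1};
    Graph g = {4, offsets, edges};
    int depth[4];
    bfs(&g, 0, depth);
    for (int v = 0; v < 4; v++)
        printf("depth[%d] = %d\n", v, depth[v]);
    return 0;
}
```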
The near-term capOS subset should stay CPU-only and single-node until the selected in-process threading milestone is closed. The first expansion should add pattern kernels that reuse existing userspace/runtime mechanisms, then let future networking and storage milestones add multi-node and data-intensive variants.
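A minimal sketch of one candidate first-expansion kernel under those constraints: a single-node 1-D Jacobi stencil, the smallest member of the structured-grid family in the table above. OpenMP stands in for the capOS runtime mechanism the real benchmark would reuse, and the sizes are illustrative:

```c
#include <stdio.h>
#include <string.h>

#define N 1000000
#define STEPS 100

static double cur[N], nxt[N];

int main(void) {
    cur[0] = 1.0;          // boundary values drive the diffusion
    cur[N - 1] = 1.0;
    for (int s = 0; s < STEPS; s++) {
        // Each point reads its two neighbors from the old array and
        // writes the new array: the stencil/halo access pattern. The
        // join at the end of the parallel loop orders the steps.
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            nxt[i] = 0.5 * (cur[i - 1] + cur[i + 1]);
        nxt[0] = cur[0];
        nxt[N - 1] = cur[N - 1];
        memcpy(cur, nxt, sizeof cur);
    }
    printf("cur[1] = %g\n", cur[1]);
    return 0;
}
```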