One algorithm — a 3D 7-point stencil — taken end to end across every parallel paradigm in the
course. We start from a textbook triple loop, add shared-memory threads, distribute it across nodes
with MPI halo exchange, move it onto a GPU, and finish with a strong/weak scaling study grounded in
Amdahl's and Gustafson's laws and a roofline estimate. Every code excerpt below is taken from the
real backends in source/src/.
MiniWeather is a deliberately small proxy for a climate / weather kernel: a single stencil sweep over a regular 3D grid, iterated for many timesteps. It is small enough to read in one sitting, but it stresses exactly the things that matter at scale — memory bandwidth, cache locality, inter-node communication, and accelerator offload.
The goal of this worked example is to take that one kernel and implement it four ways, measuring the speedup at each step. The physics never changes; only the way the work is mapped onto hardware does. That makes it the perfect vehicle for the parallel-programming arc of the syllabus: shared memory (OpenMP), distributed memory (MPI), and accelerators (CUDA), culminating in a quantitative scaling analysis.
parallel for, collapse, simd.Each cell is replaced by the average of its six face-neighbours plus four times itself, all divided by ten — a stable, diffusion-flavoured 7-point operator. Read the live 2D analogue on the visualization page.
src/stencil_cpu_serial.cpp · src/stencil_cpu_blocked.cpp ·
src/stencil_cpu_parallel.cpp · src/halo.cpp ·
src/stencil_gpu.cu. Driver: src/main.cpp; dispatch via
--backend {serial|blocked|parallel|gpu}.
collapse(2) + simd. Same node, more lanes.
stencil_cpu_parallel.cpp
The continuous model is a scalar diffusion (heat-equation-like) PDE on a 3D domain:
Discretising the Laplacian with second-order central differences on a uniform grid gives the classic 7-point stencil: the new value of a cell depends on itself and its six axis-aligned neighbours. The repository uses a specific stable weighting — centre weight 4, each neighbour weight 1, normalised by 10:
The six neighbour terms plus the weighted centre are exactly the seven array reads you will see
repeated verbatim in every backend below. The grid is stored as a flat
std::vector<double> with a one-cell halo (ghost layer) on every face,
so the index of cell (i,j,k) is
i·ny·nz + j·nz + k.
For the distributed runs the global grid is sliced into slabs along the
z axis — one contiguous slab per MPI rank. Each rank owns
local_nz = nz / size planes (the remainder is spread over the first few ranks), plus
two ghost planes it does not own but needs to read:
// Domain decomposition in z-dimension
int global_nz = cli.nz;
int local_nz = global_nz / size;
int remainder = global_nz % size;
if (rank < remainder) local_nz += 1;
// Starting z-index this rank is responsible for
int z_start = 0;
for (int r = 0; r < rank; ++r) {
int prev = global_nz / size;
if (r < remainder) prev += 1;
z_start += prev;
}
// Local arrays carry a 1-cell halo on every face
int nx = cli.nx + 2, ny = cli.ny + 2, nz = local_nz + 2;
Slicing along z (the fastest-varying / contiguous axis here) means each rank's halo is
a single nx × ny plane that is already contiguous-ish to pack — which is exactly what
makes the halo exchange in section 5 cheap.
Everything is measured against this. A single thread sweeps the interior of the grid, applies the
7-point operator into a next buffer, then ping-pong swaps current and
next for the following timestep. Double-buffering avoids read-after-write hazards: we
never overwrite a cell another cell still needs to read this step.
// Time stepping loop
for (int step = 0; step < cli.steps; ++step) {
for (int i = 1; i <= cli.nx; ++i) {
for (int j = 1; j <= cli.ny; ++j) {
for (int k = 1; k <= cli.nz; ++k) {
int idx = i * ny * nz + j * nz + k;
// 7-Point Stencil
next[idx] = (current[(i-1) * ny * nz + j * nz + k] +
current[(i+1) * ny * nz + j * nz + k] +
current[i * ny * nz + (j-1) * nz + k] +
current[i * ny * nz + (j+1) * nz + k] +
current[i * ny * nz + j * nz + (k-1)] +
current[i * ny * nz + j * nz + (k+1)] +
current[idx] * 4.0) / 10.0;
}
}
}
std::swap(current, next); // ping-pong buffers
}
The first parallelization keeps the algorithm identical but spreads iterations of the outer loops
across threads. Two pragmas do almost all the work: #pragma omp parallel for collapse(2)
fuses the i and j loops into one big iteration space so OpenMP has enough
independent work to balance across many cores, and #pragma omp simd tells the compiler
the innermost k loop is vectorizable — consecutive k values are adjacent in
memory, so the loads and the FMA can use SIMD lanes.
#ifdef MINIW_OMP
#pragma omp parallel for collapse(2)
#endif
for (int i = 1; i <= cli.nx; ++i) {
for (int j = 1; j <= cli.ny; ++j) {
#ifdef MINIW_OMP
#pragma omp simd
#endif
for (int k = 1; k <= local_nz; ++k) {
int idx = i * ny * nz + j * nz + k;
// 7-Point Stencil (identical math to the serial version)
next[idx] = (current[(i-1) * ny * nz + j * nz + k] +
current[(i+1) * ny * nz + j * nz + k] +
current[i * ny * nz + (j-1) * nz + k] +
current[i * ny * nz + (j+1) * nz + k] +
current[i * ny * nz + j * nz + (k-1)] +
current[i * ny * nz + j * nz + (k+1)] +
current[idx] * 4.0) / 10.0;
}
}
}
collapse(2) and not collapse(3)?
Collapsing only the outer two loops leaves the contiguous k loop intact for the
vectorizer and the hardware prefetcher. Collapsing all three would shatter that locality and force
a more expensive index computation per iteration. Threads get whole (i,j) columns —
big, cache-friendly chunks of work.
stencil_cpu_blocked.cpp goes further: it tiles the i,j plane into
Ti × Tj blocks sized for L1/L2, adds schedule(dynamic) for load balance,
software-prefetches k+4 ahead, and supports temporal blocking
(--tsteps). Same FLOPs, fewer cache misses — the section 3.1/3.3 optimization story made
concrete.
#ifdef MINIW_OMP
#pragma omp parallel for collapse(2) schedule(dynamic)
#endif
for (int ii = 1; ii <= nx; ii += Ti) {
for (int jj = 1; jj <= ny; jj += Tj) {
int i_end = std::min(ii + Ti - 1, nx);
int j_end = std::min(jj + Tj - 1, ny);
for (int i = ii; i <= i_end; ++i)
for (int j = jj; j <= j_end; ++j)
#ifdef MINIW_OMP
#pragma omp simd
#endif
for (int k = 1; k <= local_nz; ++k) {
if (k + 4 < NZ) __builtin_prefetch(¤t[idx(i, j, k + 4)], 0, 1);
next[idx(i, j, k)] =
( current[idx(i-1, j, k)] + current[idx(i+1, j, k)] +
current[idx(i, j-1, k)] + current[idx(i, j+1, k)] +
current[idx(i, j, k-1)] + current[idx(i, j, k+1)] +
current[idx(i, j, k)] * 4.0 ) / 10.0;
}
}
}
OpenMP runs out of road at the edge of a node: it cannot reach memory on another machine. MPI
crosses that boundary by giving each rank its own slab (section 2) and having neighbouring
ranks swap the one plane of cells they share. That plane is the halo (a.k.a. ghost
cells): rank r cannot compute the stencil on its boundary plane without the adjacent
plane that physically lives on rank r±1.
The exchange is split into a non-blocking start (post all sends/receives) and a
finish (wait + unpack). Posting MPI_Irecv/MPI_Isend first lets the
network move data while the rank computes its interior — communication/computation overlap, the
single most important MPI optimization for stencils.
void HaloExchange::start_exchange(std::vector<double>& field) {
// Pack the first owned z-plane into a contiguous send buffer
for (int i = 0; i < nx; ++i)
for (int j = 0; j < ny; ++j)
send_left[i * ny + j] = field[i * ny * nz + j * nz + 1];
// Pack the last owned z-plane
for (int i = 0; i < nx; ++i)
for (int j = 0; j < ny; ++j)
send_right[i * ny + j] = field[i * ny * nz + j * nz + (nz-2)];
// Post non-blocking recvs first, then matching sends (tags pair up neighbours)
MPI_Irecv(recv_left.data(), nx*ny, MPI_DOUBLE, left_rank, 0, MPI_COMM_WORLD, &recv_requests[0]);
MPI_Irecv(recv_right.data(), nx*ny, MPI_DOUBLE, right_rank, 1, MPI_COMM_WORLD, &recv_requests[1]);
MPI_Isend(send_left.data(), nx*ny, MPI_DOUBLE, left_rank, 1, MPI_COMM_WORLD, &send_requests[0]);
MPI_Isend(send_right.data(), nx*ny, MPI_DOUBLE, right_rank, 0, MPI_COMM_WORLD, &send_requests[1]);
}
void HaloExchange::finish_exchange(std::vector<double>& field) {
MPI_Waitall(2, recv_requests, MPI_STATUSES_IGNORE);
// Unpack received planes into the ghost layers (z = 0 and z = nz-1)
for (int i = 0; i < nx; ++i)
for (int j = 0; j < ny; ++j)
field[i * ny * nz + j * nz + 0] = recv_left[i * ny + j];
for (int i = 0; i < nx; ++i)
for (int j = 0; j < ny; ++j)
field[i * ny * nz + j * nz + (nz-1)] = recv_right[i * ny + j];
MPI_Waitall(2, send_requests, MPI_STATUSES_IGNORE);
}
The send to left_rank uses tag 1 while the receive from
left_rank uses tag 0 — and vice-versa on the right. This tag pairing is
what lets a rank's "right send" match its neighbour's "left receive" unambiguously, even though
every rank runs the identical code. Neighbours are computed with wrap-around
((rank ± 1 + size) % size), i.e. a periodic 1D process topology.
The driver calls start_exchange(), computes the interior (which only needs owned
cells), then calls finish_exchange() before the next step needs the halos. The same
HaloExchange object is reused by both the plain parallel and the cache-blocked CPU
backends, and the loop separately times interior vs. halo so you can see communication cost grow
with rank count.
MINIW_OMP / MINIW_MPI macros, the very same source
compiles to a hybrid build: R ranks (one per node/socket) each running T
OpenMP threads. That is the section-5.1 hybrid pattern, with no code rewrite.
On the GPU the loop nest disappears entirely. Instead of one thread walking millions of cells, we
launch millions of threads, each responsible for a single cell. The triple loop becomes three lines
of index arithmetic derived from the thread's position in the grid. The + 1 skips the
halo, and a bounds check handles grids that don't divide evenly by the block size.
__global__ void stencil_kernel(
const double* __restrict__ current,
double* __restrict__ next,
int nx, int ny, int nz,
int stride_y, int stride_z)
{
// 3D thread indexing (+1 skips the halo)
int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
int j = blockIdx.y * blockDim.y + threadIdx.y + 1;
int k = blockIdx.z * blockDim.z + threadIdx.z + 1;
if (i <= nx && j <= ny && k <= nz) {
int idx = i * stride_y * stride_z + j * stride_z + k;
next[idx] = (
current[(i-1) * stride_y * stride_z + j * stride_z + k] +
current[(i+1) * stride_y * stride_z + j * stride_z + k] +
current[i * stride_y * stride_z + (j-1) * stride_z + k] +
current[i * stride_y * stride_z + (j+1) * stride_z + k] +
current[i * stride_y * stride_z + j * stride_z + (k-1)] +
current[i * stride_y * stride_z + j * stride_z + (k+1)] +
current[idx] * 4.0
) / 10.0;
}
}
Threads are grouped into 8×8×8 = 512-thread blocks; the grid is sized to cover the
whole domain with a ceiling division. Each step launches the kernel, synchronises, and pointer-swaps
the device buffers — the GPU analogue of the serial ping-pong.
Because k is the contiguous axis and maps to threadIdx.x's fastest
dimension, neighbouring threads read neighbouring addresses — coalesced global loads. A
second kernel, stencil_kernel_shared, stages an 10³ tile (8³ interior + 1
halo each side) into __shared__ memory so the six neighbour reads hit on-chip memory
instead of re-reading global DRAM.
// Kernel configuration
dim3 block(8, 8, 8);
dim3 grid(
(nx + block.x - 1) / block.x,
(ny + block.y - 1) / block.y,
(local_nz + block.z - 1) / block.z
);
for (int step = 0; step < cli.steps; ++step) {
stencil_kernel<<<grid, block>>>(
d_current, d_next, nx, ny, local_nz, NY, NZ);
cudaDeviceSynchronize();
std::swap(d_current, d_next); // swap device pointers
}
cudaSetDevice(rank % device_count)) so it is multi-GPU-ready, but the
per-step inter-GPU halo exchange is still marked // TODO. Single-GPU and single-node
multi-GPU-without-exchange runs are correct today; closing that TODO is the first extension below.
With all four backends in hand we run a sweep and reduce it to two questions. Strong
scaling: fix the problem (here 64³ for 100 steps) and add workers — does it get
faster? Weak scaling: grow the problem in lockstep with the workers — does each step
stay the same length? The numbers below are representative figures used to illustrate the
method (the repo ships the harness — run.sh scale and the submit_scale.slurm
script — for collecting your own on the cluster; RESULTS.md is the template you fill in).
| Configuration | Workers | Time (s) | Speedup S | Efficiency E | GLUPS |
|---|---|---|---|---|---|
| Serial baseline | 1 | 12.80 | 1.00× | 100% | 0.21 |
| OpenMP (1 node) | 4 thr | 3.55 | 3.61× | 90% | 0.74 |
| OpenMP (1 node) | 8 thr | 2.05 | 6.24× | 78% | 1.28 |
| MPI (1 rank/node) | 4 ranks | 3.70 | 3.46× | 86% | 0.71 |
| MPI | 8 ranks | 2.15 | 5.95× | 74% | 1.22 |
| MPI | 16 ranks | 1.45 | 8.83× | 55% | 1.81 |
| CUDA (1 GPU) | 1 GPU | 0.42 | 30.5× | — | 6.25 |
| Ranks | Global grid | Time/step (ms) | Weak efficiency |
|---|---|---|---|
| 1 | 64³ | 36.0 | 100% |
| 2 | 64²×128 | 37.4 | 96% |
| 4 | 64²×256 | 38.9 | 93% |
| 8 | 64²×512 | 40.6 | 89% |
If a fraction p of the work is parallelizable and (1−p) is inherently serial, the best speedup on N workers is bounded:
Back-solving from the 8-thread OpenMP row (S ≈ 6.24 at N = 8) gives p ≈ 0.96 — about 4% serial overhead (allocation, the constant-field check, loop ramp-up). That caps the maximum attainable speedup at roughly 1/(1−0.96) ≈ 25× no matter how many cores you add. This is why strong scaling alone always eventually stalls.
Gustafson reframes the question: on a bigger machine you usually solve a bigger problem, not the same one faster. The scaled speedup grows linearly with N:
With the same p ≈ 0.96, eight ranks of weak scaling give a scaled speedup of about 8 − 0.04·7 ≈ 7.7 — comfortably above what Amdahl permits for the fixed-size problem. The 89% weak-efficiency row is the experimental shadow of exactly this law: MiniWeather is a weak-scaling success story, which is precisely why real climate codes scale to thousands of nodes by refining the mesh.
Whether any of this is even worth optimizing depends on whether the kernel is compute-bound or memory-bound. The 7-point stencil does 8 floating-point operations per cell update (6 adds + 1 multiply + 1 divide). In the streaming limit it must read the centre and write the result — 2 doubles = 16 bytes of unavoidable DRAM traffic per update (neighbour reads are ideally served from cache). Arithmetic intensity is therefore:
On the roofline model, performance is min(peak FLOP/s, I × bandwidth). At I = 0.5, even a modest 100 GB/s memory system caps the kernel at ≈ 50 GFLOP/s — far below any modern CPU/GPU's peak FLOP rate. MiniWeather is firmly memory-bandwidth-bound. Three consequences follow directly, and explain every optimization above:
Feature backends are toggled at configure time, so a single binary miniweather can
embed serial, OpenMP, MPI and (optionally) CUDA paths:
# one-shot helper (loads modules, configures, builds)
./run.sh build
# or by hand:
mkdir build && cd build
cmake .. -DENABLE_MPI=ON -DENABLE_OPENMP=ON -DENABLE_CUDA=OFF
make -j$(nproc)
# serial reference
./build/miniweather --backend serial -nx 64 -ny 64 -nz 64 -t 100
# OpenMP: 8 threads on one node
OMP_NUM_THREADS=8 ./build/miniweather --backend parallel -nx 64 -ny 64 -nz 64 -t 100
# MPI: 4 ranks
mpirun -np 4 ./build/miniweather --backend parallel -nx 64 -ny 64 -nz 64 -t 100
# GPU (requires -DENABLE_CUDA=ON build)
./build/miniweather --backend gpu -nx 128 -ny 128 -nz 128 -t 100
The repo ships ready-made batch scripts under scripts/ and slurm/
(serial, parallel, GPU, strong- and weak-scaling sweeps). The scaling submit script builds on
first run and then launches across nodes with srun:
#!/bin/bash
#SBATCH -J miniweather-scaling-cpu
#SBATCH --output=results/scaling%j.out
#SBATCH -N 4
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=512M
#SBATCH -t 00:15:00
cd $SLURM_SUBMIT_DIR
module purge
module load openmpi/4.1 gcc/12 cmake
if [ ! -f build/miniweather ]; then
mkdir -p build && cd build
cmake .. -DENABLE_MPI=ON -DENABLE_OPENMP=ON
make -j
cd ..
fi
echo "Test 3: 16 ranks (4 nodes)"
srun -N 4 -n 16 ./build/miniweather --backend parallel -nx 64 -ny 64 -nz 64 -t 10
sbatch scripts/submit_scale.slurm
sbatch scripts/submit_gpu.slurm
./run.sh scale — sweeps 1·2·4·8 ranks, writes a CSV
The resulting CSVs feed scripts/plot_results.py and
scripts/visualize_weather.py, which render the speedup tables and the field images you
see on the visualization page.
Every stage of this project is a concrete instance of a specific course session. The full outline lives on the course-structure page; the mapping below shows the through-line.
The OpenMP-vs-MPI split is the entire spine of the project: one address space vs. message passing.
Slab decomposition, non-blocking Isend/Irecv, Waitall, tag pairing, periodic neighbours.
parallel for collapse(2), simd, schedule(dynamic) — and why each was chosen.
The loop nest becomes a 3D thread grid; coalesced loads and a shared-memory tiling variant.
Cache blocking, prefetch, data locality, and the roofline / arithmetic-intensity analysis.
Per-phase timers (interior vs. halo), GLUPS, the strong/weak scaling sweep, perf stat.
The macro-guarded source compiles to a hybrid MPI+OpenMP build with no rewrite.
MiniWeather is the climate-systems hybrid-programming case study from this session.
The CUDA backend already does cudaSetDevice(rank % device_count) but skips the inter-GPU
halo. Add a device-to-device exchange each step (stage halos to host, or go GPU-aware) to scale across
all GPUs on a node and across nodes.
With a CUDA-aware OpenMPI build you can pass device pointers straight to
MPI_Isend/Irecv, letting the runtime use NVLink / GPUDirect RDMA and skip
the host bounce entirely — the biggest win for the memory-bound halo.
Sweep ranks × threads (e.g. 1×16, 2×8, 4×4) and pin with OMP_PROC_BIND=close +
--map-by socket. One rank per NUMA domain usually beats both extremes — section 5.1 made
measurable.
The current slab cut is 1D in z. A 2D or 3D MPI_Cart_create topology
shrinks the surface-to-volume ratio at high rank counts, directly attacking the strong-scaling
roll-off seen in the 16-rank row.
Swap the toy diffusion operator for the actual MiniWeather advection + buoyancy terms (the visualization already adds wind/buoyancy in 2D) to make it a genuine compressible-flow miniapp.
Add MPI-IO or HDF5 to dump fields for visualization and to checkpoint long runs — sessions 3.2 and 3.4 — instead of relying on the in-memory constant-field test.
halo.cpp)