Worked Example · Parallelizing MiniWeather · HPC

1 · Overview

What this project does, and why

MiniWeather is a deliberately small proxy for a climate / weather kernel: a single stencil sweep over a regular 3D grid, iterated for many timesteps. It is small enough to read in one sitting, but it stresses exactly the things that matter at scale — memory bandwidth, cache locality, inter-node communication, and accelerator offload.

The goal of this worked example is to take that one kernel and implement it four ways, measuring the speedup at each step. The physics never changes; only the way the work is mapped onto hardware does. That makes it the perfect vehicle for the parallel-programming arc of the syllabus: shared memory (OpenMP), distributed memory (MPI), and accelerators (CUDA), culminating in a quantitative scaling analysis.

Course sessions exercised

2.1 Distributed vs. shared memory — the conceptual split this whole project rests on.
2.2 MPI programming — domain decomposition + non-blocking halo exchange.
2.3 OpenMP — parallel for, collapse, simd.
2.4 GPU computing / CUDA — the stencil as a kernel with a 3D grid of threads.
3.1–3.3 Optimization — cache blocking, data locality, the roofline model.
5.1 / 5.5 Hybrid MPI+OpenMP and climate-systems modelling (the extensions).

The stencil, in one line.

Each cell is replaced by the average of its six face-neighbours plus four times itself, all divided by ten — a stable, diffusion-flavoured 7-point operator. Read the live 2D analogue on the visualization page.

Stack

C++17CMakeOpenMPMPI (OpenMPI 4.1) CUDA / HIPSLURMSIMD

Where the code lives.

src/stencil_cpu_serial.cpp · src/stencil_cpu_blocked.cpp · src/stencil_cpu_parallel.cpp · src/halo.cpp · src/stencil_gpu.cu. Driver: src/main.cpp; dispatch via --backend {serial|blocked|parallel|gpu}.

Step 1Serial One thread, triple loop. The reference for correctness and the denominator for every speedup. stencil_cpu_serial.cpp

→

Step 2OpenMP Threads across cores with collapse(2) + simd. Same node, more lanes. stencil_cpu_parallel.cpp

→

Step 3MPI Slab decomposition in z; non-blocking halo exchange between rank neighbours. halo.cpp

→

Step 4CUDA One cell ↦ one thread; 8×8×8 blocks; optional shared-memory tiling. stencil_gpu.cu

2 · The MiniWeather problem

A diffusion stencil and how we cut it up

The continuous model is a scalar diffusion (heat-equation-like) PDE on a 3D domain:

Continuous PDE $$\frac{\partial u}{\partial t} = \alpha\,\nabla^2 u = \alpha\!\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2}\right)$$

Discretising the Laplacian with second-order central differences on a uniform grid gives the classic 7-point stencil: the new value of a cell depends on itself and its six axis-aligned neighbours. The repository uses a specific stable weighting — centre weight 4, each neighbour weight 1, normalised by 10:

Discrete update (the kernel) $$u^{\,n+1}_{i,j,k} = \frac{1}{10}\Big( u^{\,n}_{i\pm1,j,k} + u^{\,n}_{i,j\pm1,k} + u^{\,n}_{i,j,k\pm1} + 4\,u^{\,n}_{i,j,k}\Big)$$

The six neighbour terms plus the weighted centre are exactly the seven array reads you will see repeated verbatim in every backend below. The grid is stored as a flat std::vector<double> with a one-cell halo (ghost layer) on every face, so the index of cell (i,j,k) is i·ny·nz + j·nz + k.

Domain decomposition

For the distributed runs the global grid is sliced into slabs along the z axis — one contiguous slab per MPI rank. Each rank owns local_nz = nz / size planes (the remainder is spread over the first few ranks), plus two ghost planes it does not own but needs to read:

src/stencil_cpu_parallel.cppreal

// Domain decomposition in z-dimension
int global_nz = cli.nz;
int local_nz  = global_nz / size;
int remainder = global_nz % size;
if (rank < remainder) local_nz += 1;

// Starting z-index this rank is responsible for
int z_start = 0;
for (int r = 0; r < rank; ++r) {
    int prev = global_nz / size;
    if (r < remainder) prev += 1;
    z_start += prev;
}

// Local arrays carry a 1-cell halo on every face
int nx = cli.nx + 2, ny = cli.ny + 2, nz = local_nz + 2;

Slicing along z (the fastest-varying / contiguous axis here) means each rank's halo is a single nx × ny plane that is already contiguous-ish to pack — which is exactly what makes the halo exchange in section 5 cheap.

3 · Serial baseline

The reference implementation

Everything is measured against this. A single thread sweeps the interior of the grid, applies the 7-point operator into a next buffer, then ping-pong swaps current and next for the following timestep. Double-buffering avoids read-after-write hazards: we never overwrite a cell another cell still needs to read this step.

src/stencil_cpu_serial.cpp · run_stencil_cpu_serial()real

// Time stepping loop
for (int step = 0; step < cli.steps; ++step) {
    for (int i = 1; i <= cli.nx; ++i) {
        for (int j = 1; j <= cli.ny; ++j) {
            for (int k = 1; k <= cli.nz; ++k) {
                int idx = i * ny * nz + j * nz + k;

                // 7-Point Stencil
                next[idx] = (current[(i-1) * ny * nz + j * nz + k] +
                             current[(i+1) * ny * nz + j * nz + k] +
                             current[i * ny * nz + (j-1) * nz + k] +
                             current[i * ny * nz + (j+1) * nz + k] +
                             current[i * ny * nz + j * nz + (k-1)] +
                             current[i * ny * nz + j * nz + (k+1)] +
                             current[idx] * 4.0) / 10.0;
            }
        }
    }
    std::swap(current, next);   // ping-pong buffers
}

The headline metric is GLUPS — billions of grid-point updates per second: glups = nx·ny·nz·steps / time / 1e9. Because every backend computes the identical operator, GLUPS is directly comparable across all four, and a built-in constant-field test (a uniform field must stay uniform under this operator) catches correctness regressions.

4 · Shared memory · OpenMP

Threads across the cores of one node

The first parallelization keeps the algorithm identical but spreads iterations of the outer loops across threads. Two pragmas do almost all the work: #pragma omp parallel for collapse(2) fuses the i and j loops into one big iteration space so OpenMP has enough independent work to balance across many cores, and #pragma omp simd tells the compiler the innermost k loop is vectorizable — consecutive k values are adjacent in memory, so the loads and the FMA can use SIMD lanes.

src/stencil_cpu_parallel.cpp · interior computereal

#ifdef MINIW_OMP
#pragma omp parallel for collapse(2)
#endif
for (int i = 1; i <= cli.nx; ++i) {
    for (int j = 1; j <= cli.ny; ++j) {
#ifdef MINIW_OMP
#pragma omp simd
#endif
        for (int k = 1; k <= local_nz; ++k) {
            int idx = i * ny * nz + j * nz + k;

            // 7-Point Stencil (identical math to the serial version)
            next[idx] = (current[(i-1) * ny * nz + j * nz + k] +
                         current[(i+1) * ny * nz + j * nz + k] +
                         current[i * ny * nz + (j-1) * nz + k] +
                         current[i * ny * nz + (j+1) * nz + k] +
                         current[i * ny * nz + j * nz + (k-1)] +
                         current[i * ny * nz + j * nz + (k+1)] +
                         current[idx] * 4.0) / 10.0;
        }
    }
}

Why `collapse(2)` and not `collapse(3)`?

Collapsing only the outer two loops leaves the contiguous k loop intact for the vectorizer and the hardware prefetcher. Collapsing all three would shatter that locality and force a more expensive index computation per iteration. Threads get whole (i,j) columns — big, cache-friendly chunks of work.

The cache-blocked cousin

stencil_cpu_blocked.cpp goes further: it tiles the i,j plane into Ti × Tj blocks sized for L1/L2, adds schedule(dynamic) for load balance, software-prefetches k+4 ahead, and supports temporal blocking (--tsteps). Same FLOPs, fewer cache misses — the section 3.1/3.3 optimization story made concrete.

src/stencil_cpu_blocked.cpp · tiled + prefetched loop nestreal

#ifdef MINIW_OMP
#pragma omp parallel for collapse(2) schedule(dynamic)
#endif
for (int ii = 1; ii <= nx; ii += Ti) {
    for (int jj = 1; jj <= ny; jj += Tj) {
        int i_end = std::min(ii + Ti - 1, nx);
        int j_end = std::min(jj + Tj - 1, ny);
        for (int i = ii; i <= i_end; ++i)
          for (int j = jj; j <= j_end; ++j)
#ifdef MINIW_OMP
#pragma omp simd
#endif
            for (int k = 1; k <= local_nz; ++k) {
                if (k + 4 < NZ) __builtin_prefetch(&current[idx(i, j, k + 4)], 0, 1);
                next[idx(i, j, k)] =
                    ( current[idx(i-1, j, k)] + current[idx(i+1, j, k)] +
                      current[idx(i, j-1, k)] + current[idx(i, j+1, k)] +
                      current[idx(i, j, k-1)] + current[idx(i, j, k+1)] +
                      current[idx(i, j, k)] * 4.0 ) / 10.0;
            }
    }
}

5 · Distributed memory · MPI

Across nodes, with a halo exchange

OpenMP runs out of road at the edge of a node: it cannot reach memory on another machine. MPI crosses that boundary by giving each rank its own slab (section 2) and having neighbouring ranks swap the one plane of cells they share. That plane is the halo (a.k.a. ghost cells): rank r cannot compute the stencil on its boundary plane without the adjacent plane that physically lives on rank r±1.

The exchange is split into a non-blocking start (post all sends/receives) and a finish (wait + unpack). Posting MPI_Irecv/MPI_Isend first lets the network move data while the rank computes its interior — communication/computation overlap, the single most important MPI optimization for stencils.

src/halo.cpp · HaloExchange::start_exchange()real

void HaloExchange::start_exchange(std::vector<double>& field) {
    // Pack the first owned z-plane into a contiguous send buffer
    for (int i = 0; i < nx; ++i)
        for (int j = 0; j < ny; ++j)
            send_left[i * ny + j]  = field[i * ny * nz + j * nz + 1];

    // Pack the last owned z-plane
    for (int i = 0; i < nx; ++i)
        for (int j = 0; j < ny; ++j)
            send_right[i * ny + j] = field[i * ny * nz + j * nz + (nz-2)];

    // Post non-blocking recvs first, then matching sends (tags pair up neighbours)
    MPI_Irecv(recv_left.data(),  nx*ny, MPI_DOUBLE, left_rank,  0, MPI_COMM_WORLD, &recv_requests[0]);
    MPI_Irecv(recv_right.data(), nx*ny, MPI_DOUBLE, right_rank, 1, MPI_COMM_WORLD, &recv_requests[1]);
    MPI_Isend(send_left.data(),  nx*ny, MPI_DOUBLE, left_rank,  1, MPI_COMM_WORLD, &send_requests[0]);
    MPI_Isend(send_right.data(), nx*ny, MPI_DOUBLE, right_rank, 0, MPI_COMM_WORLD, &send_requests[1]);
}

src/halo.cpp · HaloExchange::finish_exchange()real

void HaloExchange::finish_exchange(std::vector<double>& field) {
    MPI_Waitall(2, recv_requests, MPI_STATUSES_IGNORE);

    // Unpack received planes into the ghost layers (z = 0 and z = nz-1)
    for (int i = 0; i < nx; ++i)
        for (int j = 0; j < ny; ++j)
            field[i * ny * nz + j * nz + 0]        = recv_left[i * ny + j];

    for (int i = 0; i < nx; ++i)
        for (int j = 0; j < ny; ++j)
            field[i * ny * nz + j * nz + (nz-1)] = recv_right[i * ny + j];

    MPI_Waitall(2, send_requests, MPI_STATUSES_IGNORE);
}

Note the crossed tags

The send to left_rank uses tag 1 while the receive from left_rank uses tag 0 — and vice-versa on the right. This tag pairing is what lets a rank's "right send" match its neighbour's "left receive" unambiguously, even though every rank runs the identical code. Neighbours are computed with wrap-around ((rank ± 1 + size) % size), i.e. a periodic 1D process topology.

How it slots into the timestep

The driver calls start_exchange(), computes the interior (which only needs owned cells), then calls finish_exchange() before the next step needs the halos. The same HaloExchange object is reused by both the plain parallel and the cache-blocked CPU backends, and the loop separately times interior vs. halo so you can see communication cost grow with rank count.

Distributed + shared, together. Because the OpenMP pragmas and the MPI calls are guarded by independent MINIW_OMP / MINIW_MPI macros, the very same source compiles to a hybrid build: R ranks (one per node/socket) each running T OpenMP threads. That is the section-5.1 hybrid pattern, with no code rewrite.

6 · Accelerators · CUDA

One cell, one thread

On the GPU the loop nest disappears entirely. Instead of one thread walking millions of cells, we launch millions of threads, each responsible for a single cell. The triple loop becomes three lines of index arithmetic derived from the thread's position in the grid. The + 1 skips the halo, and a bounds check handles grids that don't divide evenly by the block size.

src/stencil_gpu.cu · __global__ stencil_kernelreal

__global__ void stencil_kernel(
    const double* __restrict__ current,
    double* __restrict__ next,
    int nx, int ny, int nz,
    int stride_y, int stride_z)
{
    // 3D thread indexing (+1 skips the halo)
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;
    int k = blockIdx.z * blockDim.z + threadIdx.z + 1;

    if (i <= nx && j <= ny && k <= nz) {
        int idx = i * stride_y * stride_z + j * stride_z + k;
        next[idx] = (
            current[(i-1) * stride_y * stride_z + j * stride_z + k] +
            current[(i+1) * stride_y * stride_z + j * stride_z + k] +
            current[i * stride_y * stride_z + (j-1) * stride_z + k] +
            current[i * stride_y * stride_z + (j+1) * stride_z + k] +
            current[i * stride_y * stride_z + j * stride_z + (k-1)] +
            current[i * stride_y * stride_z + j * stride_z + (k+1)] +
            current[idx] * 4.0
        ) / 10.0;
    }
}

The launch configuration

Threads are grouped into 8×8×8 = 512-thread blocks; the grid is sized to cover the whole domain with a ceiling division. Each step launches the kernel, synchronises, and pointer-swaps the device buffers — the GPU analogue of the serial ping-pong.

Coalescing & the shared-memory variant

Because k is the contiguous axis and maps to threadIdx.x's fastest dimension, neighbouring threads read neighbouring addresses — coalesced global loads. A second kernel, stencil_kernel_shared, stages an 10³ tile (8³ interior + 1 halo each side) into __shared__ memory so the six neighbour reads hit on-chip memory instead of re-reading global DRAM.

src/stencil_gpu.cu · grid/block launch + step loopreal

// Kernel configuration
dim3 block(8, 8, 8);
dim3 grid(
    (nx + block.x - 1) / block.x,
    (ny + block.y - 1) / block.y,
    (local_nz + block.z - 1) / block.z
);

for (int step = 0; step < cli.steps; ++step) {
    stencil_kernel<<<grid, block>>>(
        d_current, d_next, nx, ny, local_nz, NY, NZ);
    cudaDeviceSynchronize();
    std::swap(d_current, d_next);   // swap device pointers
}

Honest about the code. The committed CUDA backend selects a device per rank (cudaSetDevice(rank % device_count)) so it is multi-GPU-ready, but the per-step inter-GPU halo exchange is still marked // TODO. Single-GPU and single-node multi-GPU-without-exchange runs are correct today; closing that TODO is the first extension below.

7 · Scaling study

How fast does it actually get?

With all four backends in hand we run a sweep and reduce it to two questions. Strong scaling: fix the problem (here 64³ for 100 steps) and add workers — does it get faster? Weak scaling: grow the problem in lockstep with the workers — does each step stay the same length? The numbers below are representative figures used to illustrate the method (the repo ships the harness — run.sh scale and the submit_scale.slurm script — for collecting your own on the cluster; RESULTS.md is the template you fill in).

Strong scaling — fixed 64³, 100 steps

Representative single-node / small-cluster figures (FP64). Efficiency E = S / N. The 16-rank row shows the classic strong-scaling roll-off: halo surface grows relative to interior volume, so communication starts to dominate.
Configuration	Workers	Time (s)	Speedup S	Efficiency E	GLUPS
Serial baseline	1	12.80	1.00×	100%	0.21
OpenMP (1 node)	4 thr	3.55	3.61×	90%	0.74
OpenMP (1 node)	8 thr	2.05	6.24×	78%	1.28
MPI (1 rank/node)	4 ranks	3.70	3.46×	86%	0.71
MPI	8 ranks	2.15	5.95×	74%	1.22
MPI	16 ranks	1.45	8.83×	55%	1.81
CUDA (1 GPU)	1 GPU	0.42	30.5×	—	6.25

Weak scaling — 64³ per rank

Weak scaling holds up far better than strong scaling: per-rank work is constant, and only the fixed-size halo plane is exchanged, so time/step rises only slowly.
Ranks	Global grid	Time/step (ms)	Weak efficiency
1	64³	36.0	100%
2	64²×128	37.4	96%
4	64²×256	38.9	93%
8	64²×512	40.6	89%

Amdahl's law — the strong-scaling ceiling

If a fraction p of the work is parallelizable and (1−p) is inherently serial, the best speedup on N workers is bounded:

Amdahl $$S(N) = \frac{1}{(1-p) + \dfrac{p}{N}} \;\xrightarrow[N\to\infty]{}\; \frac{1}{1-p}$$

Back-solving from the 8-thread OpenMP row (S ≈ 6.24 at N = 8) gives p ≈ 0.96 — about 4% serial overhead (allocation, the constant-field check, loop ramp-up). That caps the maximum attainable speedup at roughly 1/(1−0.96) ≈ 25× no matter how many cores you add. This is why strong scaling alone always eventually stalls.

Gustafson's law — the weak-scaling escape

Gustafson reframes the question: on a bigger machine you usually solve a bigger problem, not the same one faster. The scaled speedup grows linearly with N:

Gustafson $$S_{\text{scaled}}(N) = N - (1-p)\,(N-1)$$

With the same p ≈ 0.96, eight ranks of weak scaling give a scaled speedup of about 8 − 0.04·7 ≈ 7.7 — comfortably above what Amdahl permits for the fixed-size problem. The 89% weak-efficiency row is the experimental shadow of exactly this law: MiniWeather is a weak-scaling success story, which is precisely why real climate codes scale to thousands of nodes by refining the mesh.

Roofline & arithmetic intensity

Whether any of this is even worth optimizing depends on whether the kernel is compute-bound or memory-bound. The 7-point stencil does 8 floating-point operations per cell update (6 adds + 1 multiply + 1 divide). In the streaming limit it must read the centre and write the result — 2 doubles = 16 bytes of unavoidable DRAM traffic per update (neighbour reads are ideally served from cache). Arithmetic intensity is therefore:

Arithmetic intensity $$I = \frac{\text{FLOPs}}{\text{bytes}} = \frac{8}{16} = 0.5\ \text{FLOP/byte}$$

On the roofline model, performance is min(peak FLOP/s, I × bandwidth). At I = 0.5, even a modest 100 GB/s memory system caps the kernel at ≈ 50 GFLOP/s — far below any modern CPU/GPU's peak FLOP rate. MiniWeather is firmly memory-bandwidth-bound. Three consequences follow directly, and explain every optimization above:

Bandwidth, not FLOPs, is the budget — so cache blocking and SIMD (which raise the effective intensity by reusing neighbour loads from cache) help; squeezing the arithmetic would not.
The GPU wins big — its >1 TB/s HBM bandwidth lifts the roofline by an order of magnitude, which is why the CUDA row is ~30× the serial CPU.
Halo exchange must overlap compute — added bytes on the wire eat directly into the same bandwidth budget that already limits us.

8 · Build & run

From clone to cluster job

Configure & build (CMake)

Feature backends are toggled at configure time, so a single binary miniweather can embed serial, OpenMP, MPI and (optionally) CUDA paths:

buildcmd

# one-shot helper (loads modules, configures, builds)
./run.sh build

# or by hand:
mkdir build && cd build
cmake .. -DENABLE_MPI=ON -DENABLE_OPENMP=ON -DENABLE_CUDA=OFF
make -j$(nproc)

Run locally

runcmd

# serial reference
./build/miniweather --backend serial -nx 64 -ny 64 -nz 64 -t 100

# OpenMP: 8 threads on one node
OMP_NUM_THREADS=8 ./build/miniweather --backend parallel -nx 64 -ny 64 -nz 64 -t 100

# MPI: 4 ranks
mpirun -np 4 ./build/miniweather --backend parallel -nx 64 -ny 64 -nz 64 -t 100

# GPU (requires -DENABLE_CUDA=ON build)
./build/miniweather --backend gpu -nx 128 -ny 128 -nz 128 -t 100

Submit to SLURM

The repo ships ready-made batch scripts under scripts/ and slurm/ (serial, parallel, GPU, strong- and weak-scaling sweeps). The scaling submit script builds on first run and then launches across nodes with srun:

scripts/submit_scale.slurmreal

#!/bin/bash
#SBATCH -J miniweather-scaling-cpu
#SBATCH --output=results/scaling%j.out
#SBATCH -N 4
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=512M
#SBATCH -t 00:15:00

cd $SLURM_SUBMIT_DIR
module purge
module load openmpi/4.1 gcc/12 cmake

if [ ! -f build/miniweather ]; then
    mkdir -p build && cd build
    cmake .. -DENABLE_MPI=ON -DENABLE_OPENMP=ON
    make -j
    cd ..
fi

echo "Test 3: 16 ranks (4 nodes)"
srun -N 4 -n 16 ./build/miniweather --backend parallel -nx 64 -ny 64 -nz 64 -t 10

Submit it.

sbatch scripts/submit_scale.slurm
sbatch scripts/submit_gpu.slurm
./run.sh scale — sweeps 1·2·4·8 ranks, writes a CSV

The resulting CSVs feed scripts/plot_results.py and scripts/visualize_weather.py, which render the speedup tables and the field images you see on the visualization page.

9 · Mapping to learning outcomes

Where each step lands in the syllabus

Every stage of this project is a concrete instance of a specific course session. The full outline lives on the course-structure page; the mapping below shows the through-line.

Session 2.1

Distributed vs. shared memory

The OpenMP-vs-MPI split is the entire spine of the project: one address space vs. message passing.

Session 2.2

MPI programming

Slab decomposition, non-blocking Isend/Irecv, Waitall, tag pairing, periodic neighbours.

Session 2.3

OpenMP for multithreading

parallel for collapse(2), simd, schedule(dynamic) — and why each was chosen.

Session 2.4

GPU computing / CUDA

The loop nest becomes a 3D thread grid; coalesced loads and a shared-memory tiling variant.

Session 3.1 / 3.3

Optimization & memory hierarchy

Cache blocking, prefetch, data locality, and the roofline / arithmetic-intensity analysis.

Session 3.2

Profiling & performance analysis

Per-phase timers (interior vs. halo), GLUPS, the strong/weak scaling sweep, perf stat.

Session 5.1

Hybrid computing patterns

The macro-guarded source compiles to a hybrid MPI+OpenMP build with no rewrite.

Session 5.5

Climate / terrestrial modelling

MiniWeather is the climate-systems hybrid-programming case study from this session.

10 · Extensions

Where to take it next

Close the multi-GPU TODO

The CUDA backend already does cudaSetDevice(rank % device_count) but skips the inter-GPU halo. Add a device-to-device exchange each step (stage halos to host, or go GPU-aware) to scale across all GPUs on a node and across nodes.

GPU-aware MPI

With a CUDA-aware OpenMPI build you can pass device pointers straight to MPI_Isend/Irecv, letting the runtime use NVLink / GPUDirect RDMA and skip the host bounce entirely — the biggest win for the memory-bound halo.

Tune the hybrid balance

Sweep ranks × threads (e.g. 1×16, 2×8, 4×4) and pin with OMP_PROC_BIND=close + --map-by socket. One rank per NUMA domain usually beats both extremes — section 5.1 made measurable.

2D / 3D decomposition

The current slab cut is 1D in z. A 2D or 3D MPI_Cart_create topology shrinks the surface-to-volume ratio at high rank counts, directly attacking the strong-scaling roll-off seen in the 16-rank row.

Real physics

Swap the toy diffusion operator for the actual MiniWeather advection + buoyancy terms (the visualization already adds wind/buoyancy in 2D) to make it a genuine compressible-flow miniapp.

Parallel I/O & checkpointing

Add MPI-IO or HDF5 to dump fields for visualization and to checkpoint long runs — sessions 3.2 and 3.4 — instead of relying on the in-memory constant-field test.

11 · References & links

Source, course, reading

In this repository

source/src/ — the four backends (serial, blocked, parallel, GPU + halo.cpp)
source/scripts/ & slurm/ — SLURM submit + plotting/visualization helpers
source/docs/ — ARCHITECTURE, RESULTS, SYSTEM, reproduce guide
Interactive visualization — the live 2D analogue of this kernel
Source overview — what every directory contains
Course structure — the full 30-session outline

Background reading

Robey & Zamora (2021), Parallel and High Performance Computing, Manning. ISBN 9781617296468 — OpenMP/MPI/GPU and the tsunami-on-GPUs case study.
Sterling, Anderson, Brodowicz & Bell (2018), High Performance Computing: Modern Systems and Practices, Morgan Kaufmann. ISBN 9780124201583.
Hager & Wellein (2019), Introduction to HPC for Scientists and Engineers, Taylor & Francis. ISBN 9780367221300 — roofline & stencil optimization.
Williams, Waterman & Patterson (2009), "Roofline: an insightful visual performance model," CACM.
Amdahl (1967) & Gustafson (1988) — the two scaling laws used in section 7.
ORNL MiniWeather — the original miniapp this project is modelled on.

Parallelizing MiniWeather:serial → OpenMP → MPI → CUDA

What this project does, and why

Course sessions exercised

Stack

A diffusion stencil and how we cut it up

Domain decomposition

The reference implementation

Threads across the cores of one node

Why collapse(2) and not collapse(3)?

The cache-blocked cousin

Across nodes, with a halo exchange

Note the crossed tags

How it slots into the timestep

One cell, one thread

The launch configuration

Coalescing & the shared-memory variant

How fast does it actually get?

Strong scaling — fixed 64³, 100 steps

Weak scaling — 64³ per rank

Amdahl's law — the strong-scaling ceiling

Gustafson's law — the weak-scaling escape

Roofline & arithmetic intensity

From clone to cluster job

Configure & build (CMake)

Run locally

Submit to SLURM

Where each step lands in the syllabus

Distributed vs. shared memory

MPI programming

OpenMP for multithreading

GPU computing / CUDA

Optimization & memory hierarchy

Profiling & performance analysis

Hybrid computing patterns

Climate / terrestrial modelling

Where to take it next

Close the multi-GPU TODO

GPU-aware MPI

Tune the hybrid balance

2D / 3D decomposition

Real physics

Parallel I/O & checkpointing

Source, course, reading

In this repository

Background reading

Parallelizing MiniWeather:
serial → OpenMP → MPI → CUDA

Why `collapse(2)` and not `collapse(3)`?