hpc-lab · example project — parallelising a scientific workload end to end

Example project — parallelising a scientific workload, end to end

This page is a worked, start-to-finish HPC case study: it takes one concrete scientific kernel and carries it through every stage the syllabus's group project asks for — "design, implementation, and optimization of HPC solutions for real-world problems". We pick a problem that is relatable, physically meaningful, and structurally rich enough to exercise the whole course: a 2-D heat-diffusion simulation solved with a 5-point Jacobi stencil on a regular grid.

The kernel is ideal because it is embarrassingly parallel at the level of a single cell, yet has real communication structure the moment you split the grid across processors: each tile needs a one-cell-thick border of its neighbours' data (the halo) every step. That single fact is what turns a toy into a genuine HPC problem — it forces us through Amdahl's law, OpenMP threading, MPI domain decomposition and halo exchange, GPU offload, the roofline model, and even the distributed-ML module, because a gradient all-reduce is the same communication pattern as a halo exchange. Every stage below cross-links to the matching live demo on the demos page.

kernel

2-D Jacobi

grid

N × N = 4096²

baseline

Python / NumPy

shared mem

OpenMP

distributed

MPI

accelerator

CUDA

bottleneck

memory-bound

maps to

whole syllabus

The nine stages of the project, each a section below — with its live demo anchor:

1 · problem & baseline 2 · is it worth it? (Amdahl) 3 · shared memory (OpenMP) 4 · distributed (MPI + halo) 5 · GPU offload 6 · performance analysis 7 · robustness at scale 8 · scaling out to ML 9 · results & reflection

A note on the numbers. The scaling figures in this study are modelled / expected, derived from the kernel's arithmetic intensity and well-known scaling laws — not measurements from a real cluster run. They are internally consistent and chosen to be representative of what a memory-bound stencil actually does, which is exactly the kind of analysis the project deliverable rewards.

1 · Problem & baseline

Heat spreads through a plate over time. The governing PDE is the heat equation, which says the rate of change of temperature at a point is proportional to how curved the temperature field is there (its Laplacian):

$$ \frac{\partial u}{\partial t} = \alpha \nabla^2 u = \alpha\!\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) $$

We discretise space into an $N \times N$ grid with spacing $h$ and march in time with step $\Delta t$. Replacing each second derivative with the standard central difference and choosing the timestep so the scheme is the simplest stable explicit one ($\alpha\,\Delta t / h^2 = \tfrac14$) collapses the update to a clean average of the four orthogonal neighbours — the 5-point Jacobi stencil:

$$ u^{t+1}_{i,j} = \tfrac14\left(u^{t}_{i-1,j} + u^{t}_{i+1,j} + u^{t}_{i,j-1} + u^{t}_{i,j+1}\right) $$

Every cell at step $t{+}1$ reads only the previous step's values, so the entire grid can be updated independently — the source of all the parallelism we are about to exploit. We keep two buffers and swap them (ping-pong) so reads and writes never alias. Convergence is declared when the largest change between steps, the residual $r = \max_{i,j}\,|u^{t+1}_{i,j} - u^{t}_{i,j}|$, falls below a tolerance.

jacobi.py — serial NumPy baseline

import numpy as np

def jacobi(u, tol=1e-4, max_iter=20000):
    # u: (N, N) float64 grid; fixed Dirichlet values on the border.
    new = u.copy()
    for it in range(max_iter):
        # vectorised 5-point average of the interior cells
        new[1:-1, 1:-1] = 0.25 * (
            u[:-2, 1:-1] + u[2:, 1:-1] +     # north + south
            u[1:-1, :-2] + u[1:-1, 2:])      # west  + east
        res = np.max(np.abs(new[1:-1, 1:-1] - u[1:-1, 1:-1]))
        u, new = new, u                      # ping-pong the buffers
        if res < tol:
            break
    return u, it, res

Where the time goes. Before optimising anything, profile. Under cProfile the whole runtime sits inside jacobi(); switching to line_profiler (the @profile decorator + kernprof -l) attributes it to a single line — the stencil average. That line is the hot loop: roughly 96–98% of wall-clock for a large grid. Everything else (the residual reduction, the buffer swap, the convergence test) is the serial remainder.

$ kernprof -l -v jacobi.py
Line #   % Time  Line Contents
=========================================
    7     0.4    new = u.copy()
    9    96.8    new[1:-1,1:-1] = 0.25 * ( ... )   # <-- the hot loop
   12     2.4    res = np.max(np.abs(new - u))
   13     0.3    u, new = new, u
   14     0.1    if res < tol: break

The arithmetic per cell is tiny — 4 additions and 1 multiply — but each cell touches 4 neighbours plus itself. That ratio of flops to bytes moved is what will make this kernel memory-bound, a conclusion we will pin down precisely in the performance analysis via the roofline demo.

work

The 5-point stencil sweep over all interior cells — the parallelisable 96.8%.

reduce

Global max-residual over the grid — a reduction, becomes a collective under MPI.

control

Buffer swap + convergence test — the inherently serial remainder.

2 · Is it worth parallelising?

Never parallelise on faith — estimate the ceiling first. Of the total work, the fraction that can run in parallel is $p$; the rest, $s = 1 - p$, is stubbornly serial. From the profile the stencil sweep is the parallel part. Being conservative and folding the residual reduction's setup into the serial share, take $p = 0.97$, so $s = 0.03$. Amdahl's law bounds the speedup on $P$ processors when the problem size is fixed:

$$ S_{\text{Amdahl}}(P) = \frac{1}{s + \dfrac{p}{P}} = \frac{1}{(1-p) + \dfrac{p}{P}} \qquad\xrightarrow[P\to\infty]{}\qquad \frac{1}{s} $$

worked example Amdahl ceiling at $p = 0.97$

Plug $s = 0.03$, $p = 0.97$ into $S(P) = 1/(0.03 + 0.97/P)$ for a fixed $4096^2$ grid:

P (cores)	denominator	speedup S	efficiency S/P
8	0.03 + 0.121 = 0.151	6.6×	83%
16	0.03 + 0.061 = 0.091	11.0×	69%
64	0.03 + 0.015 = 0.045	22.1×	35%
256	0.03 + 0.0038 = 0.0338	29.6×	12%
∞	0.03	33.3×	→ 0%

Even with infinite cores a 3% serial fraction caps us at 33.3×. By 64 cores we are already at one-third efficiency — the serial residual test is becoming the wall. Takeaway: shaving the serial part (a parallel reduction instead of a global one) matters more than adding cores.

But the picture changes if the problem grows. In real HPC we rarely keep the grid fixed — we use more cores to solve a bigger problem in the same time. That is Gustafson's law, which measures scaled speedup and is far more optimistic:

$$ S_{\text{Gustafson}}(P) = s + p\cdot P = P - s\,(P - 1) $$

worked example Gustafson scaled speedup at $s = 0.03$

With $s = 0.03$, $p = 0.97$: $S(P) = 0.03 + 0.97P$. Each core handles its own tile, so the parallel work grows with $P$ while the serial part stays roughly constant.

P (cores)	0.03 + 0.97·P	scaled speedup	efficiency
8	0.03 + 7.76	7.8×	97%
64	0.03 + 62.1	62.1×	97%
256	0.03 + 248.3	248.3×	97%

Under weak scaling the efficiency holds near 97% all the way up. Reconciliation: Amdahl answers "faster on the same grid?" (limited), Gustafson answers "bigger grid in the same time?" (scales). Our stencil is a Gustafson workload — we grow $N$ with the core count, so the project targets weak scaling.

Decision A 97% parallel kernel is worth parallelising — but only if we also keep the problem growing. We will report both a strong-scaling table (Amdahl regime, fixed grid) and a weak-scaling table (Gustafson regime, grid grows with cores) in section 6. Explore the curves in the Amdahl / Gustafson demo.

3 · Shared-memory parallelism with OpenMP

The first real implementation step targets a single multi-core node. All threads share the same grid in memory, so we do not move data — we just divide the rows of the sweep among threads. In C, one #pragma omp parallel for over the outer loop expresses the whole thing; OpenMP forks a team of threads, splits the iteration space, and joins at the end of the loop. This is the SIMD/SPMD idea: many threads, one instruction stream, different data. Compare shared-memory vs message-passing models in the Flynn's taxonomy demo.

jacobi_omp.c — threaded stencil sweep

#pragma omp parallel for collapse(2) schedule(static) \
        reduction(max:res)
for (int i = 1; i < N-1; i++) {
    for (int j = 1; j < N-1; j++) {
        double v = 0.25 * (u[i-1][j] + u[i+1][j] +
                            u[i][j-1] + u[i][j+1]);
        double d = fabs(v - u[i][j]);
        if (d > res) res = d;          /* combined by reduction(max) */
        new[i][j] = v;
    }
}
/* swap u and new pointers, then test res < tol */

Scheduling. For a uniform stencil every row costs the same, so schedule(static) — a fixed, contiguous block of rows per thread — is exactly right: it has zero scheduling overhead and keeps each thread working on a contiguous, cache-friendly slab. schedule(dynamic) would only pay off if cell costs were irregular (e.g. an adaptive grid); here its per-chunk bookkeeping is pure overhead.

worked example Static load balance, 4096 rows on 8 threads

The sweep updates $N - 2 = 4094$ interior rows. schedule(static) hands each of 8 threads $\lfloor 4094/8 \rfloor = 511$ rows, with the first $4094 \bmod 8 = 6$ threads taking one extra.

threads 0–5: 512 rows each → 6 × 512 = 3072 rows
threads 6–7: 511 rows each → 2 × 511 = 1022 rows
total: 3072 + 1022 = 4094 rows ✓, max imbalance = 1 row = 0.2%

Imbalance under 1% means the static split is essentially perfect for this regular grid — no reason to reach for dynamic scheduling. The slowest thread sets the step time, so 0.2% imbalance costs us at most 0.2% of the parallel speedup.

Two shared-memory hazards the project must address explicitly:

False sharing. Two threads writing to different cells that happen to fall on the same 64-byte cache line force that line to bounce between cores. Because we partition by whole rows (large contiguous blocks), the shared boundary is just one cache line per thread pair — negligible. Splitting by column would have been a disaster.
NUMA & first-touch. On a multi-socket node, physical pages are placed on the socket of the thread that first writes them. So we initialise the grid with the same parallel for partitioning we use for the sweep — each thread first-touches the rows it will own, placing that memory on its local socket and avoiding slow cross-socket reads every step.

Result On one 8-core socket we expect a near-linear ~7× step-time speedup (matching the Amdahl 8-core figure of 6.6× once memory bandwidth is accounted for). See the OpenMP threads demo.

4 · Distributed-memory parallelism with MPI

One node runs out of cores and memory eventually. To use a whole cluster, each process owns a private slice of the grid in its own address space and the only way to share data is to send messages. We split the $N \times N$ grid by domain decomposition into a $\sqrt{P} \times \sqrt{P}$ array of tiles, one per MPI rank. A rank can update its interior alone — but the cells on a tile's edge need the neighbouring tile's edge values, which live in another process's memory.

Halo / ghost cells. Each tile is padded with a one-cell-thick border — the halo — that holds a copy of the neighbour's edge. Before every Jacobi step, neighbours exchange these borders. After the exchange, every cell has all four neighbours locally and the sweep proceeds exactly as in the serial code. This is the heartbeat of the program.

jacobi_mpi.c — one timestep on each rank

/* 1. exchange halos with the 4 cartesian neighbours (non-blocking) */
MPI_Irecv(halo_top,    N_loc, MPI_DOUBLE, nbr_up,   0, comm, &r[0]);
MPI_Irecv(halo_bottom, N_loc, MPI_DOUBLE, nbr_down, 1, comm, &r[1]);
MPI_Isend(row_first,   N_loc, MPI_DOUBLE, nbr_up,   1, comm, &r[2]);
MPI_Isend(row_last,    N_loc, MPI_DOUBLE, nbr_down, 0, comm, &r[3]);
/* ... same for left/right neighbours ... */
MPI_Waitall(8, r, MPI_STATUSES_IGNORE);

/* 2. local Jacobi sweep over this rank's tile (halos now valid) */
double res_local = stencil_sweep(u, new, n_loc, m_loc);

/* 3. global convergence: combine every rank's residual */
MPI_Allreduce(&res_local, &res_global, 1, MPI_DOUBLE, MPI_MAX, comm);
if (res_global < tol) break;

Two communication costs dominate, and both are worth modelling with the standard $\alpha$–$\beta$ model where $\alpha$ is per-message latency and $\beta$ is per-byte inverse bandwidth:

Halo exchange — each rank sends 4 borders of $\sqrt{N^2/P}$ doubles to fixed neighbours. Cost per step $\approx 4(\alpha + \beta\cdot 8\sqrt{N^2/P})$. Crucially this is independent of $P$ for a fixed tile size — it is nearest-neighbour, so it scales.
Allreduce — the global residual is one collective over all $P$ ranks. A naïve linear reduction costs $O(P)$ messages; a tree-based reduction costs $O(\log_2 P)$. MPI implementations use the tree, which is why the collective stays cheap at scale.

worked example Tree vs linear Allreduce at $P = 1024$

A scalar MPI_Allreduce over $P$ ranks. Count the communication rounds (each round $\approx \alpha$ latency) for the linear and the tree algorithms.

P	linear (P−1)	tree (2·⌈log₂P⌉)	speedup
16	15	8	1.9×
64	63	12	5.3×
1024	1023	20	51×

At 1024 ranks the tree does its reduce-then-broadcast in $2\lceil\log_2 1024\rceil = 20$ latency-bound rounds versus 1023 for the linear chain — a 51× reduction in latency cost. This is why the residual test, which Amdahl flagged as the serial bottleneck, must use a collective, never a hand-rolled gather-to-rank-0.

Overlapping communication and computation. The non-blocking MPI_Isend/Irecv above let us start the halo exchange, immediately update the tile's interior (which doesn't need halos) while messages are in flight, then update the thin boundary ring once MPI_Waitall returns. Done well this hides almost all communication latency behind useful work. Walk the exchange step-by-step in the MPI halo-exchange demo; the message-passing model itself is the right column of the Flynn taxonomy demo.

5 · GPU offload

A stencil is a near-perfect fit for a GPU: thousands of identical, independent cell updates map directly onto thousands of hardware threads. We launch one CUDA thread per interior cell, grouped into 2-D thread blocks (e.g. $16 \times 16 = 256$ threads) tiled across the grid. The GPU runs threads in lockstep warps of 32 under the SIMT (Single Instruction, Multiple Thread) model — the GPU corner of Flynn's taxonomy.

jacobi.cu — one thread per cell

__global__ void jacobi_step(const double* u, double* out, int N) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (i > 0 && i < N-1 && j > 0 && j < N-1) {
        out[i*N + j] = 0.25 * ( u[(i-1)*N + j] + u[(i+1)*N + j]
                              + u[i*N + (j-1)] + u[i*N + (j+1)] );
    }
}
// launch: dim3 block(16,16); dim3 grid((N+15)/16, (N+15)/16);
//         jacobi_step<<<grid, block>>>(d_u, d_out, N);

Two properties decide whether the GPU actually goes fast:

Coalesced memory access. Consecutive threads in a warp (consecutive j) read consecutive addresses u[i*N + j], so the hardware merges 32 reads into a few wide memory transactions. Storing the grid row-major and mapping threadIdx.x to the column is what makes the access coalesced — column-major mapping would scatter the reads and tank bandwidth.
Why it is still memory-bound. Each cell does ~5 flops but loads 5 doubles (40 bytes). The GPU has enormous compute throughput but the kernel can't feed it — performance is capped by memory bandwidth, not the ALUs. Using shared memory to stage each block's tile (so the 4 neighbour reads hit fast on-chip memory instead of global memory) is the standard optimisation, raising the effective arithmetic intensity.

Same conclusion, different hardware CPU or GPU, the stencil's low flop-to-byte ratio means bandwidth, not arithmetic, is the ceiling — which is exactly what the roofline analysis next will quantify. Explore the launch geometry in the GPU / SIMT demo. The GPU is today's mainstream accelerator; the same "specialised hardware for a specific kernel shape" logic is what the field is now exploring with quantum accelerators for problems no classical stencil can express.

6 · Performance analysis

Implementation is half the project; the other half is measuring and explaining what you built. We report strong scaling (fixed problem, more cores), weak scaling (problem grows with cores), the Karp–Flatt experimental serial fraction, and where the kernel lands on the roofline. The tables below are modelled from the $p = 0.97$ kernel and a representative node memory bandwidth.

Strong scaling — fixed 4096² grid

Same grid, more cores. Speedup $S(P) = T_1/T_P$, efficiency $E = S/P$. This is the Amdahl regime, so $E$ falls as the serial residual test starts to dominate.

cores P	time T (s)	speedup S	efficiency E
1	240.0	1.0×	100%
2	123.7	1.94×	97%
4	65.2	3.68×	92%
8	36.4	6.59×	82%
16	21.8	11.0×	69%
32	14.7	16.3×	51%
64	10.9	22.0×	34%

Weak scaling — 4096² cells per core

Grid grows with $P$ so each core always holds a $4096^2$ tile. Ideal weak scaling keeps step time flat; the slow creep here is the $O(\log P)$ Allreduce plus halo cost.

cores P	grid	step time (ms)	weak efficiency
1	4096²	52.0	100%
4	8192²	53.1	98%
16	16384²	54.4	96%
64	32768²	56.2	93%
256	65536²	58.6	89%

worked example Karp–Flatt experimental serial fraction

The Karp–Flatt metric recovers the real serial fraction $e$ from measured speedup, capturing overheads Amdahl's idealised $s$ ignores: $\displaystyle e = \frac{1/S - 1/P}{1 - 1/P}$. Use the strong-scaling row at $P = 8$, $S = 6.59$:

numerator: $1/S - 1/P = 1/6.59 - 1/8 = 0.1517 - 0.1250 = 0.0267$
denominator: $1 - 1/P = 1 - 0.125 = 0.875$
result: $e = 0.0267 / 0.875 = \mathbf{0.0305}$ (≈ 3.0%)
check at $P = 64$, $S = 22.0$: $e = (0.04545 - 0.015625)/0.9844 = \mathbf{0.0303}$

$e \approx 0.030$ stays flat as $P$ grows, which is the signature of a genuine serial fraction (our residual test) rather than parallel-overhead growth. If $e$ had risen with $P$, the lost efficiency would point to communication overhead instead — the diagnostic value of Karp–Flatt. Compare both tables live in the scaling-laws demo.

worked example Roofline — arithmetic intensity of the stencil

Arithmetic intensity $I$ = flops ÷ bytes moved from DRAM. Per interior cell the 5-point Jacobi does 4 adds + 1 multiply = 5 flops. With double precision and the unavoidable streaming traffic (read the cell's source values + write the result; with good caching the four neighbours are reused from adjacent cells, so the irreducible traffic is ~1 read + 1 write = 16 bytes):

flops/cell: 5
bytes/cell (best case, cached neighbours): 8 (read) + 8 (write) = 16 B
intensity: $I = 5 / 16 = \mathbf{0.31}$ flop/byte (naïve, uncached: $5/40 = 0.125$)
machine balance: a node with 200 GFLOP/s and 100 GB/s has ridge point $I^* = 200/100 = 2.0$ flop/byte

Since $I = 0.31 \ll I^* = 2.0$, the kernel sits firmly on the bandwidth-bound slope of the roofline, far left of the ridge. Attainable performance $= I \times \text{BW} = 0.31 \times 100 = \mathbf{31}$ GFLOP/s — only ~16% of peak, and no amount of faster arithmetic helps. The lever is bandwidth: cache blocking, shared-memory tiling on the GPU, or fusing timesteps to raise $I$. See the roofline demo and the memory-hierarchy demo.

Diagnosis Strong scaling is capped by a flat 3% serial fraction (Karp–Flatt); weak scaling holds above 89% to 256 cores; and the roofline proves the per-core ceiling is memory bandwidth, not flops. The single most valuable optimisation is therefore reducing memory traffic, not adding arithmetic units.

7 · Robustness at scale

A simulation that runs for hours on hundreds of nodes will meet a node failure, a queue time limit, or a full disk. Production HPC code is judged as much on surviving these as on raw speed — two mechanisms make the project robust.

Checkpoint / restart. Every $K$ steps each rank writes its tile and the current iteration number to disk. On restart the program detects the latest complete checkpoint and resumes from it instead of step zero. The cost is a tunable trade-off: checkpoint too often and I/O dominates; too rarely and a crash wastes hours of compute. Young's formula gives the optimal interval $\tau_{\text{opt}} = \sqrt{2\,\delta\,M}$, where $\delta$ is the time to write one checkpoint and $M$ is the mean time between failures.
Parallel I/O. Having all ranks funnel their data to rank 0 to write a single file re-introduces a serial bottleneck — the very thing Amdahl warned about. Instead every rank writes its own region of one shared file concurrently with MPI_File_write_at_all (MPI-IO), so write bandwidth scales with the parallel filesystem (e.g. Lustre) rather than a single stream.

worked example Young's optimal checkpoint interval

A run on 256 nodes has a per-node MTBF of 5 years, so the system MTBF is $M = 5\text{ yr} / 256 \approx 6.4 \times 10^{5}$ s ≈ 178 h. A full checkpoint takes $\delta = 60$ s. Then $\tau_{\text{opt}} = \sqrt{2 \delta M}$:

M in seconds: $5 \times 365 \times 24 \times 3600 / 256 \approx 6.16 \times 10^{5}$ s
2·δ·M: $2 \times 60 \times 6.16\times10^{5} = 7.39 \times 10^{7}$
τ_opt: $\sqrt{7.39\times10^{7}} \approx 8597$ s ≈ 2.4 hours

So checkpoint roughly every 2.4 h, not every few minutes. As the job scales to more nodes the system MTBF shrinks ($M \propto 1/\text{nodes}$), so $\tau_{\text{opt}}$ shrinks too — at extreme scale checkpointing becomes a first-order design concern, not an afterthought.

8 · Scaling out to distributed ML

The syllabus extends HPC into distributed machine learning, and the bridge is tighter than it looks: training a large neural network is the same communication problem as our halo exchange, dressed in different vocabulary. Two parallel strategies mirror the two ways we split the stencil.

Data parallelism ≈ domain decomposition. Every worker holds a full copy of the model but processes a different shard of the training batch — exactly as every MPI rank ran the identical stencil on a different tile. Each worker computes gradients locally, then all workers average them with a gradient all-reduce before stepping. That all-reduce is the same MPI_Allreduce (typically ring- or tree-based) we used for the global residual — only now the reduced quantity is a vector of gradients instead of one scalar.
Model parallelism ≈ halo exchange. When a model is too big for one device it is split across devices, and the activations at each split boundary must be sent to the next device every forward/backward pass — a nearest-neighbour exchange of boundary data, structurally identical to passing halo rows between neighbouring tiles.

The same analysis tools carry over wholesale. Amdahl/Gustafson bound how much faster training gets with more GPUs; the $\alpha$–$\beta$ model predicts whether the gradient all-reduce (which grows with model size) will dominate; and the roofline tells you whether a given layer is compute- or memory-bound. A student who understands the Jacobi stencil already understands the skeleton of distributed SGD.

The unifying idea Computation is local; correctness is global. Whether the global quantity is a convergence residual, a reduced gradient, or a boundary activation, the cost of agreeing on it across processors is what governs how far any HPC workload — physics or ML — can scale. Relate it back to the MapReduce demo: map = local stencil sweep, reduce = the all-reduce.

9 · Results & reflection

A sample modelled run, end to end — serial baseline, then the same $4096^2$ grid on 8 OpenMP threads, then a $32768^2$ weak-scaled grid on 64 MPI ranks across 8 nodes:

$ python jacobi.py            # serial baseline, 4096^2
converged in 1840 iters, residual 9.7e-05   wall 240.1 s

$ OMP_NUM_THREADS=8 ./jacobi_omp   # 1 node, 8 cores, 4096^2
converged in 1840 iters, residual 9.7e-05   wall 36.4 s  (6.6x)

$ mpirun -np 64 ./jacobi_mpi       # 8 nodes, 32768^2 weak scaled
grid 32768x32768  tiles 8x8  halo 8/step
converged in 1841 iters, residual 9.9e-05   step 56.2 ms  weak-eff 93%

# strong-scaling on the fixed grid, summarised:
$ cat scaling.txt
P=1  240.0s  1.0x  |  P=8  36.4s  6.6x  |  P=64  10.9s  22.0x

What worked. The kernel parallelised cleanly: OpenMP delivered a near-linear 6.6× on one socket with a 0.2% load imbalance, and MPI weak-scaling held above 89% to 256 cores because the communication is nearest-neighbour (halo) plus a logarithmic collective (Allreduce). Tree reductions and non-blocking, overlapped halo exchange kept the serial fraction pinned at the ~3% Karp–Flatt floor.

The bottleneck. The roofline analysis was decisive: at $I = 0.31$ flop/byte the stencil is memory-bandwidth-bound, reaching only ~16% of compute peak. Faster arithmetic is wasted; the real wins come from cache blocking and on-chip tiling that cut DRAM traffic. The secondary limit is the global residual test — a true serial fraction that caps strong scaling at ~33× no matter the core count.

Key idea A successful HPC project is not "make it use all the cores" — it is a chain of evidence: profile to find the hot loop, model (Amdahl/Gustafson) to set expectations, implement at the right level (OpenMP → MPI → GPU), measure (strong/weak scaling, Karp–Flatt), and explain the ceiling (roofline). For this stencil every link in that chain pointed at the same culprit — memory bandwidth — and that diagnosis, not the raw speedup number, is the deliverable.

Mapping project stages to learning outcomes

How each stage of the build demonstrates a syllabus outcome and which live demo animates it:

Project stage	Learning outcome demonstrated	Demo
Problem & baseline	Model a real problem; profile to find the hot loop	roofline, memory
Amdahl / Gustafson	Quantify the speedup ceiling before optimising	amdahl, scaling
OpenMP threading	Shared-memory parallelism, scheduling, NUMA, false sharing	openmp, flynn
MPI + halo exchange	Distributed memory, decomposition, collectives, comm cost	mpi
GPU offload	SIMT, coalescing, accelerator memory model	gpu
Performance analysis	Strong/weak scaling, Karp–Flatt, roofline placement	scaling, roofline
Robustness at scale	Checkpoint/restart, parallel I/O, fault tolerance	—
Scaling out to ML	Data vs model parallelism; gradient all-reduce	mapreduce

design implementation optimization analysis full course outline →

References

Introduction to High Performance Computing for Scientists and Engineers

Georg Hager & Gerhard Wellein (2010). CRC Press. ISBN 9781439811924.

The roofline model, stencil performance, NUMA/first-touch and the memory-bound analysis used throughout sections 3, 5 and 6 trace directly to this text.
Using MPI: Portable Parallel Programming with the Message-Passing Interface, 3rd ed.

William Gropp, Ewing Lusk & Anthony Skjellum (2014). MIT Press. ISBN 9780262527392.

Domain decomposition, ghost/halo exchange with non-blocking sends, and collective (Allreduce) cost models underpin section 4.
Programming Massively Parallel Processors, 4th ed.

Wen-mei Hwu, David Kirk & Izzat El Hajj (2022). Morgan Kaufmann. ISBN 9780323912310.

SIMT execution, memory coalescing and shared-memory tiling of stencils inform the GPU offload in section 5.
Roofline: An Insightful Visual Performance Model for Multicore Architectures

Samuel Williams, Andrew Waterman & David Patterson (2009). Comm. ACM 52(4), 65–76.

The arithmetic-intensity / ridge-point analysis applied to the Jacobi kernel in section 6.
hpc-lab — interactive demos & course outline

This site.

The interactive demos animate Amdahl/Gustafson, scaling, roofline, the memory hierarchy, Flynn's taxonomy, OpenMP, MPI, GPU and MapReduce; the course outline maps every session.