hpc-lab example project · parallelising a 2-D stencil end to end

Example project — parallelising a scientific workload, end to end

This page is a worked, start-to-finish HPC case study: it takes one concrete scientific kernel and carries it through every stage the syllabus's group project asks for — "design, implementation, and optimization of HPC solutions for real-world problems". We pick a problem that is relatable, physically meaningful, and structurally rich enough to exercise the whole course: a 2-D heat-diffusion simulation solved with a 5-point Jacobi stencil on a regular grid.

The kernel is ideal because it is embarrassingly parallel at the level of a single cell, yet has real communication structure the moment you split the grid across processors: each tile needs a one-cell-thick border of its neighbours' data (the halo) every step. That single fact is what turns a toy into a genuine HPC problem — it forces us through Amdahl's law, OpenMP threading, MPI domain decomposition and halo exchange, GPU offload, the roofline model, and even the distributed-ML module, because a gradient all-reduce is the same communication pattern as a halo exchange. Every stage below cross-links to the matching live demo on the demos page.

kernel
2-D Jacobi
grid
N × N = 4096²
baseline
Python / NumPy
shared mem
OpenMP
distributed
MPI
accelerator
CUDA
bottleneck
memory-bound
maps to
whole syllabus

The nine stages of the project, each a section below — with its live demo anchor:

A note on the numbers. The scaling figures in this study are modelled / expected, derived from the kernel's arithmetic intensity and well-known scaling laws — not measurements from a real cluster run. They are internally consistent and chosen to be representative of what a memory-bound stencil actually does, which is exactly the kind of analysis the project deliverable rewards.

1 · Problem & baseline

Heat spreads through a plate over time. The governing PDE is the heat equation, which says the rate of change of temperature at a point is proportional to how curved the temperature field is there (its Laplacian):

$$ \frac{\partial u}{\partial t} = \alpha \nabla^2 u = \alpha\!\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) $$

We discretise space into an $N \times N$ grid with spacing $h$ and march in time with step $\Delta t$. Replacing each second derivative with the standard central difference and choosing the timestep so the scheme is the simplest stable explicit one ($\alpha\,\Delta t / h^2 = \tfrac14$) collapses the update to a clean average of the four orthogonal neighbours — the 5-point Jacobi stencil:

$$ u^{t+1}_{i,j} = \tfrac14\left(u^{t}_{i-1,j} + u^{t}_{i+1,j} + u^{t}_{i,j-1} + u^{t}_{i,j+1}\right) $$

Every cell at step $t{+}1$ reads only the previous step's values, so the entire grid can be updated independently — the source of all the parallelism we are about to exploit. We keep two buffers and swap them (ping-pong) so reads and writes never alias. Convergence is declared when the largest change between steps, the residual $r = \max_{i,j}\,|u^{t+1}_{i,j} - u^{t}_{i,j}|$, falls below a tolerance.

jacobi.py — serial NumPy baseline

import numpy as np

def jacobi(u, tol=1e-4, max_iter=20000):
    # u: (N, N) float64 grid; fixed Dirichlet values on the border.
    new = u.copy()
    for it in range(max_iter):
        # vectorised 5-point average of the interior cells
        new[1:-1, 1:-1] = 0.25 * (
            u[:-2, 1:-1] + u[2:, 1:-1] +     # north + south
            u[1:-1, :-2] + u[1:-1, 2:])      # west  + east
        res = np.max(np.abs(new[1:-1, 1:-1] - u[1:-1, 1:-1]))
        u, new = new, u                      # ping-pong the buffers
        if res < tol:
            break
    return u, it, res

Where the time goes. Before optimising anything, profile. Under cProfile the whole runtime sits inside jacobi(); switching to line_profiler (the @profile decorator + kernprof -l) attributes it to a single line — the stencil average. That line is the hot loop: roughly 96–98% of wall-clock for a large grid. Everything else (the residual reduction, the buffer swap, the convergence test) is the serial remainder.

$ kernprof -l -v jacobi.py
Line #   % Time  Line Contents
=========================================
    7     0.4    new = u.copy()
    9    96.8    new[1:-1,1:-1] = 0.25 * ( ... )   # <-- the hot loop
   12     2.4    res = np.max(np.abs(new - u))
   13     0.3    u, new = new, u
   14     0.1    if res < tol: break

The arithmetic per cell is tiny — 4 additions and 1 multiply — but each cell touches 4 neighbours plus itself. That ratio of flops to bytes moved is what will make this kernel memory-bound, a conclusion we will pin down precisely in the performance analysis via the roofline demo.

work
The 5-point stencil sweep over all interior cells — the parallelisable 96.8%.
reduce
Global max-residual over the grid — a reduction, becomes a collective under MPI.
control
Buffer swap + convergence test — the inherently serial remainder.

2 · Is it worth parallelising?

Never parallelise on faith — estimate the ceiling first. Of the total work, the fraction that can run in parallel is $p$; the rest, $s = 1 - p$, is stubbornly serial. From the profile the stencil sweep is the parallel part. Being conservative and folding the residual reduction's setup into the serial share, take $p = 0.97$, so $s = 0.03$. Amdahl's law bounds the speedup on $P$ processors when the problem size is fixed:

$$ S_{\text{Amdahl}}(P) = \frac{1}{s + \dfrac{p}{P}} = \frac{1}{(1-p) + \dfrac{p}{P}} \qquad\xrightarrow[P\to\infty]{}\qquad \frac{1}{s} $$

worked example Amdahl ceiling at $p = 0.97$

Plug $s = 0.03$, $p = 0.97$ into $S(P) = 1/(0.03 + 0.97/P)$ for a fixed $4096^2$ grid:

P (cores)denominatorspeedup Sefficiency S/P
80.03 + 0.121 = 0.1516.6×83%
160.03 + 0.061 = 0.09111.0×69%
640.03 + 0.015 = 0.04522.1×35%
2560.03 + 0.0038 = 0.033829.6×12%
0.0333.3×→ 0%

Even with infinite cores a 3% serial fraction caps us at 33.3×. By 64 cores we are already at one-third efficiency — the serial residual test is becoming the wall. Takeaway: shaving the serial part (a parallel reduction instead of a global one) matters more than adding cores.

But the picture changes if the problem grows. In real HPC we rarely keep the grid fixed — we use more cores to solve a bigger problem in the same time. That is Gustafson's law, which measures scaled speedup and is far more optimistic:

$$ S_{\text{Gustafson}}(P) = s + p\cdot P = P - s\,(P - 1) $$

worked example Gustafson scaled speedup at $s = 0.03$

With $s = 0.03$, $p = 0.97$: $S(P) = 0.03 + 0.97P$. Each core handles its own tile, so the parallel work grows with $P$ while the serial part stays roughly constant.

P (cores)0.03 + 0.97·Pscaled speedupefficiency
80.03 + 7.767.8×97%
640.03 + 62.162.1×97%
2560.03 + 248.3248.3×97%

Under weak scaling the efficiency holds near 97% all the way up. Reconciliation: Amdahl answers "faster on the same grid?" (limited), Gustafson answers "bigger grid in the same time?" (scales). Our stencil is a Gustafson workload — we grow $N$ with the core count, so the project targets weak scaling.

Decision  A 97% parallel kernel is worth parallelising — but only if we also keep the problem growing. We will report both a strong-scaling table (Amdahl regime, fixed grid) and a weak-scaling table (Gustafson regime, grid grows with cores) in section 6. Explore the curves in the Amdahl / Gustafson demo.

3 · Shared-memory parallelism with OpenMP

The first real implementation step targets a single multi-core node. All threads share the same grid in memory, so we do not move data — we just divide the rows of the sweep among threads. In C, one #pragma omp parallel for over the outer loop expresses the whole thing; OpenMP forks a team of threads, splits the iteration space, and joins at the end of the loop. This is the SIMD/SPMD idea: many threads, one instruction stream, different data. Compare shared-memory vs message-passing models in the Flynn's taxonomy demo.

jacobi_omp.c — threaded stencil sweep

#pragma omp parallel for collapse(2) schedule(static) \
        reduction(max:res)
for (int i = 1; i < N-1; i++) {
    for (int j = 1; j < N-1; j++) {
        double v = 0.25 * (u[i-1][j] + u[i+1][j] +
                            u[i][j-1] + u[i][j+1]);
        double d = fabs(v - u[i][j]);
        if (d > res) res = d;          /* combined by reduction(max) */
        new[i][j] = v;
    }
}
/* swap u and new pointers, then test res < tol */

Scheduling. For a uniform stencil every row costs the same, so schedule(static) — a fixed, contiguous block of rows per thread — is exactly right: it has zero scheduling overhead and keeps each thread working on a contiguous, cache-friendly slab. schedule(dynamic) would only pay off if cell costs were irregular (e.g. an adaptive grid); here its per-chunk bookkeeping is pure overhead.

worked example Static load balance, 4096 rows on 8 threads

The sweep updates $N - 2 = 4094$ interior rows. schedule(static) hands each of 8 threads $\lfloor 4094/8 \rfloor = 511$ rows, with the first $4094 \bmod 8 = 6$ threads taking one extra.

  1. threads 0–5: 512 rows each  →  6 × 512 = 3072 rows
  2. threads 6–7: 511 rows each  →  2 × 511 = 1022 rows
  3. total: 3072 + 1022 = 4094 rows ✓, max imbalance = 1 row = 0.2%

Imbalance under 1% means the static split is essentially perfect for this regular grid — no reason to reach for dynamic scheduling. The slowest thread sets the step time, so 0.2% imbalance costs us at most 0.2% of the parallel speedup.

Two shared-memory hazards the project must address explicitly:

Result  On one 8-core socket we expect a near-linear ~7× step-time speedup (matching the Amdahl 8-core figure of 6.6× once memory bandwidth is accounted for). See the OpenMP threads demo.

4 · Distributed-memory parallelism with MPI

One node runs out of cores and memory eventually. To use a whole cluster, each process owns a private slice of the grid in its own address space and the only way to share data is to send messages. We split the $N \times N$ grid by domain decomposition into a $\sqrt{P} \times \sqrt{P}$ array of tiles, one per MPI rank. A rank can update its interior alone — but the cells on a tile's edge need the neighbouring tile's edge values, which live in another process's memory.

Halo / ghost cells. Each tile is padded with a one-cell-thick border — the halo — that holds a copy of the neighbour's edge. Before every Jacobi step, neighbours exchange these borders. After the exchange, every cell has all four neighbours locally and the sweep proceeds exactly as in the serial code. This is the heartbeat of the program.

jacobi_mpi.c — one timestep on each rank

/* 1. exchange halos with the 4 cartesian neighbours (non-blocking) */
MPI_Irecv(halo_top,    N_loc, MPI_DOUBLE, nbr_up,   0, comm, &r[0]);
MPI_Irecv(halo_bottom, N_loc, MPI_DOUBLE, nbr_down, 1, comm, &r[1]);
MPI_Isend(row_first,   N_loc, MPI_DOUBLE, nbr_up,   1, comm, &r[2]);
MPI_Isend(row_last,    N_loc, MPI_DOUBLE, nbr_down, 0, comm, &r[3]);
/* ... same for left/right neighbours ... */
MPI_Waitall(8, r, MPI_STATUSES_IGNORE);

/* 2. local Jacobi sweep over this rank's tile (halos now valid) */
double res_local = stencil_sweep(u, new, n_loc, m_loc);

/* 3. global convergence: combine every rank's residual */
MPI_Allreduce(&res_local, &res_global, 1, MPI_DOUBLE, MPI_MAX, comm);
if (res_global < tol) break;

Two communication costs dominate, and both are worth modelling with the standard $\alpha$–$\beta$ model where $\alpha$ is per-message latency and $\beta$ is per-byte inverse bandwidth:

worked example Tree vs linear Allreduce at $P = 1024$

A scalar MPI_Allreduce over $P$ ranks. Count the communication rounds (each round $\approx \alpha$ latency) for the linear and the tree algorithms.

Plinear (P−1)tree (2·⌈log₂P⌉)speedup
161581.9×
6463125.3×
102410232051×

At 1024 ranks the tree does its reduce-then-broadcast in $2\lceil\log_2 1024\rceil = 20$ latency-bound rounds versus 1023 for the linear chain — a 51× reduction in latency cost. This is why the residual test, which Amdahl flagged as the serial bottleneck, must use a collective, never a hand-rolled gather-to-rank-0.

Overlapping communication and computation. The non-blocking MPI_Isend/Irecv above let us start the halo exchange, immediately update the tile's interior (which doesn't need halos) while messages are in flight, then update the thin boundary ring once MPI_Waitall returns. Done well this hides almost all communication latency behind useful work. Walk the exchange step-by-step in the MPI halo-exchange demo; the message-passing model itself is the right column of the Flynn taxonomy demo.

5 · GPU offload

A stencil is a near-perfect fit for a GPU: thousands of identical, independent cell updates map directly onto thousands of hardware threads. We launch one CUDA thread per interior cell, grouped into 2-D thread blocks (e.g. $16 \times 16 = 256$ threads) tiled across the grid. The GPU runs threads in lockstep warps of 32 under the SIMT (Single Instruction, Multiple Thread) model — the GPU corner of Flynn's taxonomy.

jacobi.cu — one thread per cell

__global__ void jacobi_step(const double* u, double* out, int N) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (i > 0 && i < N-1 && j > 0 && j < N-1) {
        out[i*N + j] = 0.25 * ( u[(i-1)*N + j] + u[(i+1)*N + j]
                              + u[i*N + (j-1)] + u[i*N + (j+1)] );
    }
}
// launch: dim3 block(16,16); dim3 grid((N+15)/16, (N+15)/16);
//         jacobi_step<<<grid, block>>>(d_u, d_out, N);

Two properties decide whether the GPU actually goes fast:

Same conclusion, different hardware  CPU or GPU, the stencil's low flop-to-byte ratio means bandwidth, not arithmetic, is the ceiling — which is exactly what the roofline analysis next will quantify. Explore the launch geometry in the GPU / SIMT demo. The GPU is today's mainstream accelerator; the same "specialised hardware for a specific kernel shape" logic is what the field is now exploring with quantum accelerators for problems no classical stencil can express.

6 · Performance analysis

Implementation is half the project; the other half is measuring and explaining what you built. We report strong scaling (fixed problem, more cores), weak scaling (problem grows with cores), the Karp–Flatt experimental serial fraction, and where the kernel lands on the roofline. The tables below are modelled from the $p = 0.97$ kernel and a representative node memory bandwidth.

Strong scaling — fixed 4096² grid

Same grid, more cores. Speedup $S(P) = T_1/T_P$, efficiency $E = S/P$. This is the Amdahl regime, so $E$ falls as the serial residual test starts to dominate.

cores Ptime T (s)speedup Sefficiency E
1240.01.0×100%
2123.71.94×97%
465.23.68×92%
836.46.59×82%
1621.811.0×69%
3214.716.3×51%
6410.922.0×34%

Weak scaling — 4096² cells per core

Grid grows with $P$ so each core always holds a $4096^2$ tile. Ideal weak scaling keeps step time flat; the slow creep here is the $O(\log P)$ Allreduce plus halo cost.

cores Pgridstep time (ms)weak efficiency
14096²52.0100%
48192²53.198%
1616384²54.496%
6432768²56.293%
25665536²58.689%
worked example Karp–Flatt experimental serial fraction

The Karp–Flatt metric recovers the real serial fraction $e$ from measured speedup, capturing overheads Amdahl's idealised $s$ ignores: $\displaystyle e = \frac{1/S - 1/P}{1 - 1/P}$. Use the strong-scaling row at $P = 8$, $S = 6.59$:

  1. numerator: $1/S - 1/P = 1/6.59 - 1/8 = 0.1517 - 0.1250 = 0.0267$
  2. denominator: $1 - 1/P = 1 - 0.125 = 0.875$
  3. result: $e = 0.0267 / 0.875 = \mathbf{0.0305}$  (≈ 3.0%)
  4. check at $P = 64$, $S = 22.0$: $e = (0.04545 - 0.015625)/0.9844 = \mathbf{0.0303}$

$e \approx 0.030$ stays flat as $P$ grows, which is the signature of a genuine serial fraction (our residual test) rather than parallel-overhead growth. If $e$ had risen with $P$, the lost efficiency would point to communication overhead instead — the diagnostic value of Karp–Flatt. Compare both tables live in the scaling-laws demo.

worked example Roofline — arithmetic intensity of the stencil

Arithmetic intensity $I$ = flops ÷ bytes moved from DRAM. Per interior cell the 5-point Jacobi does 4 adds + 1 multiply = 5 flops. With double precision and the unavoidable streaming traffic (read the cell's source values + write the result; with good caching the four neighbours are reused from adjacent cells, so the irreducible traffic is ~1 read + 1 write = 16 bytes):

  1. flops/cell: 5
  2. bytes/cell (best case, cached neighbours): 8 (read) + 8 (write) = 16 B
  3. intensity: $I = 5 / 16 = \mathbf{0.31}$ flop/byte  (naïve, uncached: $5/40 = 0.125$)
  4. machine balance: a node with 200 GFLOP/s and 100 GB/s has ridge point $I^* = 200/100 = 2.0$ flop/byte

Since $I = 0.31 \ll I^* = 2.0$, the kernel sits firmly on the bandwidth-bound slope of the roofline, far left of the ridge. Attainable performance $= I \times \text{BW} = 0.31 \times 100 = \mathbf{31}$ GFLOP/s — only ~16% of peak, and no amount of faster arithmetic helps. The lever is bandwidth: cache blocking, shared-memory tiling on the GPU, or fusing timesteps to raise $I$. See the roofline demo and the memory-hierarchy demo.

Diagnosis  Strong scaling is capped by a flat 3% serial fraction (Karp–Flatt); weak scaling holds above 89% to 256 cores; and the roofline proves the per-core ceiling is memory bandwidth, not flops. The single most valuable optimisation is therefore reducing memory traffic, not adding arithmetic units.

7 · Robustness at scale

A simulation that runs for hours on hundreds of nodes will meet a node failure, a queue time limit, or a full disk. Production HPC code is judged as much on surviving these as on raw speed — two mechanisms make the project robust.

worked example Young's optimal checkpoint interval

A run on 256 nodes has a per-node MTBF of 5 years, so the system MTBF is $M = 5\text{ yr} / 256 \approx 6.4 \times 10^{5}$ s ≈ 178 h. A full checkpoint takes $\delta = 60$ s. Then $\tau_{\text{opt}} = \sqrt{2 \delta M}$:

  1. M in seconds: $5 \times 365 \times 24 \times 3600 / 256 \approx 6.16 \times 10^{5}$ s
  2. 2·δ·M: $2 \times 60 \times 6.16\times10^{5} = 7.39 \times 10^{7}$
  3. τ_opt: $\sqrt{7.39\times10^{7}} \approx 8597$ s ≈ 2.4 hours

So checkpoint roughly every 2.4 h, not every few minutes. As the job scales to more nodes the system MTBF shrinks ($M \propto 1/\text{nodes}$), so $\tau_{\text{opt}}$ shrinks too — at extreme scale checkpointing becomes a first-order design concern, not an afterthought.

8 · Scaling out to distributed ML

The syllabus extends HPC into distributed machine learning, and the bridge is tighter than it looks: training a large neural network is the same communication problem as our halo exchange, dressed in different vocabulary. Two parallel strategies mirror the two ways we split the stencil.

The same analysis tools carry over wholesale. Amdahl/Gustafson bound how much faster training gets with more GPUs; the $\alpha$–$\beta$ model predicts whether the gradient all-reduce (which grows with model size) will dominate; and the roofline tells you whether a given layer is compute- or memory-bound. A student who understands the Jacobi stencil already understands the skeleton of distributed SGD.

The unifying idea  Computation is local; correctness is global. Whether the global quantity is a convergence residual, a reduced gradient, or a boundary activation, the cost of agreeing on it across processors is what governs how far any HPC workload — physics or ML — can scale. Relate it back to the MapReduce demo: map = local stencil sweep, reduce = the all-reduce.

9 · Results & reflection

A sample modelled run, end to end — serial baseline, then the same $4096^2$ grid on 8 OpenMP threads, then a $32768^2$ weak-scaled grid on 64 MPI ranks across 8 nodes:

$ python jacobi.py            # serial baseline, 4096^2
converged in 1840 iters, residual 9.7e-05   wall 240.1 s

$ OMP_NUM_THREADS=8 ./jacobi_omp   # 1 node, 8 cores, 4096^2
converged in 1840 iters, residual 9.7e-05   wall 36.4 s  (6.6x)

$ mpirun -np 64 ./jacobi_mpi       # 8 nodes, 32768^2 weak scaled
grid 32768x32768  tiles 8x8  halo 8/step
converged in 1841 iters, residual 9.9e-05   step 56.2 ms  weak-eff 93%

# strong-scaling on the fixed grid, summarised:
$ cat scaling.txt
P=1  240.0s  1.0x  |  P=8  36.4s  6.6x  |  P=64  10.9s  22.0x

What worked. The kernel parallelised cleanly: OpenMP delivered a near-linear 6.6× on one socket with a 0.2% load imbalance, and MPI weak-scaling held above 89% to 256 cores because the communication is nearest-neighbour (halo) plus a logarithmic collective (Allreduce). Tree reductions and non-blocking, overlapped halo exchange kept the serial fraction pinned at the ~3% Karp–Flatt floor.

The bottleneck. The roofline analysis was decisive: at $I = 0.31$ flop/byte the stencil is memory-bandwidth-bound, reaching only ~16% of compute peak. Faster arithmetic is wasted; the real wins come from cache blocking and on-chip tiling that cut DRAM traffic. The secondary limit is the global residual test — a true serial fraction that caps strong scaling at ~33× no matter the core count.

Key idea  A successful HPC project is not "make it use all the cores" — it is a chain of evidence: profile to find the hot loop, model (Amdahl/Gustafson) to set expectations, implement at the right level (OpenMP → MPI → GPU), measure (strong/weak scaling, Karp–Flatt), and explain the ceiling (roofline). For this stencil every link in that chain pointed at the same culprit — memory bandwidth — and that diagnosis, not the raw speedup number, is the deliverable.

Mapping project stages to learning outcomes

How each stage of the build demonstrates a syllabus outcome and which live demo animates it:

Project stageLearning outcome demonstratedDemo
Problem & baselineModel a real problem; profile to find the hot loop roofline, memory
Amdahl / GustafsonQuantify the speedup ceiling before optimising amdahl, scaling
OpenMP threadingShared-memory parallelism, scheduling, NUMA, false sharing openmp, flynn
MPI + halo exchangeDistributed memory, decomposition, collectives, comm cost mpi
GPU offloadSIMT, coalescing, accelerator memory model gpu
Performance analysisStrong/weak scaling, Karp–Flatt, roofline placement scaling, roofline
Robustness at scaleCheckpoint/restart, parallel I/O, fault tolerance
Scaling out to MLData vs model parallelism; gradient all-reduce mapreduce
design implementation optimization analysis full course outline →

References