hpc-lab parallel · scaling · mpi · openmp · gpu
Course map

Syllabus roadmap — 6 modules, 30 sessions

This companion follows the official High Performance Computing syllabus (BCSAI, IE University) module by module. Every topic below links to the live demo that makes it tangible — from the speedup laws that bound every parallel program, through the roofline model and the memory hierarchy, to MPI collectives, OpenMP scheduling, GPU warps and a first look at qubits. 📄 Full syllabus (PDF)

1 Foundational knowledge & skills

  • Why HPC — limits of a single core speedup ↗
  • Flynn's taxonomy of architectures flynn ↗
  • The memory hierarchy & locality memory ↗
  • Sessions S1–S5

2 Parallel computing essentials

3 Performance optimization

4 Parallel algorithms & ML

  • MapReduce & data-parallel patterns mapreduce ↗
  • Collective communication mpi ↗
  • Ring all-reduce for distributed ML mpi ↗
  • Sessions S16–S20

5 Advanced parallel programming

  • Message passing with MPI mpi ↗
  • GPU SIMT, warps & divergence gpu ↗
  • Occupancy & lane utilisation gpu ↗
  • Sessions S21–S25

6 Frontiers of HPC

  • Quantum computing & qubits quantum ↗
  • Single-qubit gates H · X · Z quantum ↗
  • Exascale & future directions
  • Sessions S26–S30
Module 2 · Parallel essentials

1. Amdahl's & Gustafson's laws

If a fraction $p$ of a program is parallelisable, Amdahl caps the speedup on $n$ cores at $S(n)=\frac{1}{(1-p)+p/n}$, approaching $\frac{1}{1-p}$ no matter how many cores you add — the serial part dominates. Gustafson reframes it for problems that grow with the machine: $S(n)=n-(1-p)(n-1)$, which keeps scaling. Slide $p$ and $n$ and watch the two curves diverge.

Amdahl speedup
Gustafson speedup
max speedup (n→∞)
efficiency

n is stepped on a power-of-two scale (1…1024).

Module 2 · Parallel essentials

2. Strong vs weak scaling & Karp–Flatt

Measured speedup is $S=t_\text{base}/t$; efficiency is $E=S/n$. The Karp–Flatt metric recovers the experimentally-determined serial fraction $e=\frac{1/S-1/n}{1-1/n}$ — if $e$ grows with $n$, communication overhead (not just the serial part) is hurting you. The dashed curve adds a communication penalty $c$ that bends real speedup away from the ideal linear line.

speedup
parallel efficiency
Karp–Flatt serial fraction
Module 3 · Performance

3. The roofline model

A kernel's attainable rate is capped by either compute or memory: $P=\min(\text{peak},\ \text{BW}\times I)$, where $I$ is arithmetic intensity in FLOP/byte. Below the ridge point $I^\star=\text{peak}/\text{BW}$ you are memory-bound (on the sloped roof); above it, compute-bound (under the flat ceiling). Move the kernel's intensity and see which roof you hit.

attainable
ridge point I★
regime

Intensity slider is log₂ scale (FLOP/byte).

Module 3 · Performance

4. Memory hierarchy & AMAT

Access cost climbs by orders of magnitude as you descend registers → L1 → L2 → L3 → DRAM, so hit rates near the top dominate. Average Memory Access Time weights each level's latency by the probability of missing every level above it: $\text{AMAT}=\sum_i \ell_i\prod_{j<i} m_j$. Raise the cache hit rates and watch AMAT collapse toward L1 latency.

AMAT
DRAM-only baseline
speedup vs DRAM-only

Latencies are illustrative (ns), not pinned to one machine.

Module 1 · Foundations

5. Flynn's taxonomy

Flynn classifies architectures by how many instruction streams and data streams flow at once. The four cells — SISD, SIMD, MISD, MIMD — span everything from a scalar CPU to a GPU's vector lanes to a whole cluster. Toggle the two stream axes and the matching quadrant lights up with real hardware examples.

class
name

Module 5 · Advanced

6. MPI collective communication

A linear broadcast sends from the root to each of the other ranks in turn — $p-1$ steps. A tree broadcast doubles the number of senders each round, finishing in $\lceil\log_2 p\rceil$ steps. The same logarithmic idea powers reductions and the ring all-reduce that synchronises gradients in distributed training. Slide the process count and compare the patterns.

linear broadcast
tree broadcast
ring all-reduce
recursive doubling

Top: linear (root → each). Bottom: tree (log₂ rounds).

Module 3 · Performance

7. OpenMP loop scheduling

A #pragma omp parallel for splits loop iterations across threads. static hands out fixed chunks up front; dynamic and guided dole work out on demand. When the per-iteration cost is uneven (triangular), static leaves some threads idle — load imbalance, the ratio $\text{maxLoad}/\text{meanLoad}$, climbs above 1. Dynamic and guided keep it near 1.

load imbalance

Each row is a thread; blocks are its iterations.

Module 5 · Advanced

8. GPU SIMT warps & divergence

A GPU runs threads in lock-step groups of 32 called warps. A block of threads needs $\lceil t/32\rceil$ warps; if $t$ isn't a multiple of 32 the last warp wastes its spare lanes. Worse, when threads in one warp take different branches the hardware serialises the paths, so divergence efficiency falls to $1/\text{paths}$. Slide the thread count and branch paths to see both costs.

warps needed
wasted lanes
divergence efficiency

32 lanes per warp; grey lanes are wasted.

Module 4 · Parallel algorithms

9. MapReduce word count

MapReduce is the canonical data-parallel pattern: map turns each input record into key/value pairs, shuffle groups pairs by key, and reduce folds each group to one result. The classic example is word count — the map and reduce phases run embarrassingly in parallel, the shuffle is the only communication. Edit the text and watch the three phases recompute.

documents
distinct words
top word

map → shuffle → reduce, drawn as three columns.

Module 6 · Frontiers

10. Qubits & quantum gates

A qubit's state is a vector $\alpha|0\rangle+\beta|1\rangle$ with $|\alpha|^2+|\beta|^2=1$; measuring it yields 0 or 1 with those probabilities. Gates are matrices: $X$ flips, $Z$ phases, and the Hadamard $H$ builds an equal superposition. Apply a sequence of gates to $|0\rangle$ and watch the amplitude and probability bars update.

α (amp of |0⟩)
β (amp of |1⟩)
P(0) · P(1)

gate sequence: (none)