Syllabus roadmap — 6 modules, 30 sessions
This companion follows the official High Performance Computing syllabus (BCSAI, IE University) module by module. Every topic below links to the live demo that makes it tangible — from the speedup laws that bound every parallel program, through the roofline model and the memory hierarchy, to MPI collectives, OpenMP scheduling, GPU warps and a first look at qubits. 📄 Full syllabus (PDF)
1 Foundational knowledge & skills
2 Parallel computing essentials
3 Performance optimization
- The roofline model roofline ↗
- Caches & AMAT memory ↗
- OpenMP loop scheduling & balance openmp ↗
- Sessions S11–S15
4 Parallel algorithms & ML
- MapReduce & data-parallel patterns mapreduce ↗
- Collective communication mpi ↗
- Ring all-reduce for distributed ML mpi ↗
- Sessions S16–S20
5 Advanced parallel programming
1. Amdahl's & Gustafson's laws
If a fraction $p$ of a program is parallelisable, Amdahl caps the speedup on $n$ cores at $S(n)=\frac{1}{(1-p)+p/n}$, approaching $\frac{1}{1-p}$ no matter how many cores you add — the serial part dominates. Gustafson reframes it for problems that grow with the machine: $S(n)=n-(1-p)(n-1)$, which keeps scaling. Slide $p$ and $n$ and watch the two curves diverge.
n is stepped on a power-of-two scale (1…1024).
2. Strong vs weak scaling & Karp–Flatt
Measured speedup is $S=t_\text{base}/t$; efficiency is $E=S/n$. The Karp–Flatt metric recovers the experimentally-determined serial fraction $e=\frac{1/S-1/n}{1-1/n}$ — if $e$ grows with $n$, communication overhead (not just the serial part) is hurting you. The dashed curve adds a communication penalty $c$ that bends real speedup away from the ideal linear line.
3. The roofline model
A kernel's attainable rate is capped by either compute or memory: $P=\min(\text{peak},\ \text{BW}\times I)$, where $I$ is arithmetic intensity in FLOP/byte. Below the ridge point $I^\star=\text{peak}/\text{BW}$ you are memory-bound (on the sloped roof); above it, compute-bound (under the flat ceiling). Move the kernel's intensity and see which roof you hit.
Intensity slider is log₂ scale (FLOP/byte).
4. Memory hierarchy & AMAT
Access cost climbs by orders of magnitude as you descend registers → L1 → L2 → L3 → DRAM, so hit rates near the top dominate. Average Memory Access Time weights each level's latency by the probability of missing every level above it: $\text{AMAT}=\sum_i \ell_i\prod_{j<i} m_j$. Raise the cache hit rates and watch AMAT collapse toward L1 latency.
Latencies are illustrative (ns), not pinned to one machine.
5. Flynn's taxonomy
Flynn classifies architectures by how many instruction streams and data streams flow at once. The four cells — SISD, SIMD, MISD, MIMD — span everything from a scalar CPU to a GPU's vector lanes to a whole cluster. Toggle the two stream axes and the matching quadrant lights up with real hardware examples.
6. MPI collective communication
A linear broadcast sends from the root to each of the other ranks in turn — $p-1$ steps. A tree broadcast doubles the number of senders each round, finishing in $\lceil\log_2 p\rceil$ steps. The same logarithmic idea powers reductions and the ring all-reduce that synchronises gradients in distributed training. Slide the process count and compare the patterns.
Top: linear (root → each). Bottom: tree (log₂ rounds).
7. OpenMP loop scheduling
A #pragma omp parallel for splits loop iterations across threads. static hands out fixed chunks up front; dynamic and guided dole work out on demand. When the per-iteration cost is uneven (triangular), static leaves some threads idle — load imbalance, the ratio $\text{maxLoad}/\text{meanLoad}$, climbs above 1. Dynamic and guided keep it near 1.
Each row is a thread; blocks are its iterations.
8. GPU SIMT warps & divergence
A GPU runs threads in lock-step groups of 32 called warps. A block of threads needs $\lceil t/32\rceil$ warps; if $t$ isn't a multiple of 32 the last warp wastes its spare lanes. Worse, when threads in one warp take different branches the hardware serialises the paths, so divergence efficiency falls to $1/\text{paths}$. Slide the thread count and branch paths to see both costs.
32 lanes per warp; grey lanes are wasted.
9. MapReduce word count
MapReduce is the canonical data-parallel pattern: map turns each input record into key/value pairs, shuffle groups pairs by key, and reduce folds each group to one result. The classic example is word count — the map and reduce phases run embarrassingly in parallel, the shuffle is the only communication. Edit the text and watch the three phases recompute.
map → shuffle → reduce, drawn as three columns.
10. Qubits & quantum gates
A qubit's state is a vector $\alpha|0\rangle+\beta|1\rangle$ with $|\alpha|^2+|\beta|^2=1$; measuring it yields 0 or 1 with those probabilities. Gates are matrices: $X$ flips, $Z$ phases, and the Hadamard $H$ builds an equal superposition. Apply a sequence of gates to $|0\rangle$ and watch the amplitude and probability bars update.
gate sequence: (none)