hpc-lab · visual high-performance computing — speedup, scaling, MPI, OpenMP, GPU & quantum

Course map

Syllabus roadmap — 6 modules, 30 sessions

This companion follows the official High Performance Computing syllabus (BCSAI, IE University) module by module. Every topic below links to the live demo that makes it tangible — from the speedup laws that bound every parallel program, through the roofline model and the memory hierarchy, to MPI collectives, OpenMP scheduling, GPU warps and a first look at qubits. 📄 Full syllabus (PDF)

1 Foundational knowledge & skills

Why HPC — limits of a single core speedup ↗
Flynn's taxonomy of architectures flynn ↗
The memory hierarchy & locality memory ↗
Sessions S1–S5

2 Parallel computing essentials

Amdahl's & Gustafson's laws speedup ↗
Strong vs weak scaling, Karp–Flatt scaling ↗
Shared memory with OpenMP openmp ↗
Sessions S6–S10

3 Performance optimization

The roofline model roofline ↗
Caches & AMAT memory ↗
OpenMP loop scheduling & balance openmp ↗
Sessions S11–S15

4 Parallel algorithms & ML

MapReduce & data-parallel patterns mapreduce ↗
Collective communication mpi ↗
Ring all-reduce for distributed ML mpi ↗
Sessions S16–S20

5 Advanced parallel programming

Message passing with MPI mpi ↗
GPU SIMT, warps & divergence gpu ↗
Occupancy & lane utilisation gpu ↗
Sessions S21–S25

6 Frontiers of HPC

Quantum computing & qubits quantum ↗
Single-qubit gates H · X · Z quantum ↗
Exascale & future directions
Sessions S26–S30

Module 2 · Parallel essentials

1. Amdahl's & Gustafson's laws

If a fraction $p$ of a program is parallelisable, Amdahl caps the speedup on $n$ cores at $S(n)=\frac{1}{(1-p)+p/n}$, approaching $\frac{1}{1-p}$ no matter how many cores you add — the serial part dominates. Gustafson reframes it for problems that grow with the machine: $S(n)=n-(1-p)(n-1)$, which keeps scaling. Slide $p$ and $n$ and watch the two curves diverge.

parallel fraction p 0.95 processors n 64

Amdahl speedup—

Gustafson speedup—

max speedup (n→∞)—

efficiency—

n is stepped on a power-of-two scale (1…1024).

Module 2 · Parallel essentials

2. Strong vs weak scaling & Karp–Flatt

Measured speedup is $S=t_\text{base}/t$; efficiency is $E=S/n$. The Karp–Flatt metric recovers the experimentally-determined serial fraction $e=\frac{1/S-1/n}{1-1/n}$ — if $e$ grows with $n$, communication overhead (not just the serial part) is hurting you. The dashed curve adds a communication penalty $c$ that bends real speedup away from the ideal linear line.

baseline time t_base 100 measured time t 22 processors n 8 comm coefficient c 0.04

speedup—

parallel efficiency—

Karp–Flatt serial fraction—

Module 3 · Performance

3. The roofline model

A kernel's attainable rate is capped by either compute or memory: $P=\min(\text{peak},\ \text{BW}\times I)$, where $I$ is arithmetic intensity in FLOP/byte. Below the ridge point $I^\star=\text{peak}/\text{BW}$ you are memory-bound (on the sloped roof); above it, compute-bound (under the flat ceiling). Move the kernel's intensity and see which roof you hit.

peak compute 2000 GFLOP/s memory bandwidth 200 GB/s arithmetic intensity 4 FLOP/byte

attainable—

ridge point I★—

regime—

Intensity slider is log₂ scale (FLOP/byte).

Module 3 · Performance

4. Memory hierarchy & AMAT

Access cost climbs by orders of magnitude as you descend registers → L1 → L2 → L3 → DRAM, so hit rates near the top dominate. Average Memory Access Time weights each level's latency by the probability of missing every level above it: $\text{AMAT}=\sum_i \ell_i\prod_{j<i} m_j$. Raise the cache hit rates and watch AMAT collapse toward L1 latency.

L1 hit rate 0.90 L2 hit rate 0.80 L3 hit rate 0.70

AMAT—

DRAM-only baseline—

speedup vs DRAM-only—

Latencies are illustrative (ns), not pinned to one machine.

Module 1 · Foundations

5. Flynn's taxonomy

Flynn classifies architectures by how many instruction streams and data streams flow at once. The four cells — SISD, SIMD, MISD, MIMD — span everything from a scalar CPU to a GPU's vector lanes to a whole cluster. Toggle the two stream axes and the matching quadrant lights up with real hardware examples.

instruction streams data streams

class—

name—

Module 5 · Advanced

6. MPI collective communication

A linear broadcast sends from the root to each of the other ranks in turn — $p-1$ steps. A tree broadcast doubles the number of senders each round, finishing in $\lceil\log_2 p\rceil$ steps. The same logarithmic idea powers reductions and the ring all-reduce that synchronises gradients in distributed training. Slide the process count and compare the patterns.

processes p 8

linear broadcast—

tree broadcast—

ring all-reduce—

recursive doubling—

Top: linear (root → each). Bottom: tree (log₂ rounds).

Module 3 · Performance

7. OpenMP loop scheduling

A #pragma omp parallel for splits loop iterations across threads. static hands out fixed chunks up front; dynamic and guided dole work out on demand. When the per-iteration cost is uneven (triangular), static leaves some threads idle — load imbalance, the ratio $\text{maxLoad}/\text{meanLoad}$, climbs above 1. Dynamic and guided keep it near 1.

iterations 32 threads 4 schedule chunk 2 uneven work (cost = i+1)

load imbalance—

Each row is a thread; blocks are its iterations.

Module 5 · Advanced

8. GPU SIMT warps & divergence

A GPU runs threads in lock-step groups of 32 called warps. A block of threads needs $\lceil t/32\rceil$ warps; if $t$ isn't a multiple of 32 the last warp wastes its spare lanes. Worse, when threads in one warp take different branches the hardware serialises the paths, so divergence efficiency falls to $1/\text{paths}$. Slide the thread count and branch paths to see both costs.

threads 100 divergent paths 1

warps needed—

wasted lanes—

divergence efficiency—

32 lanes per warp; grey lanes are wasted.

Module 4 · Parallel algorithms

9. MapReduce word count

MapReduce is the canonical data-parallel pattern: map turns each input record into key/value pairs, shuffle groups pairs by key, and reduce folds each group to one result. The classic example is word count — the map and reduce phases run embarrassingly in parallel, the shuffle is the only communication. Edit the text and watch the three phases recompute.

input text (one document per line / sentence)

documents—

distinct words—

top word—

map → shuffle → reduce, drawn as three columns.

Module 6 · Frontiers

10. Qubits & quantum gates

α (amp of |0⟩)—

β (amp of |1⟩)—

P(0) · P(1)—

gate sequence: (none)