From-Scratch Build · 02 · Parallel Computing
How do you make a program run on a thousand cores at once? This is a practical, notebook-driven course on high-performance computing — Slurm, OpenMP, MPI and CUDA — rebuilt from scratch to actually learn the tools instead of just reading about them.
What it is
Most code runs on a single core and waits. High-Performance Computing is the craft of splitting work so that many cores — across many machines — chew through it together. This course is the hands-on companion to a two-volume HPC textbook: a set of Jupyter notebooks that walk from "what is a supercomputer" all the way to writing GPU kernels.
What makes it practical is that the exercises are meant to be submitted to a real HPC cluster through a scheduler, while the lighter notebooks also run in Google Colab. So you learn the same workflow a researcher uses: write code, ask the scheduler for nodes, wait in the queue, collect results.
The core idea I wanted to learn: parallelism isn't one technique, it's a ladder. Threads share memory (OpenMP), processes pass messages (MPI), GPUs run thousands of tiny threads (CUDA), and a scheduler (Slurm) hands out the hardware. Each rung solves a different bottleneck.
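To make "splitting work" concrete, here is a minimal sketch of block decomposition, the bookkeeping that sits underneath most parallel loops. The names `block_range` and `partial_sum` are hypothetical and do not come from the course notebooks. The workers here run one after another in plain Python; on a real machine each would be an OpenMP thread or an MPI process.

```python
def block_range(n_items: int, n_workers: int, rank: int) -> range:
    """Return the contiguous block of indices owned by worker `rank`."""
    base, extra = divmod(n_items, n_workers)
    # The first `extra` workers each take one additional item,
    # so block sizes never differ by more than one.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


def partial_sum(data, n_workers, rank):
    """Work done by a single worker: sum only its own block."""
    return sum(data[i] for i in block_range(len(data), n_workers, rank))


data = list(range(1, 11))   # the values 1..10
n_workers = 4

# Each worker computes its share independently...
partials = [partial_sum(data, n_workers, r) for r in range(n_workers)]
# ...and a final "reduction" combines the partial results.
total = sum(partials)

print(partials, total)
```

The same two steps, independent partial work followed by a combining reduction, reappear on every rung of the ladder; what changes is how the workers share or exchange their data.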
The stack
These six are the vocabulary of every HPC centre. Rebuilding the course meant getting each one to actually run.
Slurm: the cluster's traffic controller. You write a job script saying how many nodes you need and for how long, submit it with sbatch, and wait in the queue.
OpenMP: add a #pragma and a loop runs across all the cores of one machine. The gentlest way into parallelism.
MPI: the Message Passing Interface. Separate processes, possibly on different machines, coordinate by sending each other data. This is how you scale past one node.
GPU computing: offload the heavy math to a GPU's thousands of cores. OpenACC eases you in with pragmas; CUDA gives full control.
Jupyter: notebooks load straight from raw GitHub URLs into a hub running on the cluster. This is the delivery vehicle for every lesson.
Profiling: measuring before optimising. Where are the cycles going, is the code compute- or memory-bound, and what is the speed-up compared with the ideal?
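One common way to answer the "compute- or memory-bound" question is the roofline model. The source does not say the course uses it, so treat this as an illustrative sketch. The machine figures (1000 GFLOP/s peak, 100 GB/s bandwidth) and the function names are hypothetical.

```python
def attainable_gflops(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Roofline bound: the lower of the compute peak and bandwidth * intensity.

    `intensity` is arithmetic intensity in FLOP per byte moved to or from memory.
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


def classify(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> str:
    """Label a kernel by which side of the ridge point it falls on."""
    ridge = peak_gflops / bandwidth_gbs  # FLOP/byte where the two limits meet
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical machine, not a real cluster specification.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

# Double-precision vector add c[i] = a[i] + b[i]:
# 1 FLOP per 24 bytes (two 8-byte loads, one 8-byte store),
# ignoring caches and write-allocate traffic.
vector_add_intensity = 1 / 24

print(classify(vector_add_intensity, PEAK_GFLOPS, BANDWIDTH_GBS))
print(attainable_gflops(vector_add_intensity, PEAK_GFLOPS, BANDWIDTH_GBS))
```

If a kernel lands on the memory-bound side, adding more cores to the arithmetic will not help much; the useful work is in moving less data.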
Syllabus
The course climbs the parallelism ladder one module at a time.
What HPC is and how it evolved, the anatomy of a cluster, resource management and performance metrics, cloud & containerised HPC, and a real application domain (health & neuroscience). First contact with Slurm.
The heart of it: OpenMP for shared-memory threading, a deep dive into MPI for distributed work, and GPU computing with OpenACC and CUDA basics.
Making it actually fast — profiling, finding bottlenecks, and measuring speed-up against the theoretical ideal.
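Speed-up against the ideal can be sketched in a few lines. The timings below are invented for illustration and are not measurements from the course. Amdahl's law is included as one standard notion of a theoretical bound; the 10% serial fraction is an assumption.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """How many times faster the parallel run is than the serial one."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, n_cores: int) -> float:
    """Fraction of the ideal (linear) speed-up actually achieved."""
    return speedup(t_serial, t_parallel) / n_cores


def amdahl_limit(serial_fraction: float, n_cores: int) -> float:
    """Amdahl's law: best possible speed-up when part of the work stays serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


# Hypothetical wall-clock times in seconds, keyed by core count.
timings = {1: 120.0, 2: 64.0, 4: 36.0, 8: 24.0}
t1 = timings[1]

print("cores  time(s)  speedup  ideal  efficiency  amdahl(10% serial)")
for cores, t in timings.items():
    s = speedup(t1, t)
    e = efficiency(t1, t, cores)
    a = amdahl_limit(0.10, cores)
    print(f"{cores:>5} {t:>8.1f} {s:>8.2f} {cores:>6} {e:>10.0%} {a:>12.2f}")
```

Efficiency falling as cores are added is the usual signal to go back to the profiler rather than to ask the scheduler for more nodes.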
How it runs
Submit the job script with sbatch, and watch the queue.
In my rebuild I leaned on the Colab-runnable notebooks where I had no cluster, and mocked the Slurm submission step locally to understand the job-script anatomy.
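A local mock of the submission step might look something like the sketch below. It is not the mock used in the rebuild: the job script, the program name `hello_mpi`, and the functions `parse_sbatch_directives` and `mock_sbatch` are all hypothetical. The `#SBATCH` options shown are standard Slurm flags, and only the long `--key=value` form is handled.

```python
import re

# A hypothetical job script; `hello_mpi` is a placeholder program name.
JOB_SCRIPT = """\
#!/bin/bash
#SBATCH --job-name=hello_mpi
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:10:00
#SBATCH --output=hello_%j.out

srun ./hello_mpi
"""


def parse_sbatch_directives(script: str) -> dict:
    """Collect `#SBATCH --key=value` lines into a dictionary of strings."""
    directives = {}
    for line in script.splitlines():
        match = re.match(r"#SBATCH\s+--([\w-]+)=(\S+)", line)
        if match:
            directives[match.group(1)] = match.group(2)
    return directives


def mock_sbatch(script: str) -> dict:
    """Pretend to submit: summarise what the job requests without running it."""
    d = parse_sbatch_directives(script)
    nodes = int(d.get("nodes", 1))
    tasks_per_node = int(d.get("ntasks-per-node", 1))
    return {
        "job_name": d.get("job-name", "unnamed"),
        "nodes": nodes,
        "total_tasks": nodes * tasks_per_node,
        "time_limit": d.get("time"),
    }


print(mock_sbatch(JOB_SCRIPT))
```

Reading the script this way makes the anatomy visible: the `#SBATCH` lines describe the resources requested from the scheduler, and everything after them is an ordinary shell script that runs once those resources are granted.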
Reflection