
From-Scratch Build · 02 · Parallel Computing

HPC Course

How do you make a program run on a thousand cores at once? This is a practical, notebook-driven course on high-performance computing — Slurm, OpenMP, MPI and CUDA — rebuilt from scratch to actually learn the tools instead of just reading about them.

Slurm · OpenMP · MPI · CUDA / GPU · Jupyter · Colab-runnable

What it is

From one core to a cluster

Most code runs on a single core and waits. High-Performance Computing is the craft of splitting work so that many cores — across many machines — chew through it together. This course is the hands-on companion to a two-volume HPC textbook: a set of Jupyter notebooks that walk from "what is a supercomputer" all the way to writing GPU kernels.

What makes it practical is that the exercises are meant to be submitted to a real HPC cluster through a scheduler, while the lighter notebooks also run in Google Colab. So you learn the same workflow a researcher uses: write code, ask the scheduler for nodes, wait in the queue, collect results.

The core idea I wanted to learn: parallelism isn't one technique, it's a ladder. Threads share memory (OpenMP), processes pass messages (MPI), GPUs run thousands of tiny threads (CUDA), and a scheduler (Slurm) hands out the hardware. Each rung solves a different bottleneck.
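To make "splitting work" concrete, here is a minimal sketch in plain Python of block decomposition, the arithmetic that decides which worker handles which slice of a loop. No MPI or OpenMP is involved: block_range and partial_sums are hypothetical helpers, the workers run one after another, and rank and size are ordinary integers standing in for the values a real MPI program would obtain from MPI_Comm_rank and MPI_Comm_size.

```python
from __future__ import annotations


def block_range(n_items: int, rank: int, size: int) -> tuple[int, int]:
    """Return the half-open [start, stop) slice of n_items owned by `rank`.

    Hypothetical helper: the first (n_items % size) workers take one extra
    item, so slice lengths differ by at most one.
    """
    if size <= 0 or not 0 <= rank < size:
        raise ValueError("need size > 0 and 0 <= rank < size")
    base, extra = divmod(n_items, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def partial_sums(data: list[float], size: int) -> list[float]:
    """Each simulated worker sums only its own slice (executed serially here)."""
    sums = []
    for rank in range(size):
        start, stop = block_range(len(data), rank, size)
        sums.append(sum(data[start:stop]))
    return sums


if __name__ == "__main__":
    data = [float(i) for i in range(10)]
    parts = partial_sums(data, size=4)
    print(parts)                     # one partial result per simulated worker
    print(sum(parts) == sum(data))   # combining the partials recovers the total
```

On a real machine each rung of the ladder changes who the workers are (threads, processes, GPU threads), but the question of who owns which indices stays the same.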

The stack

Tools of the trade

These six are the working vocabulary of most HPC centres. Rebuilding the course meant getting each one to actually run.

scheduler

Slurm

The cluster's traffic controller. You write a job script saying how many nodes and how long you need, then sbatch it and wait in the queue.

shared memory

OpenMP

Add a #pragma omp parallel for directive above a loop and its iterations are divided among threads running on the cores of one machine. The gentlest way into parallelism.

distributed

MPI

Message Passing Interface. Separate processes — possibly on different machines — coordinate by sending each other data. How you scale past one node.

accelerators

CUDA / OpenACC

Offload the heavy math to a GPU's thousands of cores. OpenACC eases you in with pragmas; CUDA gives full control.

environment

JupyterHub

Notebooks load straight from raw GitHub URLs into a hub running on the cluster — the delivery vehicle for every lesson.

tuning

Profiling

Measuring before optimising: where are the cycles going, is it compute- or memory-bound, what's the speed-up vs. ideal.
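Of the tools above, the CUDA entry hides the least obvious idea: every GPU thread computes a global index from its block and thread coordinates and handles a single array element. The sketch below emulates that mapping serially in plain Python, so it shows the arithmetic but none of the speed. The function name emulate_vector_add and the default of 256 threads per block are illustrative assumptions; the loop variables mirror CUDA's built-in blockIdx, blockDim and threadIdx.

```python
def emulate_vector_add(a, b, threads_per_block=256):
    """Serially emulate a 1-D CUDA launch that computes out[i] = a[i] + b[i]."""
    n = len(a)
    if len(b) != n:
        raise ValueError("a and b must have the same length")
    out = [0.0] * n

    # Ceiling division: launch enough blocks to cover all n elements.
    num_blocks = (n + threads_per_block - 1) // threads_per_block

    for block_idx in range(num_blocks):              # stands in for blockIdx.x
        for thread_idx in range(threads_per_block):  # stands in for threadIdx.x
            # Global index, as in blockIdx.x * blockDim.x + threadIdx.x
            i = block_idx * threads_per_block + thread_idx
            if i < n:  # guard: the last block may reach past the end of the array
                out[i] = a[i] + b[i]
    return out


if __name__ == "__main__":
    print(emulate_vector_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], threads_per_block=2))
```

With three elements and two threads per block, two blocks are launched and the fourth thread is skipped by the bounds check, which is the same reason real kernels carry that guard.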

Syllabus

Three modules, building up

The course climbs the parallelism ladder one module at a time.

  1. Foundations & Architecture

    What HPC is and how it evolved, the anatomy of a cluster, resource management and performance metrics, cloud & containerised HPC, and a real application domain (health & neuroscience). First contact with Slurm.

  2. Parallel Programming

    The heart of it: OpenMP for shared-memory threading, a deep dive into MPI for distributed work, and GPU computing with OpenACC and CUDA basics.

  3. Performance Tuning

    Making it actually fast — profiling, finding bottlenecks, and measuring speed-up against the theoretical ideal.
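As a sketch of what "measuring speed-up against the theoretical ideal" can mean in numbers: speed-up is serial time divided by parallel time, efficiency is speed-up per core, and Amdahl's law gives the ceiling when only a fraction of the run time can be parallelised. The function names and the timings in the usage lines are invented for illustration; they are not measurements from the course.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Measured speed-up: how many times faster the parallel run was."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, cores: int) -> float:
    """Fraction of the ideal (linear) speed-up actually achieved."""
    return speedup(t_serial, t_parallel) / cores


def amdahl_limit(parallel_fraction: float, cores: int) -> float:
    """Amdahl's law: best possible speed-up when only `parallel_fraction`
    of the serial run time benefits from the extra cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Made-up timings: 120 s on one core, 30 s on eight cores.
    print(speedup(120.0, 30.0))              # 4.0
    print(efficiency(120.0, 30.0, 8))        # 0.5
    print(round(amdahl_limit(0.9, 8), 2))    # 4.71
```

Read together, the three numbers say that a measured speed-up of 4 on 8 cores is only half of linear scaling, yet already close to the ceiling of about 4.7 if 10% of the program is inherently serial.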

How it runs

The cluster workflow

In my rebuild I leaned on the Colab-runnable notebooks where I had no cluster, and mocked the Slurm submission step locally to understand the job-script anatomy.
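One way to study job-script anatomy without a cluster is to generate the script as text and read the resource request back out of it. The sketch below is an assumption about what such a local mock could look like, not the course's own tooling: render_job_script and mock_submit are hypothetical helpers, the job name, resource numbers and the ./hello_mpi command are placeholders, and nothing here calls sbatch. The #SBATCH options shown (--job-name, --nodes, --ntasks-per-node, --time, --output) are standard sbatch options.

```python
def render_job_script(job_name, nodes, ntasks_per_node, time_limit, command):
    """Build the text of a Slurm batch script (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",                      # machines requested
        f"#SBATCH --ntasks-per-node={ntasks_per_node}",  # processes per machine
        f"#SBATCH --time={time_limit}",                  # wall-clock limit, HH:MM:SS
        f"#SBATCH --output={job_name}-%j.out",           # %j expands to the job ID
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


def mock_submit(script):
    """Stand-in for submission: parse the #SBATCH directives into a dict
    instead of handing the script to a scheduler."""
    prefix = "#SBATCH --"
    requested = {}
    for line in script.splitlines():
        if line.startswith(prefix):
            key, _, value = line[len(prefix):].partition("=")
            requested[key] = value
    return requested


if __name__ == "__main__":
    script = render_job_script("hello", 2, 4, "00:10:00", "srun ./hello_mpi")
    print(script)
    print(mock_submit(script))
```

On a real cluster the rendered text would be saved to a file and passed to sbatch; the mock only makes the request visible: how many nodes, how many tasks, and for how long.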

Reflection

What rebuilding it taught me