HPC Course — Built From Scratch

What it is

From one core to a cluster

Most code runs on a single core and waits. High-Performance Computing is the craft of splitting work so that many cores — across many machines — chew through it together. This course is the hands-on companion to a two-volume HPC textbook: a set of Jupyter notebooks that walk from "what is a supercomputer" all the way to writing GPU kernels.

What makes it practical is that the exercises are meant to be submitted to a real HPC cluster through a scheduler, while the lighter notebooks also run in Google Colab. So you learn the same workflow a researcher uses: write code, ask the scheduler for nodes, wait in the queue, collect results.

The core idea I wanted to learn: parallelism isn't one technique, it's a ladder. Threads share memory (OpenMP), processes pass messages (MPI), GPUs run thousands of tiny threads (CUDA), and a scheduler (Slurm) hands out the hardware. Each rung solves a different bottleneck.

The stack

Tools of the trade

These six are the vocabulary of every HPC centre. Rebuilding the course meant getting each one to actually run.

scheduler

Slurm

The cluster's traffic controller. You write a job script saying how many nodes and how long you need, then sbatch it and wait in the queue.

shared memory

OpenMP

Add a #pragma omp parallel for above a loop whose iterations are independent, and it runs across all the cores of one machine. The gentlest way into parallelism.

distributed

MPI

Message Passing Interface. Separate processes — possibly on different machines — coordinate by sending each other data. How you scale past one node.

accelerators

CUDA / OpenACC

Offload the heavy math to a GPU's thousands of cores. OpenACC eases you in with pragmas; CUDA gives full control.

environment

JupyterHub

Notebooks load straight from raw GitHub URLs into a hub running on the cluster — the delivery vehicle for every lesson.

tuning

Profiling

Measuring before optimising: where are the cycles going, is it compute- or memory-bound, what's the speed-up vs. ideal.
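
One common way to answer the "compute- or memory-bound" question is the roofline model. The course description does not name it, so treat this as my own illustration rather than the course's method. The sketch below is minimal, and the machine numbers (1000 GFLOP/s peak, 100 GB/s memory bandwidth) are made up for the example.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline bound: a kernel is capped by peak compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


def bound_kind(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Classify a kernel by comparing its arithmetic intensity to the ridge point."""
    ridge = peak_gflops / bandwidth_gbs  # flops per byte where the two limits meet
    return "compute-bound" if flops_per_byte >= ridge else "memory-bound"


# Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
PEAK, BANDWIDTH = 1000.0, 100.0

# Double-precision vector add c[i] = a[i] + b[i]:
# 1 flop per element, 24 bytes moved (two 8-byte reads, one 8-byte write).
vector_add_intensity = 1 / 24

print(bound_kind(PEAK, BANDWIDTH, vector_add_intensity))
print(attainable_gflops(PEAK, BANDWIDTH, vector_add_intensity))
```

On this imaginary machine the vector add is limited by memory bandwidth long before it reaches peak compute, which is the kind of result a profiler would then confirm or contradict.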

Syllabus

Three modules, building up

The course climbs the parallelism ladder one module at a time.

Foundations & Architecture
What HPC is and how it evolved, the anatomy of a cluster, resource management and performance metrics, cloud & containerised HPC, and a real application domain (health & neuroscience). First contact with Slurm.
Parallel Programming
The heart of it: OpenMP for shared-memory threading, a deep dive into MPI for distributed work, and GPU computing with OpenACC and CUDA basics.
Performance Tuning
Making it actually fast — profiling, finding bottlenecks, and measuring speed-up against the theoretical ideal.
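
"Speed-up against the theoretical ideal" in the tuning module comes down to a few lines of arithmetic. This is a minimal sketch: the timings are invented for illustration, and Amdahl's law is one standard choice of bound, not necessarily the one the course uses.

```python
def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the serial baseline."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, workers):
    """Fraction of the ideal (linear) speed-up actually achieved."""
    return speedup(t_serial, t_parallel) / workers


def amdahl_limit(parallel_fraction, workers):
    """Amdahl's law: best possible speed-up when only part of the run parallelises."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical measurements: 120 s on one core, 20 s on eight cores.
t1, t8, cores = 120.0, 20.0, 8

print(f"speed-up:   {speedup(t1, t8):.2f}x (ideal {cores}x)")
print(f"efficiency: {efficiency(t1, t8, cores):.0%}")
# If 90% of the run time can be parallelised, eight cores cannot beat this:
print(f"Amdahl cap: {amdahl_limit(0.9, cores):.2f}x")
```

The serial timing is the baseline everything else is divided by, so it has to be measured on the same input and the same hardware.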

How it runs

The cluster workflow

Open a notebook from a URL. JupyterHub pulls each lesson straight from raw GitHub — no cloning, no setup drift.
Write the parallel code. An OpenMP loop, an MPI rank exchange, a CUDA kernel.
Submit to Slurm. Wrap it in a job script, request nodes and a time limit, sbatch, and watch the queue.
Collect & profile. Read the output, time it, compare against the serial baseline, then tune.

In my rebuild I leaned on the Colab-runnable notebooks where I had no cluster, and mocked the Slurm submission step locally to understand the job-script anatomy.
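
The job-script anatomy can be explored without a cluster by generating the script as text. This is a rough sketch of that idea, not the exact mock I used: render_job_script is a hypothetical helper, and the job name and the ./hello_mpi binary are placeholders. The #SBATCH options themselves are standard Slurm directives.

```python
def render_job_script(job_name, nodes, tasks_per_node, time_limit, command):
    """Build the text of a Slurm batch script (hypothetical helper, nothing is submitted)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",                     # how many machines
        f"#SBATCH --ntasks-per-node={tasks_per_node}",  # processes per machine
        f"#SBATCH --time={time_limit}",                 # wall-clock limit, HH:MM:SS
        "#SBATCH --output=%x-%j.out",                   # %x = job name, %j = job id
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Placeholder values: two nodes, four tasks each, ten minutes.
script = render_job_script(
    job_name="hello-mpi",
    nodes=2,
    tasks_per_node=4,
    time_limit="00:10:00",
    command="srun ./hello_mpi",
)
print(script)
```

On a real cluster the printed text would be saved to a file and handed to sbatch. Locally, reading it is enough to see that a job script is a shell script with a resource request in its header comments.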

Reflection

What rebuilding it taught me

The scheduler is the real interface. You rarely "log into a supercomputer" and run things — you describe a job and wait. Slurm is the user experience of HPC.
Shared vs. distributed memory is the fork in the road. OpenMP and MPI aren't competitors; they answer "is my data on one machine or many?", and larger codes often combine them, with MPI between nodes and OpenMP within each.
GPUs reward a different shape of problem. Thousands of tiny identical operations fly; branchy, sequential logic stalls.
Measure first, optimise second. The performance-tuning module is the whole point — speed-up is meaningless without a baseline.
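
The "one machine or many" fork from the list above shows up in code as a decomposition problem: once data lives in separate processes, each one has to know which slice it owns. The sketch below assumes a simple block decomposition. block_range is a hypothetical helper, and the ranks are simulated in a plain loop so it runs anywhere, Colab included. In a real MPI program each process would call it with its own rank and the partial sums would be combined with a reduction.

```python
def block_range(n_items, rank, size):
    """Return the half-open [start, stop) slice of n_items owned by `rank` out of `size`."""
    base, extra = divmod(n_items, size)
    # The first `extra` ranks take one additional item so the load stays balanced.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


data = list(range(10))  # toy workload: ten numbers to sum
size = 4                # pretend there are four processes

slices = [block_range(len(data), rank, size) for rank in range(size)]
partial_sums = [sum(data[start:stop]) for start, stop in slices]

print(slices)
print(partial_sums, "->", sum(partial_sums))
```

With shared memory every thread can index into the same list directly. With distributed memory each process holds only its slice, and the final combining step becomes an explicit message.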
