Course Structure · High Performance Computing

Overview

What this course is

An introduction to high-performance computing with a modern perspective: parallel coding, distributed systems, and applications in machine and deep learning. It covers the evolution and fundamentals of HPC, architectural design, resource management and performance optimization, with heavy emphasis on the parallel-programming essentials — MPI and OpenMP — plus GPU computing and cloud-based HPC. The curriculum builds toward distributed ML/DL at scale and closes on the frontiers of the field: neuromorphic, quantum, and post-exascale computing. Theory alternates with hands-on sessions; the MiniWeather stencil project is one such hands-on artefact.

Program: BCSAI (Bachelor in Computer Science & AI)

Course code: HPC-CSAI.3.M.A

Area: Computer Science

Sessions: 30

Credits: 6.0 ECTS

Academic year: 2025–26

Degree course: Third

Semester: 1st

Category: Compulsory

Language: English

Professor: Dr. Oscar Diez Gonzalez

Contact: odiezg@faculty.ie.edu

Learning objectives

What you should be able to do

The course is built around six competency themes. By the end, students command HPC hardware and software, the major parallel paradigms (MPI, OpenMP, OpenACC), performance measurement and optimization, and an informed view of where supercomputing is heading.

Foundational knowledge & skills

Explore the evolution, principles and architecture of HPC systems; understand resource management, performance metrics and the dynamics of parallel systems; set up environments and write simple parallel codes, with cloud-based applications in health and neurosciences.

Advanced parallel computing

Master distributed vs. shared-memory concepts, MPI programming and OpenMP multithreading; a special focus on GPU computing and OpenACC/CUDA, integrating MPI with OpenMP for complex simulations such as CFD.

Performance optimization

Optimize code for diverse HPC architectures; use profiling and analysis tools; exploit the memory hierarchy and data locality; leverage high-performance libraries, parallel I/O and data-management strategies.

Parallel algorithms & ML

Design scalable parallel algorithms for big data, high-performance data structures and visualization; understand distributed ML and scalable deep learning, culminating in a distributed-ML implementation project.

Advanced parallel programming

Work with hybrid-computing patterns, advanced MPI and OpenMP, and accelerators such as FPGAs; apply them to climate and terrestrial-systems modelling using hybrid programming approaches.

Frontiers of HPC

Survey neuromorphic and quantum computing and the challenges of post-exascale computing, anticipating the future direction of supercomputing.

Methodology & assessment

How time and grades are split

IE's method is collaborative, active and applied. The 150-hour workload (6 ECTS) is distributed across lectures, discussion, in-class and asynchronous exercises, group work and individual study. Grading blends continuous assessment with two exams.

Learning activities · 150 h

Lectures: 60.0 hours (40.0%)

Exercises / async / field work: 30.0 hours (20.0%)

Individual studying: 25.0 hours (16.7%)

Discussions: 20.0 hours (13.3%)

Group work: 15.0 hours (10.0%)

Assessment weighting

Final exam (comprehensive, all modules): 30%

Individual work (per-module assignments): 15%

Intermediate tests (quizzes + intermediate exam): 15%

Workgroups (paper + oral presentation): 15%

Mid-course exam (first half of the course): 15%

Class participation (attendance mandatory): 10%

Attendance: 80% minimum, or both calls for the year are forfeited. GenAI policy: permitted for research, ideation and proofreading with acknowledgement — not for assignments, coding, group submissions or exams.

What each component asks of you

Class participation

10%

Deliverable: active, informed contribution to discussions and in-class activities across the semester.

Evaluated on: quality and consistency of engagement; attendance is mandatory and feeds this component.

Individual work

15%

Deliverable: one individual assignment per module applying lecture concepts to a real problem or case study.

Evaluated on: correctness, applied understanding and clarity of the solution.

Intermediate tests

15%

Deliverable: regular short quizzes plus an intermediate exam checking comprehension of recent lectures.

Evaluated on: progressive mastery; designed to reinforce continuous learning.

Workgroups

15%

Deliverable: a group project on a selected HPC topic — a written paper plus an oral presentation (Session 30).

Evaluated on: the design, implementation and optimisation of an HPC solution, and the quality of the presentation.

Mid-course exam

15%

Deliverable: a written individual exam held mid-semester covering the first half of the course (Modules 1–3).

Evaluated on: theoretical understanding of the foundational and parallel-essentials material.

Final exam

30%

Deliverable: a comprehensive written exam at the end of the course (Session 30) spanning all six modules.

Evaluated on: both theoretical and applied understanding; the single largest component.

Pass & re-sit rules: students have four chances across two academic years (ordinary + extraordinary June/July re-sits). Failing the ordinary call leads to a comprehensive June/July re-sit graded on that exam alone (continuous assessment is dropped), with a minimum passing grade of 5 and a maximum of 8.0; the third-call re-take exam can reach 10.0. Students below the 80% attendance threshold forfeit both calls and must re-enrol the following year. Re-sits require physical presence on campus (Segovia or Madrid); grade appeals require attending the exam review session first.

Program · 6 modules · 30 sessions

The full course, session by session

Every numbered session from the syllabus, grouped by module. Live in-person sessions carry a Live marker; self-paced ones an Async marker. Tags flag the parallel-computing tools each session uses: the same MPI / OpenMP / CUDA stack that the MiniWeather stencil project in this repo implements four ways.

Module 1

Foundational knowledge & skills

The groundwork: the evolution, principles and architecture of HPC systems, resource management, performance metrics, cloud-based HPC and a first pass at writing simple parallel codes. This module answers the orienting questions — what makes a machine “high performance,” how those machines are organised, how their time is shared between many users, and how you log in and run your first parallel job.

By the end of Module 1 you can

Explain what distinguishes HPC from ordinary computing and read a supercomputer’s key metrics (FLOP/s, memory bandwidth, interconnect latency).
Classify a machine using Flynn’s taxonomy and recognise the main parallel-architecture families, including heterogeneous CPU+GPU nodes.
Describe how a batch scheduler allocates nodes and interpret common performance benchmarks (LINPACK / HPL, HPCG).
Outline the cloud HPC service models and weigh their trade-offs against on-premise clusters.
Set up an HPC environment, compile, and submit a first simple parallel program to a queue.

01 · Live

1.1

Evolution and fundamentals of HPC

Lay the groundwork for the whole course: structure, assignments, grading and class dynamics. Core HPC concepts, the anatomy of a supercomputer, key performance indicators and a short history of supercomputing.

What HPC is
Supercomputer anatomy
Key performance indicators
History of supercomputing
Assignment methodology

What HPC is: aggregating many processors, fast memory and a low-latency interconnect to solve one problem far faster than a single machine could. Supercomputer anatomy: compute nodes, accelerators, a high-speed network (InfiniBand / Slingshot), a parallel filesystem and a login/scheduler layer. Key performance indicators: peak vs. sustained FLOP/s, memory bandwidth, and interconnect latency/bandwidth. History: from Cray vector machines through Beowulf commodity clusters to today’s GPU-accelerated pre-exascale and exascale systems (LUMI, Leonardo, Frontier).

Key idea — FLOP/s: a system rated at 1 PFLOP/s performs 10¹⁵ floating-point operations per second; the TOP500 ranks machines by sustained HPL (LINPACK) performance, which is always below the theoretical peak.
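The gap between theoretical peak and sustained HPL performance can be made concrete with a little arithmetic. The sketch below is illustrative only: the node count, core count, clock rate, FLOPs per cycle and sustained figure are hypothetical numbers, not measurements of any real machine.

```python
def peak_flops(nodes, cores_per_node, clock_hz, flops_per_cycle):
    """Theoretical peak: every core retires its maximum FLOPs on every cycle."""
    return nodes * cores_per_node * clock_hz * flops_per_cycle


def hpl_efficiency(rmax, rpeak):
    """Fraction of the theoretical peak that a sustained (HPL) run achieved."""
    return rmax / rpeak


# Hypothetical machine: 1,000 nodes x 64 cores x 2.0 GHz x 16 FLOPs per cycle.
rpeak = peak_flops(1_000, 64, 2.0e9, 16)
print(f"theoretical peak = {rpeak / 1e15:.3f} PFLOP/s")

# Hypothetical sustained result of 1.5 PFLOP/s.
print(f"HPL efficiency   = {hpl_efficiency(1.5e15, rpeak):.2f}")
```

An efficiency below 1.0 is the expected outcome: the peak assumes no core ever waits on memory or the network.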

Read: Sterling et al., ch. 1–2 — a comprehensive, accessible orientation to what supercomputers are and why they matter; sets vocabulary used for the rest of the course.

02 · Live

1.2

Architectural overview of HPC systems

Demystify the parallel-architecture families through Flynn's taxonomy and the technologies enabling supercomputing today, ending on multiprocessors and heterogeneous computing.

Flynn's taxonomy
Parallel architecture families
Processors
Multiprocessors
Heterogeneous computing

Flynn’s taxonomy classifies machines by how many instruction and data streams run at once — SISD, SIMD, MISD and MIMD. Architecture families span shared-memory SMPs, distributed-memory clusters and vector/SIMD units. Heterogeneous computing pairs general-purpose CPUs with throughput-oriented accelerators (GPUs) on the same node.

Key idea — SIMD vs. MIMD: a GPU is essentially a huge SIMD engine (one instruction, thousands of data elements), whereas an MPI cluster is MIMD (each rank runs its own instruction stream over its own data). Most real HPC codes combine both.

Project link: the four MiniWeather backends — serial, cache-blocked, CPU-parallel, GPU — are exactly the heterogeneous-architecture spectrum introduced here.

Read: Sterling et al., ch. 3–4 (architecture); Intro to HPC for Scientists & Engineers, ch. 1 — a concise tour of mainstream parallel-computer architecture and where the performance comes from.

03 · Live

1.3

Resource management & performance metrics in parallel systems

Operational HPC: job scheduling strategies, resource allocation and the performance benchmarks that keep parallel environments running efficiently.

Job scheduling
Resource allocation
Performance benchmarks
Cluster operation

Job scheduling — a batch scheduler (SLURM, PBS) queues jobs and matches their resource requests (nodes, cores, GPUs, walltime) to free hardware, balancing throughput against fairness. Performance benchmarks — HPL/LINPACK stresses dense floating-point; HPCG stresses sparse, memory-bound kernels; STREAM measures memory bandwidth — together they bracket how a real code will behave.

Key idea — backfill scheduling: short jobs are slotted into gaps ahead of a large queued job as long as they don’t delay its reserved start, which is why an accurate walltime estimate gets you scheduled sooner.
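The backfill rule can be reduced to a two-part test. This is a deliberately simplified sketch, not how SLURM or PBS implement it: the function name, its parameters and the queue state are all hypothetical, and real schedulers also weigh priorities, partitions and per-node reservations.

```python
def can_backfill(now, walltime, nodes_needed, free_nodes, reserved_start):
    """A queued job may jump ahead only if it fits in the idle nodes AND its
    requested walltime ends before the large job's reserved start time."""
    fits_in_gap = nodes_needed <= free_nodes
    ends_in_time = now + walltime <= reserved_start
    return fits_in_gap and ends_in_time


# Hypothetical state: 8 idle nodes, a large job reserved to start at t = 120 min.
tight_request = can_backfill(now=0, walltime=90, nodes_needed=4,
                             free_nodes=8, reserved_start=120)
padded_request = can_backfill(now=0, walltime=180, nodes_needed=4,
                              free_nodes=8, reserved_start=120)
print(tight_request, padded_request)
```

The second job is identical except for a padded walltime request, and that padding alone keeps it out of the gap.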

Project link: MiniWeather ships SLURM *.sbatch / *.slurm scripts for a Magic Castle cluster — job scheduling in practice.

Read: Sterling et al., ch. 5 (resource management) and the SLURM quickstart documentation — the operational backdrop for every later hands-on session.

04 · Live

1.4

Introduction to cloud-based HPC and virtualization

How cloud computing reshaped HPC accessibility and scalability: virtualization basics, cloud service models, and the benefits and challenges of cloud HPC in scientific workflows.

Virtualization
Cloud service models
Cloud HPC trade-offs
Scientific workflows

Virtualization lets one physical machine present many isolated virtual ones; lightweight containers (Singularity/Apptainer, Docker) package an application’s whole software stack for reproducibility. Service models — IaaS, PaaS and SaaS — trade control for convenience. Trade-offs — cloud gives elastic capacity and zero capital cost but adds virtualization overhead, slower interconnects and egress fees that can hurt tightly-coupled MPI jobs.

Key idea — elasticity vs. coupling: embarrassingly-parallel workloads (parameter sweeps, ML inference) scale beautifully on cloud; latency-bound MPI codes with frequent halo exchange often still favour a dedicated low-latency cluster.

Read: Sterling et al., ch. on cloud & commodity clusters; vendor HPC whitepapers (AWS ParallelCluster, Azure CycleCloud) — concrete cloud-HPC reference architectures.

05 · Async

1.5

HPC in health & neurosciences · environment setup & simple parallel codes

Set up HPC environments and write first parallel codes, with examples from health and neuroscience research. Includes the individual and group assignment explanation.

Environment setup
First parallel codes
Health & neuroscience cases
Assignment briefing

Environment setup — load compiler and MPI modules (module load), build with mpicc/nvcc, and submit through the scheduler. First parallel codes — a “hello, ranks” MPI program and a simple OpenMP loop that establish the mental model of many workers cooperating. Health & neuroscience cases — genomics pipelines, medical-image reconstruction and large-scale brain simulation show why this compute matters.

Key idea — concrete companion: the MiniWeather project in this repo is the worked example you keep returning to — clone it, build the serial backend, then watch the parallel ones reproduce the same field faster.

individual + group assignments

Read: Robey & Zamora, ch. 1–2 — a hands-on on-ramp to building and running parallel code, ideal for the first assignment.

Module 2

Parallel computing essentials

Distributed vs. shared memory, MPI and OpenMP, GPU computing with OpenACC/CUDA, and a CFD case study integrating MPI + OpenMP — the heart of the parallel-programming skillset the MiniWeather project exercises. This is where the abstract architectures of Module 1 become code you actually write: decomposing a problem, communicating between workers, and choosing the right paradigm for the hardware in front of you.

By the end of Module 2 you can

Contrast shared- and distributed-memory models and pick the right one (or both) for a given problem and machine.
Write MPI programs using point-to-point and collective communication, and decompose a grid across ranks with a halo (ghost-cell) exchange.
Parallelise loops with OpenMP, choosing scheduling and reduction clauses and avoiding data races.
Map a data-parallel kernel onto a GPU with CUDA/OpenACC, reasoning about threads, blocks and coalesced memory access.
Combine MPI and OpenMP into a hybrid program and apply it to a CFD-style stencil computation.

06 · Live

2.1

Distributed and shared memory concepts

The core architectures of parallel computing: distributed vs. shared memory, their advantages, challenges and suitable scenarios, and how each shapes programming paradigms and performance.

Shared memory
Distributed memory
Programming paradigms
Performance trade-offs

Shared memory — all cores see one address space, so communication is just reading the same variable, but synchronization and cache coherence become the bottleneck (the OpenMP model). Distributed memory — each node owns private memory and must explicitly send messages to share data (the MPI model), which scales to thousands of nodes at the cost of programmer effort. Trade-offs centre on the cost of communication versus the cost of synchronization.

Key idea (Amdahl vs. Gustafson): for a fixed problem with serial fraction s, Amdahl’s law gives a speedup of 1/(s + (1 − s)/N) on N processors, which can never exceed 1/s; Gustafson’s law observes that in practice you grow the problem with N, so weak scaling stays useful well past Amdahl’s pessimistic ceiling.
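A short numerical comparison shows how differently the two laws behave. In this sketch s is the serial fraction and 1 − s the parallel fraction; the value s = 0.05 is an arbitrary illustration, and Gustafson's law is written in its usual scaled-speedup form s + (1 − s)·N.

```python
def amdahl_speedup(s, n):
    """Fixed problem size: the serial fraction s never shrinks."""
    return 1.0 / (s + (1.0 - s) / n)


def gustafson_speedup(s, n):
    """Scaled problem size: the parallel part grows with n."""
    return s + (1.0 - s) * n


s = 0.05  # illustrative serial fraction
for n in (1, 16, 256, 4096):
    print(f"N={n:5d}  Amdahl={amdahl_speedup(s, n):7.2f}  "
          f"Gustafson={gustafson_speedup(s, n):9.2f}")
print(f"Amdahl ceiling 1/s = {1.0 / s:.1f}")
```

However large N becomes, the Amdahl column stays below 1/s, while the Gustafson column keeps growing almost linearly.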

Project link: MiniWeather's halo exchange (halo.cpp) is the distributed-memory boundary problem made concrete.

Read: Robey & Zamora, ch. 3–4; Intro to HPC for Scientists & Engineers, ch. on parallel computers — clear treatment of the two memory models and the scaling laws.

07 · Live

2.2

Deep dive into MPI programming

MPI from basic to intermediate: communication patterns, data distribution and collective operations, with hands-on practice writing and running MPI programs on distributed-memory systems.

Point-to-point comms
Collective operations
Data distribution
Hands-on MPI

Point-to-point — MPI_Send/MPI_Recv (and their non-blocking MPI_Isend/MPI_Irecv forms) move data between two named ranks. Collectives — MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Reduce, MPI_Allreduce coordinate all ranks at once and are implemented with optimised tree algorithms. Data distribution — splitting a domain so each rank owns a contiguous block keeps communication local to neighbours.

Key idea — halo (ghost-cell) exchange: a rank pads its sub-grid with a border copied from its neighbours each step, so the stencil can be applied locally without per-cell messaging. Overlapping this exchange with interior computation (non-blocking sends) hides the communication latency.
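The halo pattern can be tried without an MPI installation by simulating the ranks inside one process. The sketch below is a 1D stand-in, not the project's halo.cpp: the list copies in `halo_exchange` play the role of the neighbour messages, and every function name is hypothetical.

```python
def stencil_step(u):
    """Serial reference: 3-point average on interior cells, end cells unchanged."""
    inner = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]


def split(u, nranks):
    """Give each simulated rank a contiguous block plus one ghost cell per side.
    Assumes len(u) is divisible by nranks."""
    size = len(u) // nranks
    return [[0.0] + u[r * size:(r + 1) * size] + [0.0] for r in range(nranks)]


def halo_exchange(blocks):
    """Copy each neighbour's edge cell into the local ghost cells."""
    for r, b in enumerate(blocks):
        if r > 0:
            b[0] = blocks[r - 1][-2]   # last owned cell of the left neighbour
        if r < len(blocks) - 1:
            b[-1] = blocks[r + 1][1]   # first owned cell of the right neighbour


def parallel_step(blocks):
    """One time step: exchange halos, then apply the stencil locally per rank."""
    halo_exchange(blocks)
    last = len(blocks) - 1
    new_blocks = []
    for r, b in enumerate(blocks):
        new = b[:]
        for i in range(1, len(b) - 1):
            # Global boundary cells keep their value, as in the serial version.
            if (r == 0 and i == 1) or (r == last and i == len(b) - 2):
                continue
            new[i] = (b[i - 1] + b[i] + b[i + 1]) / 3.0
        new_blocks.append(new)
    return new_blocks


def gather(blocks):
    """Reassemble the global field from the owned (non-ghost) cells."""
    return [x for b in blocks for x in b[1:-1]]


u = [float(i * i % 7) for i in range(12)]
blocks = parallel_step(split(u, nranks=3))
print(gather(blocks) == stencil_step(u))
```

After the exchange, each rank needs nothing from its neighbours until the next step, which is what makes the per-step cost one message per face rather than one per cell.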

MPI

Project link: stencil_cpu_parallel.cpp partitions the 3D grid across MPI ranks and exchanges halo layers each step.

Read: Robey & Zamora, ch. 8 (MPI); the MPI Forum standard / Using MPI reference for the call signatures used in the lab.

08 · Live

2.3

OpenMP for multithreading

Shared-memory parallelism with OpenMP directives: workload distribution, synchronization and memory management for multicore processors, with interactive performance experiments.

OpenMP directives
Workload distribution
Synchronization
Multicore scaling

Directives — #pragma omp parallel for spreads loop iterations across a team of threads sharing memory. Workload distribution — the schedule clause (static, dynamic, guided) trades load-balance against overhead. Synchronization — critical, atomic and reduction prevent two threads from clobbering the same memory. Scaling is limited by memory bandwidth and false sharing, not just core count.

Key idea — scheduling choice: use static when every iteration costs the same (best cache reuse, least overhead) and dynamic/guided when iteration cost is uneven — at the price of extra runtime bookkeeping.
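The trade-off can be explored with a toy model of the two policies. This is a simulation, not OpenMP itself: it assumes contiguous equal-sized chunks for static, a chunk size of 1 for dynamic, and zero scheduling overhead, so it shows only the load-balance side of the argument.

```python
import heapq


def static_makespan(costs, nthreads):
    """Static policy: contiguous, equal-sized chunks fixed before the loop runs."""
    chunk = -(-len(costs) // nthreads)  # ceiling division
    return max(sum(costs[t * chunk:(t + 1) * chunk]) for t in range(nthreads))


def dynamic_makespan(costs, nthreads):
    """Dynamic policy: the next iteration goes to whichever thread frees up first."""
    finish = [0.0] * nthreads
    heapq.heapify(finish)
    for c in costs:
        heapq.heappush(finish, heapq.heappop(finish) + c)
    return max(finish)


uniform = [1.0] * 16
uneven = [float(i) for i in range(1, 17)]  # iteration i costs i time units

print("uniform:", static_makespan(uniform, 4), dynamic_makespan(uniform, 4))
print("uneven: ", static_makespan(uneven, 4), dynamic_makespan(uneven, 4))
```

With uniform costs the two policies tie, so static wins once real bookkeeping overhead is counted; with uneven costs the thread holding the most expensive chunk dominates the static run.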

OpenMP

Project link: the same stencil_cpu_parallel.cpp threads across cores with OpenMP inside each MPI rank.

Read: Robey & Zamora, ch. 7 (OpenMP); the OpenMP API examples document — directive-by-directive walkthroughs.

09 · Live

2.4

Introduction to GPU computing · OpenACC/CUDA basics

GPU architecture and programming with CUDA and OpenACC: develop and optimize parallel algorithms that exploit the massive parallelism of GPUs for compute-intensive tasks.

GPU architecture
CUDA
OpenACC
Massive parallelism

GPU architecture — thousands of lightweight cores grouped into streaming multiprocessors, optimised for throughput over latency. CUDA exposes this explicitly: you write a kernel launched over a grid of thread blocks, where threads in a block share fast on-chip memory. OpenACC takes a directive-based shortcut (#pragma acc) that the compiler turns into GPU code, trading control for productivity.

Key idea — grid / block / warp: threads execute in groups of 32 (a warp) in lockstep; coalesced memory access (consecutive threads reading consecutive addresses) and using shared memory as a manual cache are the two biggest levers for GPU performance.
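The index arithmetic behind these terms is easy to check on the CPU. The sketch below models one warp of 32 threads in plain Python; the 128-byte segment size and 4-byte elements are assumptions chosen for illustration, and the function names are hypothetical.

```python
WARP_SIZE = 32


def global_index(block_idx, block_dim, thread_idx):
    """The usual 1D CUDA mapping: blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


def segments_touched(stride, elem_bytes=4, segment_bytes=128):
    """Count the distinct memory segments one warp touches when
    thread t reads element t * stride."""
    addresses = [t * stride * elem_bytes for t in range(WARP_SIZE)]
    return len({a // segment_bytes for a in addresses})


print(global_index(block_idx=2, block_dim=256, thread_idx=5))  # 2*256 + 5 = 517
for stride in (1, 2, 32):
    print(f"stride {stride:2d}: {segments_touched(stride)} segment(s) per warp")
```

Under these assumptions a stride-1 warp is served by a single segment, while a stride-32 warp needs one segment per thread for the same amount of useful data.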

CUDA · OpenACC

Project link: stencil_gpu.cu maps each grid cell to a CUDA thread with coalesced loads and shared-memory tiling.

Read: Robey & Zamora, ch. 9–12 (GPU/CUDA/OpenACC); the CUDA C++ Programming Guide, intro chapters.

10 · Async

2.5

Computational Fluid Dynamics (CFD) with HPC · MPI + OpenMP integration

Apply parallel computing to CFD: engage with simulations using MPI and OpenMP together, gaining practical experience in hybrid parallel programming across distributed and shared memory.

CFD simulations
Hybrid MPI + OpenMP
Fluid dynamics
Scientific computing

CFD discretises the governing fluid equations onto a grid and advances them step by step — a textbook HPC workload because each cell depends only on its neighbours. Hybrid MPI + OpenMP — MPI distributes the grid across nodes while OpenMP threads the work within each node, matching the two-level structure of modern clusters (many nodes, many cores per node).

Key idea — why hybrid: one MPI rank per node with OpenMP threads inside cuts the number of MPI messages and the memory spent on halo buffers, compared with one rank per core — often the difference between scaling and stalling at high core counts.

MPI · OpenMP

Project link: MiniWeather is itself a stencil-based fluid/heat solver — the closest analogue to this CFD case in the repo.

Read: Robey & Zamora, ch. on hybrid programming; the original MiniWeather mini-app README for a worked CFD-style stencil.

Module 3

Performance optimization in HPC

Optimizing code for HPC architectures: profiling and analysis, checkpointing, the memory hierarchy and data locality, high-performance libraries, parallel I/O and parallel filesystems, with cluster performance tuning. The recurring lesson of this module is that on modern hardware most scientific codes are limited by data movement, not arithmetic — so optimization is largely about keeping the right data close to the processor at the right time.

By the end of Module 3 you can

Profile a parallel application, read the results, and locate the dominant bottleneck before changing code.
Restructure loops (unrolling, vectorization, blocking) to raise arithmetic intensity and cache reuse.
Reason about the memory hierarchy and place a kernel on the roofline model to know whether it is compute- or memory-bound.
Use high-performance libraries (BLAS, LAPACK, FFTW) instead of hand-rolling numerical kernels.
Apply parallel I/O and checkpointing so large, long-running jobs are both scalable and fault-tolerant.

11 · Live

3.1

Optimizing code for HPC architectures

Strategies to fully exploit HPC architectures: code profiling, algorithmic choices and data-structure optimization, with loop unrolling, vectorization, memory-access patterns and cache utilization.

Loop unrolling
Vectorization
Memory-access patterns
Cache utilization

Loop unrolling reduces loop-control overhead and exposes more independent work to the pipeline. Vectorization (SIMD) packs several array elements into one wide register so a single instruction processes 4–16 values at once. Memory-access patterns — stride-1, contiguous access lets the hardware prefetch and fully use each cache line. Cache utilization is the master variable: a kernel that streams data once is fundamentally slower than one that reuses it.

Key idea — cache locality: a cache line is typically 64 bytes; accessing memory with stride 1 uses all of it, while a large stride wastes most of every line. The roofline model formalises this — performance is capped by min(peak FLOP/s, bandwidth × arithmetic intensity).
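Both halves of this idea reduce to one-line formulas. The machine figures below (1 TFLOP/s peak, 100 GB/s bandwidth) and the arithmetic intensities are hypothetical, and the utilisation function is an averaged model that ignores prefetching and alignment.

```python
def attainable_flops(peak_flops, bandwidth, intensity):
    """Roofline: the lower of the compute roof and the bandwidth roof.
    intensity is arithmetic intensity in FLOPs per byte moved."""
    return min(peak_flops, bandwidth * intensity)


def cache_line_utilisation(stride, elem_bytes=8, line_bytes=64):
    """Average fraction of each fetched cache line that a strided sweep uses."""
    per_line = line_bytes // elem_bytes
    return 1.0 / min(max(stride, 1), per_line)


PEAK = 1.0e12       # hypothetical 1 TFLOP/s
BANDWIDTH = 1.0e11  # hypothetical 100 GB/s

for name, ai in (("stencil-like kernel", 0.25), ("dense matrix multiply", 50.0)):
    gflops = attainable_flops(PEAK, BANDWIDTH, ai) / 1e9
    print(f"{name}: at most {gflops:.0f} GFLOP/s")

for stride in (1, 2, 8, 16):
    print(f"stride {stride:2d}: {cache_line_utilisation(stride):.3f} of each line used")
```

On this hypothetical machine the ridge point sits at 10 FLOPs per byte, so the low-intensity kernel is memory-bound at a small fraction of peak no matter how well its arithmetic is tuned.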

Project link: stencil_cpu_blocked.cpp tiles the loops so each block fits in L1/L2 — same FLOPs, far fewer cache misses.

Read: Intro to HPC for Scientists & Engineers, ch. on optimization & the memory hierarchy — the definitive treatment of single-node tuning; Barba & Forsyth for the Python angle.

12 · Live

3.2

Profiling & performance analysis tools · checkpointing

Tools and methodologies for profiling CPU and memory performance, plus checkpointing for fault tolerance in long-running computations — identifying and mitigating bottlenecks.

CPU/memory profiling
Bottleneck analysis
Checkpointing
Fault tolerance

Profiling tools (perf, gprof, Intel VTune, NVIDIA Nsight) sample where time and cache misses actually occur — replacing guesswork with evidence. Bottleneck analysis follows Amdahl: optimise the part that dominates the runtime, not the part that is easiest. Checkpointing periodically saves application state to disk so a multi-day job can restart after a node failure instead of starting over.

Key idea — measure first: intuition about hot spots is wrong surprisingly often. Profile, find the “5% of code that takes 80% of the time,” optimise that, then re-profile — optimization is a loop, not a one-shot.
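The checkpointing described in this session follows a simple save-and-resume loop. The sketch below is a toy under stated assumptions: the state is a small dictionary written as JSON to a temporary directory, the "node failure" is a raised exception, and the `run` function is hypothetical. Production codes would write large fields through parallel I/O instead.

```python
import json
import os
import tempfile


def run(total_steps, ckpt_path, every=10, fail_at=None):
    """Advance a toy simulation, saving state every `every` steps.
    If a checkpoint file exists, resume from it instead of from step 0."""
    state = {"step": 0, "value": 0.0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["value"] += 0.5 * state["step"]  # stand-in for one time step
        state["step"] += 1
        if state["step"] % every == 0:
            tmp = ckpt_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, ckpt_path)  # atomic rename: no half-written checkpoint
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "state.json")
try:
    run(100, ckpt, fail_at=57)   # dies after the step-50 checkpoint was written
except RuntimeError as err:
    print("job aborted:", err)
resumed = run(100, ckpt)         # picks up at step 50, not step 0
print(resumed)
```

Writing to a temporary file and renaming it means a crash during the write leaves the previous checkpoint intact. The checkpoint interval is the tuning knob: frequent saves cost I/O time, infrequent ones cost recomputation after a failure.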

Project link: MiniWeather's timer.cpp and profiling/ directory are where each backend's runtime is measured.

Read: Robey & Zamora, ch. on profiling tools; tool docs for perf and Nsight Systems — the practical reference for the profiling assignment.

13 · Live

3.3

Memory hierarchy & data locality · high-performance libraries

Principles of the memory hierarchy from registers to disk, strategies to maximize data locality, and the high-performance scientific libraries that provide optimized math primitives.

Memory hierarchy
Data locality
Scientific libraries
Optimized primitives

Memory hierarchy — registers, L1/L2/L3 cache, DRAM, then disk, each roughly an order of magnitude larger but slower than the last. Data locality — temporal (reuse data before it is evicted) and spatial (use neighbouring data already in the cache line) — is what keeps the fast levels busy. High-performance libraries — BLAS, LAPACK, FFTW, cuBLAS — encapsulate decades of tuning so you rarely beat them by hand.

Key idea — don’t reinvent the kernel: a vendor-tuned dgemm (matrix multiply) can run at >90% of peak FLOP/s; a naive triple loop reaches a few percent. Reach for the library before writing the loop.

Read: Intro to HPC for Scientists & Engineers, memory-hierarchy chapter; the BLAS/LAPACK and FFTW user guides.

14 · Live

3.4

Parallel I/O & high-throughput data management · parallel filesystems

Parallel I/O challenges and solutions: parallel filesystem architecture, best practices for scalable I/O, and managing large datasets in distributed HPC environments.

Parallel I/O
Parallel filesystems
Scalable I/O
Large datasets

Parallel I/O lets many ranks read or write one shared file at once through MPI-IO and self-describing formats (HDF5, NetCDF, ADIOS) rather than each rank opening its own file. Parallel filesystems (Lustre, GPFS/Spectrum Scale, BeeGFS) stripe a file across many storage servers so aggregate bandwidth scales with the cluster.

Key idea — I/O can dominate: at scale, writing checkpoints and results often costs more wall-clock time than computing them. Collective, striped I/O to a parallel filesystem — not thousands of small per-rank files — is what keeps that cost in check.

Read: Sterling et al., storage & I/O chapters; the HDF5 and Lustre best-practices guides.

15 · Async

3.5

Molecular systems & material sciences · performance tuning on clusters

Apply optimization in a real scientific context: case studies in computational chemistry and physics, with assignments tuning simulation-code performance on HPC clusters.

Computational chemistry
Material sciences
Cluster tuning
Simulation codes

Computational chemistry & materials — molecular dynamics (GROMACS, LAMMPS) and electronic-structure codes (VASP, Quantum ESPRESSO) are among the largest consumers of supercomputer time. Cluster tuning — the assignment applies this module’s techniques to a real simulation: profile it, fix the dominant bottleneck, and document the speedup with a scaling study.

Key idea — strong vs. weak scaling: a strong-scaling study fixes the problem and adds cores (watch for the Amdahl ceiling); a weak-scaling study grows the problem with the cores (watch for communication overhead). Report both.
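Both studies boil down to an efficiency computed from wall-clock timings. The timings below are invented purely to show the calculation; in a real study they would come from runs on the cluster.

```python
def strong_efficiency(t1, tn, n):
    """Fixed problem size: the ideal time on n cores is t1 / n."""
    return t1 / (n * tn)


def weak_efficiency(t1, tn):
    """Problem size grows with n: the ideal time stays equal to t1."""
    return t1 / tn


# Hypothetical wall-clock times in seconds, keyed by core count.
strong = {1: 100.0, 4: 27.0, 16: 9.0}
weak = {1: 100.0, 4: 104.0, 16: 125.0}

for n in sorted(strong):
    print(f"{n:3d} cores  strong={strong_efficiency(strong[1], strong[n], n):.2f}  "
          f"weak={weak_efficiency(weak[1], weak[n]):.2f}")
```

Reporting both columns side by side makes it clear whether a code is held back by its serial fraction (strong efficiency drops first) or by communication that grows with the machine (weak efficiency drops).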

tuning assignment

Read: domain mini-app docs (GROMACS / LAMMPS performance guides); Robey & Zamora’s scaling-study chapter for the experimental method.

Module 4

Parallel algorithms & machine learning

Scalable parallel algorithms for big data, high-performance data structures and visualization, and the fundamentals of distributed ML and deep learning at scale — culminating in a distributed-ML project. This module turns the parallel-computing machinery of Modules 2–3 toward data-intensive and AI workloads, the fastest-growing use of supercomputers today.

By the end of Module 4 you can

Design a scalable parallel algorithm and express a data-processing job in the MapReduce model.
Choose data structures (and layouts such as structure-of-arrays) that stay efficient in parallel and distributed settings.
Distinguish data parallelism from model parallelism and explain how each scales a training job across nodes.
Identify the bottlenecks of large-scale deep-learning training and the techniques (gradient accumulation, batch scaling, mixed precision) that address them.
Implement and run a distributed ML model end-to-end as the module project.

16 · Live

4.1

Scalable parallel algorithms for Big Data · MapReduce

Designing algorithms that scale across many processing units, centred on the MapReduce model for distributed data processing, with real-world big-data case studies.

Scalable algorithms
MapReduce
Distributed data
Big data

Scalable algorithms minimise communication and synchronization so adding nodes keeps paying off. MapReduce expresses a computation as a map (transform records in parallel) followed by a reduce (aggregate by key), with the framework (Hadoop, Spark) handling distribution, shuffling and fault tolerance. Big data case studies — log analysis, indexing, ETL — show the pattern at web scale.

Key idea (move compute to the data): when datasets are too large to relocate, MapReduce schedules the map tasks on the nodes that already hold each data block, so the raw input never crosses the network; only the typically much smaller intermediate key/value pairs are shuffled to the reducers.
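The map, shuffle and reduce phases can be shown with the classic word count running in a single process. This is a sketch of the programming model only: the function names are hypothetical, and a real framework such as Hadoop or Spark would run each phase across many nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Map: turn one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all values by key (done over the network in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


def map_reduce(records):
    pairs = [kv for record in records for kv in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(map_reduce(["to be or not to be", "to scale or not"]))
```

Because each record is mapped independently and each key is reduced independently, both phases parallelise without any coordination beyond the shuffle.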

Read: Dean & Ghemawat’s original MapReduce paper; Robey & Zamora’s data-parallel-algorithms chapter.

17 · Live

4.2

Data structures for high performance · visualization

Data structures that maximize efficiency in parallel and distributed environments, plus the principles and tools of scientific visualization for interpreting complex HPC datasets.

HP data structures
Scientific visualization
Parallel efficiency
Data interpretation

HP data structures — cache-friendly layouts (structure-of-arrays over array-of-structures), space-filling curves and distributed hash tables keep parallel access contention-free. Scientific visualization — turning multi-gigabyte fields into images and isosurfaces with tools such as ParaView and VisIt, often rendered in parallel in situ while the simulation runs.

Key idea — SoA vs. AoS: storing each field as its own contiguous array (structure-of-arrays) lets the CPU vectorise and the GPU coalesce; interleaving fields per cell (array-of-structures) defeats both. Layout is a performance decision, not a style choice.
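The difference between the two layouts shows up directly in the byte offsets of one field. The sketch below assumes cells with three 8-byte fields packed without padding; the helper names are hypothetical.

```python
def field_offsets_aos(n_cells, n_fields, field, elem_bytes=8):
    """Array-of-structures: fields are interleaved cell by cell."""
    return [(i * n_fields + field) * elem_bytes for i in range(n_cells)]


def field_offsets_soa(n_cells, field, elem_bytes=8):
    """Structure-of-arrays: each field occupies its own contiguous array."""
    return [(field * n_cells + i) * elem_bytes for i in range(n_cells)]


# Byte offsets of field 0 for the first four cells, with 3 fields per cell.
print("AoS:", field_offsets_aos(n_cells=4, n_fields=3, field=0))
print("SoA:", field_offsets_soa(n_cells=4, field=0))
```

A loop over a single field strides by 24 bytes in the AoS layout but by 8 bytes in SoA, so consecutive iterations (or consecutive GPU threads) read adjacent memory only in the second case.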

Project link: the MiniWeather web page itself is the visualization layer — 2D slices, 3D isosurfaces and an animated GIF of the field.

Read: Robey & Zamora’s data-layout chapter; the ParaView guide for the visualization workflow mirrored in this repo.

18 · Live

4.3

Fundamentals of distributed machine learning

Foundations of distributed ML: data parallelism, model parallelism and strategies for scaling ML across multiple nodes, plus the frameworks and libraries that enable it.

Data parallelism
Model parallelism
Multi-node scaling
ML frameworks

Data parallelism replicates the model on every worker, feeds each a different shard of the batch, and averages gradients with an all-reduce — the most common strategy. Model parallelism splits a model too large for one device across several (tensor and pipeline parallelism). Frameworks — PyTorch DDP/FSDP, Horovod, DeepSpeed — implement these patterns over MPI-style collectives.

Key idea — all-reduce is the heartbeat: synchronous data-parallel training is essentially a giant MPI_Allreduce on the gradients every step, so the same interconnect and collective-algorithm concerns from Module 2 govern how well training scales.
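Synchronous data parallelism can be imitated in a few lines by treating list entries as workers. In this sketch the model is a one-parameter linear fit, the two shards are equal in size, and `allreduce_mean` is a hypothetical stand-in for a sum all-reduce followed by division by the worker count. No real communication library is involved.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error for y = w * x on one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values):
    """Every worker contributes one value and every worker receives the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
shards = [data[0:2], data[2:4]]                          # one shard per worker

w, lr = 0.0, 0.05
for _ in range(50):
    grads = allreduce_mean([local_gradient(w, s) for s in shards])
    w -= lr * grads[0]  # all workers apply the same update, so replicas stay in sync
print(f"learned w = {w:.6f}")
```

With equal-sized shards the averaged gradient equals the full-batch gradient, which is why the replicas behave like a single model trained on the whole batch.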

Read: the Horovod and PyTorch-Distributed documentation; survey papers on data vs. model parallelism.

19 · Live

4.4

Deep learning at scale

Scaling deep-learning models on HPC: training large neural networks with gradient accumulation, batch-size scaling and accelerators such as GPUs and TPUs.

Large neural networks
Gradient accumulation
Batch-size scaling
GPUs & TPUs

Gradient accumulation sums gradients over several micro-batches before updating, simulating a large batch on limited memory. Batch-size scaling trades a larger effective batch (more parallelism) against optimisation quality, usually managed with a learning-rate warmup. Accelerators, namely GPUs with tensor cores and TPUs with matrix units, both running mixed precision (FP16/BF16), supply the raw throughput.

Key idea — the memory wall of training: activations, weights and optimiser state often exceed a single GPU’s memory, which is precisely why gradient accumulation, mixed precision and model-sharding (ZeRO/FSDP) exist.
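Gradient accumulation is easiest to trust once you see that it reproduces the large-batch gradient. The sketch below uses a one-parameter linear model with a mean-squared-error loss; the data and function names are hypothetical, and no deep-learning framework is involved.

```python
def grad_sum(w, batch):
    """Sum (not mean) of per-sample squared-error gradients for y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)


def full_batch_grad(w, batch):
    """Reference: the whole batch processed at once."""
    return grad_sum(w, batch) / len(batch)


def accumulated_grad(w, batch, micro_size):
    """Process one micro-batch at a time, accumulating before any update."""
    total = 0.0
    for start in range(0, len(batch), micro_size):
        total += grad_sum(w, batch[start:start + micro_size])
    return total / len(batch)  # normalise once, by the effective batch size


batch = [(float(x), 3.0 * x + 1.0) for x in range(8)]
print(full_batch_grad(0.5, batch), accumulated_grad(0.5, batch, micro_size=2))
```

Only one micro-batch of activations has to be resident at a time, which is the memory saving; the price is more sequential passes per optimiser step.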

Read: Goyal et al. “Accurate, Large Minibatch SGD”; the DeepSpeed/ZeRO documentation.

20 · Async

4.5

Implementing a distributed ML model Async

Apply distributed-ML and deep-learning-at-scale concepts in a hands-on implementation of a distributed machine-learning model.

Distributed ML
Hands-on implementation
Scaling in practice

Bring it together: take a model and dataset, distribute training across multiple GPUs or nodes, and measure the speedup and accuracy trade-offs. The deliverable is a working distributed-training run plus a short scaling analysis — the same strong/weak-scaling discipline from Module 3 applied to ML.

Key idea — scaling efficiency: report throughput (samples/s) and scaling efficiency (speedup ÷ number of devices), not just “it ran on 4 GPUs” — efficiency below ~0.8 usually points to a communication or input-pipeline bottleneck.
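One possible shape for that report, assuming throughput has been measured in samples per second at several device counts including a single-device baseline. The measurements shown are invented placeholders, and the 0.8 threshold is the rule of thumb from the text.

```python
# Sketch: turn measured throughputs into speedup and scaling efficiency.
# The numbers in `measured` are hypothetical, not results from any real run.

def scaling_report(throughput_by_devices):
    """Map {num_devices: samples_per_second} to a list of (n, speedup, efficiency)."""
    base = throughput_by_devices[1]                 # single-device baseline is required
    report = []
    for n in sorted(throughput_by_devices):
        speedup = throughput_by_devices[n] / base
        report.append((n, speedup, speedup / n))
    return report

measured = {1: 100.0, 2: 190.0, 4: 300.0}
report = scaling_report(measured)
flagged = [n for n, _, efficiency in report if efficiency < 0.8]

for n, speedup, efficiency in report:
    print(f"{n} devices: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

In this made-up example the 4-device run is the one worth profiling: its efficiency falls below the threshold even though its raw throughput is the highest.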

distributed-ML project

Read: the framework tutorials selected for your stack (PyTorch DDP / Horovod); revisit Module 4 sessions 18–19.

Module 5

Advanced parallel programming

Hybrid-computing patterns, advanced MPI and OpenMP, and accelerators like FPGAs — applied to climate and terrestrial-systems modelling, the domain MiniWeather itself belongs to. Where Module 2 taught each paradigm in isolation, this module is about composing them and squeezing out the last increments of performance on the largest machines.

By the end of Module 5 you can

Recognise common hybrid-computing patterns and decide how to layer MPI, OpenMP and GPU offload in one application.
Use advanced MPI features — non-blocking and one-sided (RMA) communication, persistent requests, dynamic processes — to overlap and optimise communication.
Apply advanced OpenMP — tasks, SIMD directives and target device offload — beyond simple parallel loops.
Explain where FPGAs and other specialised accelerators win, and select an accelerator for a given workload.
Build a hybrid MPI/OpenMP application for a climate / terrestrial-systems model — exactly MiniWeather’s territory.

21 · Live

5.1

Patterns in hybrid computing Live

Where multiple parallel paradigms coexist in one application: common hybrid patterns integrating MPI with OpenMP and GPU computing, shown through case studies.

Hybrid patterns
MPI + OpenMP + GPU
Paradigm integration

Hybrid patterns map each paradigm onto the level of hardware it fits best: MPI between nodes, OpenMP among the cores of a node, and CUDA/OpenACC on the GPU within a node. Integration requires care at the seams: for example, initialising MPI with MPI_Init_thread at a sufficient thread-support level (MPI_THREAD_FUNNELED if only the main thread calls MPI, MPI_THREAD_MULTIPLE if any thread may) and checking that GPU buffers can be sent directly (GPU-aware MPI).

Key idea — MPI+X: the dominant pattern on today’s machines is “MPI+X,” where X is OpenMP or CUDA. One MPI rank per GPU (or per socket), threads/kernels inside, and overlap of halo exchange with on-device compute is the recipe behind most exascale codes.
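The layering can be illustrated by how a grid is carved up. This sketch only computes index ranges: an assumed 1-D grid of `nx` cells is split into blocks per MPI rank, and each block is split again per thread, the way a static OpenMP schedule would. No MPI or OpenMP calls are made, and the function names are hypothetical.

```python
# Sketch: two-level (rank, then thread) block decomposition of a 1-D grid.

def block_range(n_items, n_parts, part):
    """Half-open [start, stop) of `part` when n_items are split as evenly as possible."""
    base, extra = divmod(n_items, n_parts)
    start = part * base + min(part, extra)
    stop = start + base + (1 if part < extra else 0)
    return start, stop

def hybrid_layout(nx, n_ranks, threads_per_rank):
    """Return {rank: [(start, stop) per thread]} in global cell indices."""
    layout = {}
    for rank in range(n_ranks):
        r0, r1 = block_range(nx, n_ranks, rank)           # cells owned by this rank
        layout[rank] = []
        for thread in range(threads_per_rank):
            t0, t1 = block_range(r1 - r0, threads_per_rank, thread)
            layout[rank].append((r0 + t0, r0 + t1))       # shift to global indices
    return layout

layout = hybrid_layout(nx=10, n_ranks=2, threads_per_rank=2)
```

Only the outer (rank) boundaries need halo exchange over the network; the inner (thread) boundaries sit in shared memory, which is the practical reason for putting MPI on the outside and threads on the inside.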

MPI · OpenMP · CUDA

Project link: MiniWeather's four backends together demonstrate this hybrid stack — MPI across nodes, OpenMP across cores, CUDA on the GPU.

Read: Robey & Zamora’s hybrid-programming chapter; case studies of MPI+OpenMP+CUDA codes (e.g. MiniWeather and other DOE mini-apps).

22 · Live

5.2

Advanced MPI Live

Beyond the basics: dynamic process management, one-sided communications and persistent communication requests for optimizing large-scale parallel applications.

Dynamic process mgmt
One-sided comms
Persistent requests

Dynamic process management spawns or connects ranks at runtime for elastic or master/worker workloads. One-sided communication (RMA: MPI_Put/MPI_Get) lets a rank read or write a neighbour’s memory window without the neighbour posting a matching receive, decoupling the two. Persistent requests set up a repeated communication pattern once and reuse it, cutting per-message overhead in iterative codes.

Key idea — overlap to hide latency: non-blocking and one-sided calls let computation proceed while messages are in flight; the art of advanced MPI is arranging the code so the network is never the thing everyone waits on.
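A simple timing model shows what overlap can buy. It is an idealised sketch under the assumption that interior cells need no halo data and that communication progresses fully in the background; real codes rarely achieve perfect overlap. The function names and the timings are hypothetical.

```python
# Sketch: idealised per-step time with and without communication/computation overlap.

def step_time_blocking(t_interior, t_boundary, t_comm):
    """Blocking exchange: wait for the halo, then compute everything."""
    return t_comm + t_interior + t_boundary

def step_time_overlapped(t_interior, t_boundary, t_comm):
    """Non-blocking exchange: update interior cells while messages are in flight,
    then finish the boundary cells once the halo has arrived."""
    return max(t_interior, t_comm) + t_boundary

# Hypothetical timings in milliseconds.
blocking = step_time_blocking(t_interior=8.0, t_boundary=1.0, t_comm=3.0)
overlapped = step_time_overlapped(t_interior=8.0, t_boundary=1.0, t_comm=3.0)
hidden_fraction = (blocking - overlapped) / 3.0   # share of t_comm that was hidden
```

In this model communication is completely hidden as long as the interior work takes longer than the exchange; once `t_comm` exceeds `t_interior`, the network becomes the thing everyone waits on again.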

MPI

Read: Using Advanced MPI (Gropp et al.); the MPI-3/4 standard sections on RMA and persistent collectives.

23 · Live

5.3

Advanced OpenMP Live

Task-based parallelism, SIMD directives and device-memory management for GPU accelerators — enhancing the performance and scalability of shared-memory applications.

Task parallelism
SIMD directives
Device memory

Task parallelism (#pragma omp task) expresses irregular, recursive or producer/consumer work that doesn’t fit a simple parallel loop. SIMD directives (#pragma omp simd) tell the compiler a loop is safe to vectorise. Device offload (#pragma omp target, with map clauses managing device memory) moves data to a GPU and runs kernels there using the same OpenMP model, a portable alternative to CUDA.

Key idea — one model, many devices: OpenMP target offload lets a single annotated source target multicore CPUs and GPUs alike, trading a little peak performance for portability across vendors.

OpenMP

Read: the OpenMP 5.x specification chapters on tasks and device constructs; tutorials on target offload.

24 · Live

5.4

Accelerators: FPGAs and specialized processors Live

FPGA architecture and programming models, plus other specialized processors in HPC, and how to select the right accelerator for a given workload.

FPGAs
Programming models
Specialized processors
Accelerator selection

FPGAs are reconfigurable chips whose logic you tailor to one algorithm, achieving high performance-per-watt for fixed, streaming dataflow. Programming models — HLS (High-Level Synthesis) and OpenCL — raise FPGA development above raw hardware description. Specialised processors — TPUs, vector engines, AI ASICs — each win on a narrow workload. Selection is a matching problem: data-parallel and floating-point heavy → GPU; fixed, low-latency streaming → FPGA/ASIC.

Key idea — performance per watt: at exascale, power is the binding constraint. Specialised accelerators win not only on raw speed but on energy efficiency, which is why heterogeneous machines dominate the Green500.

Read: Sterling et al., accelerators chapter; vendor HLS / OpenCL-for-FPGA introductions.

25 · Async

5.5

Climate & terrestrial systems modelling · hybrid MPI/OpenMP Async

Apply advanced parallel programming to climate and terrestrial-systems modelling, integrating MPI and OpenMP and possibly GPU accelerators or FPGAs to enhance performance.

Climate modelling
Terrestrial systems
Hybrid MPI/OpenMP
Accelerators

Apply the whole module to an environmental model: a grid-based atmosphere/land simulation parallelised with hybrid MPI + OpenMP and, optionally, GPU or FPGA acceleration. The exercise mirrors how production weather and climate codes (WRF, E3SM, ICON) are actually built and scaled.

Key idea — stencils are the kernel of climate codes: atmospheric dynamics reduce to repeatedly updating each cell from its neighbours — exactly the 7-point stencil MiniWeather implements — so the parallelisation lessons here transfer directly to real climate modelling.
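To make the neighbour-update idea concrete, here is a deliberately reduced sketch: a 1-D, 3-point averaging stencil rather than the project's own stencil, with the grid split between two simulated "ranks" that each keep one halo cell. The halo exchange is a plain assignment standing in for an MPI send/receive pair, and all names are hypothetical.

```python
# Sketch: a 1-D 3-point stencil run on one grid and on two halves with halo cells.

def stencil_step(u):
    """Average each interior cell with its two neighbours; end cells stay fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return new

def run_global(u, steps):
    """Reference: the whole grid on a single 'rank'."""
    for _ in range(steps):
        u = stencil_step(u)
    return u

def run_two_ranks(u, steps):
    """Split the grid in half; each half carries one halo cell from its neighbour."""
    mid = len(u) // 2
    left = u[:mid] + [u[mid]]            # owned cells + right halo
    right = [u[mid - 1]] + u[mid:]       # left halo + owned cells
    for _ in range(steps):
        left, right = stencil_step(left), stencil_step(right)
        # Halo exchange: refresh each halo with the neighbour's updated edge cell.
        left[-1], right[0] = right[1], left[-2]
    return left[:-1] + right[1:]         # drop halos and stitch the halves together

u0 = [0.0, 0.0, 0.0, 9.0, 9.0, 0.0, 0.0, 0.0]
same = run_two_ranks(u0, 5) == run_global(u0, 5)
```

Because each half applies the same arithmetic to the same values, the decomposed run reproduces the single-grid result exactly; a wider stencil or more dimensions changes the halo width and the number of neighbours, not the pattern.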

MPI · OpenMP · modelling project

Project link: this is precisely MiniWeather's domain — a mini-weather (climate/atmosphere) stencil solver built with hybrid MPI + OpenMP + CUDA.

Read: the MiniWeather mini-app paper/README; overview articles on WRF or E3SM parallelisation.

Module 6

Frontiers of HPC

The future: neuromorphic and quantum computing, the post-exascale era, a pre-exam review, and the final exam with group-project presentations. Having mastered today’s machines, the module steps back to ask where the field is heading once Moore’s-law scaling of conventional silicon runs out.

By the end of Module 6 you can

Explain the principles of neuromorphic computing and the workloads where event-driven, brain-inspired hardware is efficient.
Define the core quantum-computing concepts — qubits, superposition, gates, entanglement — and name where quantum algorithms promise an advantage.
Discuss the technical and energy challenges of the post-exascale era and the AI–HPC convergence reshaping it.
Synthesise the whole course and connect each module back to the MiniWeather companion project.
Present an end-to-end HPC solution and defend its design, implementation and optimisation choices.

26 · Live

6.1

Neuromorphic computing Live

Brain-inspired computing: the design and application of neuromorphic systems that mimic neural architectures for efficient sensory processing and machine learning, with their potential and limits.

Neuromorphic systems
Neural architectures
Efficiency
Limitations

Neuromorphic systems (Intel Loihi, SpiNNaker, IBM TrueNorth) implement spiking neural networks directly in hardware, computing only when a neuron “fires.” This event-driven, in-memory style is extraordinarily energy-efficient for sparse, sensory and edge workloads. Limitations — a young software stack and a narrow class of well-suited problems.

Key idea — collapse the memory wall: neuromorphic chips co-locate compute and memory and fire only on events, sidestepping the von Neumann bottleneck that dominates the energy cost of conventional HPC.
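The event-driven style can be sketched with a single leaky integrate-and-fire neuron, a common textbook abstraction of a spiking neuron. This is software only, with assumed parameter values; it does not model how Loihi, SpiNNaker or TrueNorth are actually built.

```python
# Sketch: an event-driven leaky integrate-and-fire neuron.
# Work is done only when an input spike arrives; idle time costs nothing.

def lif_event_driven(spike_times, weight=0.6, leak=0.9, threshold=1.0):
    """Return the times at which the neuron fires, given sorted input spike times."""
    potential = 0.0
    last_t = 0
    fired_at = []
    for t in spike_times:                     # iterate over events, not clock ticks
        potential *= leak ** (t - last_t)     # apply the decay accumulated since last event
        potential += weight                   # integrate the incoming spike
        last_t = t
        if potential >= threshold:
            fired_at.append(t)
            potential = 0.0                   # reset after firing
    return fired_at

dense = lif_event_driven([0, 1, 10, 11, 12])  # closely spaced inputs add up
sparse = lif_event_driven([0, 20, 40])        # widely spaced inputs leak away
```

The loop length is the number of spikes, not the number of time steps, which is the software analogue of hardware that consumes energy only when something happens.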

Read: survey articles on Loihi/SpiNNaker; Sterling et al., emerging-architectures discussion.

27 · Live

6.2

Quantum computing fundamentals & applications Live

The basics of quantum computing — qubits, quantum gates, entanglement — plus quantum algorithms and their applications in cryptography, optimization and simulation.

Qubits
Quantum gates
Entanglement
Quantum algorithms

Qubits hold a superposition of 0 and 1; n qubits span a 2ⁿ-dimensional state space, the source of quantum parallelism. Gates are reversible unitary operations; entanglement correlates qubits so they can no longer be described independently. Algorithms such as Shor (factoring, a superpolynomial speedup over the best known classical methods), Grover (unstructured search, a quadratic speedup) and quantum simulation target problems that are hard classically, though today’s noisy devices (NISQ) limit practical scale.
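A tiny state-vector simulation illustrates qubits, gates and entanglement together. It is a classical sketch using plain Python lists, with the assumed convention that qubit 0 is the most significant bit of the basis index; the function names are hypothetical and unrelated to Qiskit's API.

```python
# Sketch: build a Bell state with a Hadamard and a CNOT on a 2-qubit state vector.
import math

def apply_hadamard(state, qubit, n_qubits):
    """Apply H to one qubit of an n-qubit state (list of 2**n amplitudes)."""
    h = 1.0 / math.sqrt(2.0)
    bit = 1 << (n_qubits - 1 - qubit)        # qubit 0 is the most significant bit
    new = state[:]
    for i in range(len(state)):
        if not i & bit:                      # visit each (bit = 0, bit = 1) pair once
            a, b = state[i], state[i | bit]
            new[i] = h * (a + b)
            new[i | bit] = h * (a - b)
    return new

def apply_cnot(state, control, target, n_qubits):
    """Flip the target qubit in every basis state where the control qubit is 1."""
    cbit = 1 << (n_qubits - 1 - control)
    tbit = 1 << (n_qubits - 1 - target)
    new = state[:]
    for i in range(len(state)):
        if i & cbit:
            new[i] = state[i ^ tbit]
    return new

state = [1.0, 0.0, 0.0, 0.0]                 # |00>
state = apply_hadamard(state, 0, 2)          # (|00> + |10>) / sqrt(2)
bell = apply_cnot(state, 0, 1, 2)            # (|00> + |11>) / sqrt(2)
probabilities = [abs(a) ** 2 for a in bell]
```

The final state puts equal weight on 00 and 11 and none on 01 or 10, so measuring one qubit fixes the other. The list doubles in length with every added qubit, which is why classical simulation runs out of memory after a few dozen qubits.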

Key idea — HPC + quantum, not vs.: the near-term role is hybrid — classical supercomputers orchestrate, pre/post-process and error-mitigate quantum co-processors, much as they orchestrate GPUs today. (The course professor leads the EC HPC & Quantum unit.)

Read: Nielsen & Chuang, introductory chapters; the Qiskit textbook for a hands-on view of gates and algorithms.

28 · Live

6.3

Advanced computing: toward the post-exascale era Live

Beyond exascale: emerging architectures, software paradigms and the integration of AI and big-data analytics into HPC workflows, and the considerations driving the next generation.

Post-exascale
Emerging architectures
AI + HPC convergence
Future paradigms

Post-exascale — with the first exaflop machines (Frontier, El Capitan, JUPITER in Europe) here, the frontier shifts from peak FLOP/s to energy efficiency, resilience and usability. Emerging architectures — chiplets, near-/in-memory compute, optical interconnects. AI–HPC convergence — simulation and learned surrogate models increasingly run in the same workflow, blurring the line between the two communities.

Key idea — power is the wall: an exascale system already draws ~20–30 MW; the next gains must come from efficiency (specialised hardware, lower precision, smarter algorithms) rather than simply adding more conventional silicon.
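The arithmetic behind this constraint is short enough to write down. The sketch takes the 20–30 MW range quoted above and asks what a hypothetical tenfold larger machine would draw at unchanged efficiency; the function names and the 10 EFLOP/s target are illustrative assumptions.

```python
# Sketch: energy efficiency of an exascale system and the power a larger one would need.

EXAFLOP = 1e18   # FLOP/s

def gflops_per_watt(flops_per_second, power_megawatts):
    """Energy efficiency in GFLOP/s per watt."""
    return flops_per_second / (power_megawatts * 1e6) / 1e9

def power_needed_mw(flops_per_second, efficiency_gflops_per_watt):
    """Power in MW needed to sustain a FLOP/s rate at a given efficiency."""
    return flops_per_second / (efficiency_gflops_per_watt * 1e9) / 1e6

eff_at_20mw = gflops_per_watt(EXAFLOP, 20.0)      # about 50 GFLOP/s per watt
eff_at_30mw = gflops_per_watt(EXAFLOP, 30.0)      # about 33 GFLOP/s per watt
ten_exaflop_power = power_needed_mw(10 * EXAFLOP, eff_at_20mw)
```

At today's efficiency a 10 EFLOP/s machine would need roughly 200 MW, which is the quantitative reason the text points to specialised hardware, lower precision and better algorithms rather than more of the same silicon.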

Read: EuroHPC and US Exascale Computing Project reports; recent TOP500/Green500 commentary.

29 · Live

6.4

Review session · pre-exam Live

Synthesize the course: clarify doubts, revisit challenging topics and consolidate knowledge across all modules in preparation for the final examination.

Concept review
Doubt clarification
Synthesis
Exam prep

Consolidate the full arc — from supercomputer anatomy through MPI/OpenMP/CUDA, performance optimization, distributed ML and the frontiers — clarifying doubts and connecting threads across modules in preparation for the final exam.

Key idea — one running example: use MiniWeather as a revision spine — it touches heterogeneous architecture, halo exchange, cache blocking, GPU kernels and visualization, so explaining its four backends rehearses most of the syllabus at once.

Review: the Key concepts glossary below and your per-module assignments.

30 · Live

—

Exam & group assignments Live

The final session: a written exam plus presentation of group projects, where students demonstrate mastery through the design, implementation and optimization of HPC solutions for real-world problems.

Final written exam
Group presentations
End-to-end HPC solutions

Demonstrate mastery: a written exam covering all six modules plus an oral presentation of the group project — the design, implementation and optimisation of an HPC solution to a real-world problem, with results and a scaling analysis.

group project + final exam

Key concepts

A glossary of the course’s core terms

The recurring vocabulary of high-performance computing, gathered in one place for revision. Each term is defined in a sentence or two; most map directly to a session above and to a part of the MiniWeather companion project.

FLOP/s
Floating-point operations per second — the basic measure of compute throughput; supercomputers are rated in peta- (10¹⁵) and exa- (10¹⁸) FLOP/s.
Flynn’s taxonomy
Classification of machines by instruction/data streams: SISD, SIMD, MISD, MIMD. GPUs are SIMD-like; MPI clusters are MIMD.
Shared memory
A model where all cores access one address space; communication is implicit but synchronization and cache coherence are the cost. OpenMP’s model.
Distributed memory
A model where each node has private memory and shares data by explicit messages. Scales to thousands of nodes. MPI’s model.
Amdahl’s law
For a fixed problem with serial fraction s and parallel fraction p = 1 − s, speedup on N processors is 1/(s + p/N), which can never exceed 1/s however large N grows; this is the limit of strong scaling.
Gustafson’s law
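If the problem grows with the processor count, the scaled speedup s + (1 − s)N (with s the serial fraction of the parallel run) keeps rising with N, so useful (weak) scaling continues well past Amdahl’s fixed-size ceiling.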
Strong vs. weak scaling
Strong: fix the problem, add processors. Weak: grow the problem with the processors. Real studies report both.
MPI
Message Passing Interface — the standard library for distributed-memory parallelism, via point-to-point and collective communication.
Collective operation
An MPI call involving every rank in a communicator (broadcast, scatter/gather, reduce, all-reduce), implemented with optimised tree, ring or recursive-doubling algorithms.
Halo / ghost cells
A border of neighbour-owned cells copied into each rank’s sub-grid every step so a stencil can be applied locally. Central to MiniWeather.
OpenMP
A directive-based API for shared-memory multithreading on multicore CPUs (and, via target, GPUs).
Scheduling (OpenMP)
How loop iterations are split among threads: static (even, low overhead), dynamic/guided (load-balanced, more overhead).
Race condition
A bug where two threads access the same memory without synchronization and at least one of them writes, giving nondeterministic results; avoided with atomics, critical sections or reductions.
CUDA
NVIDIA’s model for GPU programming: a kernel runs over a grid of thread blocks, with fast shared memory per block.
Warp
A group of 32 threads on NVIDIA GPUs executing one instruction in lockstep; branch divergence and uncoalesced access within a warp hurt performance.
Coalesced access
Consecutive GPU threads reading consecutive memory addresses, letting the hardware satisfy them in one transaction — a top GPU-performance lever.
OpenACC
A directive-based alternative to CUDA that the compiler turns into accelerator code — more portable, slightly less control.
Memory hierarchy
Registers → L1/L2/L3 cache → DRAM → disk: each level larger but slower. Performance depends on keeping hot data high in it.
Data locality
Temporal (reuse before eviction) and spatial (use neighbouring cache-line data) — the basis of cache-friendly code.
Cache blocking / tiling
Restructuring loops so each working block fits in cache, cutting misses without changing the arithmetic. MiniWeather’s blocked backend.
Vectorization (SIMD)
Packing several array elements into one wide register so a single instruction processes them together.
Roofline model
A plot bounding performance by min(peak FLOP/s, bandwidth × arithmetic intensity), showing whether a kernel is compute- or memory-bound.
Arithmetic intensity
FLOPs performed per byte of memory traffic; low intensity means a kernel is memory-bound, the common case in HPC.
Checkpointing
Periodically saving application state to disk so a long job can restart after a failure instead of from the beginning.
Parallel filesystem
Storage (Lustre, GPFS, BeeGFS) that stripes files across many servers so aggregate I/O bandwidth scales with the cluster.
MapReduce
A big-data model: a parallel map over records followed by a reduce aggregation by key, with the framework handling distribution and faults.
Data vs. model parallelism
Data: replicate the model, split the batch, all-reduce gradients. Model: split a too-large model across devices (tensor/pipeline).
Hybrid (MPI+X)
Layering paradigms by hardware level — MPI between nodes, OpenMP across cores, CUDA on the GPU — the dominant exascale programming pattern.
One-sided communication (RMA)
Advanced MPI where a rank reads/writes a neighbour’s memory window without a matching receive, decoupling the two sides.
FPGA
A reconfigurable chip whose logic is tailored to one algorithm, giving high performance-per-watt for fixed, streaming dataflow.
Exascale
Systems exceeding 10¹⁸ FLOP/s (Frontier, El Capitan); the current frontier, where energy efficiency and resilience dominate design.
Neuromorphic computing
Brain-inspired, event-driven hardware running spiking neural networks with co-located compute and memory for low-energy, sparse workloads.
Qubit
A quantum bit holding a superposition of 0 and 1; n qubits span a 2ⁿ-dimensional state space, with gates as the operations and entanglement as the correlation between qubits that has no classical counterpart.
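Three of the glossary's quantitative terms (Amdahl's law, Gustafson's law and the roofline model) fit in a few lines of code, which can help when revising them. The sketch below uses the standard textbook formulas; the function names and the example machine numbers are hypothetical.

```python
# Sketch: Amdahl's law, Gustafson's law and the roofline bound as plain functions.

def amdahl_speedup(serial_fraction, n):
    """Fixed problem size: speedup on n processors, bounded by 1 / serial_fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

def gustafson_speedup(serial_fraction, n):
    """Problem grows with n: scaled speedup, with the serial fraction
    measured on the parallel run."""
    return serial_fraction + (1.0 - serial_fraction) * n

def roofline(peak_flops, bandwidth_bytes_per_s, arithmetic_intensity):
    """Attainable FLOP/s: the lower of the compute roof and the memory roof."""
    return min(peak_flops, bandwidth_bytes_per_s * arithmetic_intensity)

strong = amdahl_speedup(0.05, 1024)          # stays below 1 / 0.05 = 20
weak = gustafson_speedup(0.05, 1024)         # keeps growing with n
# Hypothetical device: 1 TFLOP/s peak, 100 GB/s memory bandwidth.
stencil_like = roofline(1e12, 1e11, 0.5)     # low intensity: memory-bound
dense_like = roofline(1e12, 1e11, 50.0)      # high intensity: compute-bound
```

The same 5% serial fraction caps strong scaling below 20× yet allows a scaled speedup near 973× on 1,024 processors, and the roofline call shows why a low-intensity kernel sees only a fraction of peak regardless of core count.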

Bibliography

Core & recommended reading

Each title is annotated with what it covers and the sessions it best supports, so the reading list doubles as a study map.

Compulsory
High Performance Computing: Modern Systems and Practices

Sterling, T., Anderson, M., Brodowicz, M., & Bell, C. G. (2018). Morgan Kaufmann, Cambridge, MA. ISBN 9780124201583.

The course’s spine: a comprehensive, accessible treatment of HPC from supercomputer anatomy and architecture through resource management, I/O and emerging hardware — fundamentals plus practical skills for domain scientists.
Sessions 1.1–1.4 · 3.3–3.4 · 5.4 · 6.1–6.3
Recommended
Parallel and High Performance Computing

Robey, R., & Zamora, Y. (2021). Manning Publications. ISBN 9781617296468.

The hands-on companion: evaluating hardware, then writing real code with OpenMP, MPI and GPUs (CUDA/OpenACC), including data layout, profiling and a GPU tsunami simulation — the most directly applicable text for the labs and the MiniWeather backends.
Sessions 1.5 · 2.1–2.5 · 3.1–3.2 · 4.1–4.2 · 5.1
Recommended
Introduction to High Performance Computing for Scientists and Engineers

Hager, G., & Wellein, G. (2019). Taylor & Francis Group. ISBN 9780367221300.

Written by HPC-centre practitioners: the sharpest treatment of single-node performance — architecture, the memory hierarchy, cache behaviour and the optimisation strategies (including the roofline mindset) that drive Module 3.
Sessions 1.2 · 2.1 · 3.1–3.3
Recommended
High Performance Python: Practical Performant Programming for Humans

Gorelick, M., & Ozsvald, I. (2021). O'Reilly Media. ISBN 9781492055020.

Optimising Python for numerical and data-intensive work — profiling, vectorisation with NumPy, multiprocessing and GPU offload — a gentle entry point to performance thinking and useful for the distributed-ML strand.
Sessions 3.1–3.2 · 4.3–4.5