High Performance Computing

Bachelor in Computer Science and Artificial Intelligence (BCSAI) · IE University · syllabus-driven course outline.

This course is a modern introduction to high-performance computing (HPC): parallel coding, distributed systems, and applications in machine and deep learning. It moves from the architecture and resource management of supercomputers, through the core parallel toolchain — MPI, OpenMP and GPU computing — to performance optimization, scalable parallel algorithms, advanced hybrid programming, and the frontiers of neuromorphic and quantum computing.

Across thirty sessions, theory alternates with hands-on practice grounded in real science: health and neurosciences, computational fluid dynamics, molecular and materials simulation, distributed ML, and climate modelling. The companion interactive demos visualize the quantitative heart of the course — Amdahl & Gustafson speedup, strong/weak scaling, the roofline model, the memory hierarchy, Flynn's taxonomy, MPI collectives, OpenMP fork–join, the GPU SIMT model, MapReduce and a qubit on the Bloch sphere.

The professor, Dr. Oscar Diez, leads the HPC & Quantum Technologies Unit at the European Commission and was involved in the EuroHPC Joint Undertaking and the procurement of LUMI, Leonardo and MareNostrum V — three of the fastest supercomputers in the world — so the material is anchored in the systems students may one day actually use.

Program code: HPC-CSAI.3.M.A
Area: Computer Science
Sessions: 30 (24 live · 6 async)
Credits: 6.0 ECTS
Academic year: 25–26
Degree course: Third
Semester: 1st

Learning objectives

The course is organised into six modules spanning both the theory and practice of HPC, with an emphasis on parallel computing, distributed systems and machine/deep learning. By the end, students will be able to:

Foundations — explain the evolution, anatomy and architecture of HPC systems, and reason about resource management, scheduling and the performance metrics that govern parallel systems.
Parallel essentials (MPI / OpenMP / GPU) — program distributed-memory systems with MPI, shared-memory systems with OpenMP, and accelerators with CUDA/OpenACC, integrating them for hybrid simulations.
Performance optimization — profile and tune code for HPC architectures, exploiting the memory hierarchy, data locality, vectorization, high-performance libraries and parallel I/O.
Parallel algorithms & ML — design scalable algorithms for big data (MapReduce), choose high-performance data structures, and train machine- and deep-learning models in a distributed setting.
Advanced parallel programming — apply hybrid-computing patterns, advanced MPI and OpenMP features, and accelerators such as FPGAs to large, real-world simulations.
Frontiers of HPC — evaluate emerging paradigms — neuromorphic and quantum computing — and the challenges of the post-exascale era, anticipating the future direction of supercomputing.

Methodology & assessment

IE University's teaching method is collaborative, active and applied. The professor guides students toward the learning objectives through a mix of lectures, in-class exercises, asynchronous field work, group work and individual study, combining theoretical lectures with hands-on practical sessions on real HPC clusters and cloud resources.

Learning activity weighting

Lectures · 40.0% · 60.0 hours
Exercises, async sessions, field work · 20.0% · 30.0 hours
Individual studying · 16.7% · 25.0 hours
Discussions · 13.3% · 20.0 hours
Group work · 10.0% · 15.0 hours
Total: 100% · 150.0 hours

Evaluation criteria

Final exam · 30% · Comprehensive, all modules (session 30)
Individual work · 15% · One assignment per module
Intermediate tests · 15% · Quizzes + intermediate exam
Workgroups · 15% · Paper + oral presentation
Mid-course exam · 15% · Written, first half of the course
Class participation · 10% · Mandatory attendance

AI policy: GenAI tools may be used for research, ideation, generating an outline, proofreading and grammar checking with appropriate acknowledgement. GenAI may not be used for assignments, coding, group submissions or exams; inappropriate use is treated as academic misconduct and may mean failing the assignment or the course. A suggested acknowledgement format is provided in the syllabus.

What each component asks for

Final exam (30%) — a comprehensive written exam at the end of the course (session 30) evaluating both theoretical and applied understanding of all six modules.
Individual work (15%) — each module includes an individual assignment applying lecture concepts to real-world problems or case studies (health/neuroscience, CFD, molecular/materials, distributed ML, climate modelling).
Intermediate tests (15%) — short quizzes administered regularly to evaluate comprehension of recent lectures and reinforce progressive learning.
Workgroups (15%) — a group project: a written paper plus an oral presentation on a selected HPC topic, presented in the final session.
Mid-course exam (15%) — a written individual exam held mid-semester covering the first half of the course (roughly modules 1–3).
Class participation (10%) — active, continuous engagement in discussions and activities; attendance is mandatory and contributes to the grade.

Pass, attendance & re-sit rules

Each student has four chances to pass a course over two consecutive academic years: ordinary-call exams plus extraordinary-call (re-sit) exams in June/July.
Students who do not meet the 80% attendance rule fail both calls for the year (ordinary and extraordinary) and must re-enrol the following year — they do not get a re-sit.
The June/July re-sit is a single comprehensive exam taken in person on campus (Segovia or Madrid); continuous evaluation is not counted, the minimum passing grade is 5, and the maximum obtainable is 8.0 ("notable"). Re-takers (3rd call) may obtain up to 10.0 and must confirm criteria with their assigned professor.
A review session follows grading; attending it is a prerequisite for any grade appeal. Students who have failed more than 18 ECTS in the academic year once the re-sits are over are asked to leave the Program.

Program — 6 modules, 30 sessions

Every session of the course, grouped by module. Tags link relevant interactive demos and key readings (Sterling = Modern Systems and Practices, Robey = Parallel and HP Computing, Hager = Intro to HPC for Scientists & Engineers, Barba = High Performance Python). Sessions 5, 10, 15, 20 and 25 are asynchronous application labs; session 30 is the exam & group presentations.

Module 1/6

Foundational Knowledge and Skills

sessions 1–5 · architecture, resources & cloud

The on-ramp to HPC. Students explore the evolution, anatomy and architecture of supercomputers, learn how parallel systems are scheduled and measured, and meet cloud-based HPC and virtualization — then, in an asynchronous lab on health and neuroscience, set up an environment and write their first parallel codes.

You will be able to

Describe the anatomy of a supercomputer and the key performance indicators (FLOP/s, the Top500/LINPACK benchmark, energy efficiency).
Classify parallel architectures using Flynn's taxonomy (SISD/SIMD/MISD/MIMD) and recognise heterogeneous, multiprocessor designs.
Reason about job scheduling, resource allocation and the metrics (speedup, efficiency, scaling) that define efficient use of a cluster.
Explain cloud HPC and virtualization, and set up an environment to run a simple parallel program.

1
Evolution and fundamentals of HPC

Opening session: course structure, assignments, grading and class dynamics, then the core concepts of high-performance computing.
- Course logistics — modules, assessment weights, the GenAI policy, group & individual assignment methodology.
- Evolution of HPC — a brief historical journey from early vector machines to today's exascale clusters.
- Anatomy of a supercomputer — nodes, sockets, cores, interconnect, accelerators, storage.
- Key performance indicators — FLOP/s, peak vs sustained, the Top500 and LINPACK benchmark.
Key idea HPC is the art of organising thousands of processing elements so that they cooperate on one problem — performance comes from parallelism, not from any single fast core.

live Sterling ch. 1–2 — introduction & HPC architecture
2
Architectural overview of HPC systems

A comprehensive look at HPC architecture: the parallel architecture families grounded in Flynn's taxonomy and the technologies that enable supercomputing today.
- Flynn's taxonomy — SISD, SIMD, MISD, MIMD classified by instruction & data streams.
- Architectures & processors — pipelined, superscalar, vector and multicore designs.
- Multiprocessors — SMP, NUMA and cluster organisation.
- Heterogeneous computing — CPU + GPU/accelerator nodes working together.
Core concept — SIMD vs MIMD: SIMD applies one instruction to many data lanes at once (vector units, GPUs); MIMD lets independent cores run different instructions on different data (clusters, multicore) — most real HPC machines combine both.

Key idea Knowing where a machine sits in Flynn's taxonomy tells you which programming model — vectorization, threads or message passing — will actually exploit it.

live Flynn demo ↗ Hager ch. 1–4 — modern architecture & parallel computers
3
Resource management and performance metrics in parallel systems

The operational side of HPC: how shared cluster resources are managed, and the metrics that guide their efficient use.
- Job scheduling — batch schedulers (e.g. Slurm), queues, backfill, fair-share.
- Resource allocation — nodes, cores, memory and walltime requested per job.
- Performance metrics — speedup $S(p)$, efficiency $E(p)=S(p)/p$, strong vs weak scaling.
- Benchmarks — how representative kernels measure a system's real capability.
Core concept — speedup & efficiency: speedup $S(p)=T_1/T_p$ compares serial run time to $p$-processor run time; efficiency $E(p)=S(p)/p$ is the fraction of the ideal $p$-fold speedup actually achieved. Both fall short of the ideal ($S=p$, $E=1$) as overhead and serial bottlenecks grow.
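As a rough illustration of the two definitions above, the sketch below computes $S(p)$ and $E(p)$ from a small table of run times. The timings are invented for the example and are not measurements from any real cluster.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """S(p) = T_1 / T_p."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, p: int) -> float:
    """E(p) = S(p) / p."""
    return speedup(t_serial, t_parallel) / p


# Hypothetical wall-clock times in seconds, keyed by processor count p.
timings = {1: 120.0, 2: 63.0, 4: 34.0, 8: 20.0, 16: 14.0}

t1 = timings[1]
for p, tp in timings.items():
    print(f"p={p:2d}  S={speedup(t1, tp):5.2f}  E={efficiency(t1, tp, p):4.2f}")
```

With these made-up numbers, speedup keeps rising while efficiency drops, which is the usual shape of a strong-scaling curve.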

Key idea On a shared cluster, "fast" means more than raw FLOP/s: getting your job scheduled, sized and scaled well is itself a performance skill.

live Scaling demo ↗ Hager ch. 5 — basics of parallelization
4
Introduction to Cloud-based HPC and Virtualization

How cloud computing has changed the accessibility and scalability of HPC: the basics of virtualization, cloud service models, and their benefits and challenges.
- Virtualization — hypervisors, VMs and containers underpinning elastic resources.
- Cloud service models — IaaS / PaaS / SaaS and "HPC-as-a-service".
- Benefits & challenges — elasticity and on-demand scale vs interconnect latency and cost.
- Integrating cloud HPC — bursting scientific workflows from on-prem to cloud.
Key idea The cloud democratises access to HPC, but virtualization adds overhead — for tightly-coupled MPI jobs the interconnect, not the CPU, is often the deciding factor between cloud and a dedicated cluster.

live Sterling ch. 12 — cloud & virtualization for HPC
5
HPC in Health & Neurosciences · Environment Setup and Simple Parallel Codes

Asynchronous application lab: setting up an HPC environment and writing simple parallel codes, with examples from health and neuroscience research. Individual and group assignments are explained here.
- Environment setup — logging into a cluster, modules, compilers and the batch system.
- First parallel codes — a "hello, ranks" MPI program and a parallel loop.
- Domain case studies — neuroimaging pipelines, genomics and biomedical simulation.
- Assignment briefing — scope and expectations for individual and group work.
Key idea Real scientific impact — faster diagnosis, larger brain simulations — is the motivation; the rest of the course is about making those codes run efficiently at scale.

async Robey ch. 1 — why parallel computing?

Module 2/6

Parallel Computing Essentials

sessions 6–10 · memory models, MPI, OpenMP, GPU

The core parallel toolchain. Students master the distinction between distributed and shared memory, then learn the three workhorse models — MPI for message passing, OpenMP for multithreading, and CUDA/OpenACC for GPUs — culminating in an asynchronous CFD lab that integrates MPI and OpenMP into one hybrid program.

You will be able to

Contrast distributed- and shared-memory architectures and pick the right model for a problem.
Write MPI programs using point-to-point and collective communication (broadcast, reduce, scatter/gather).
Parallelise loops with OpenMP directives, handling work-sharing, synchronisation and data scoping.
Offload compute-heavy kernels to a GPU with CUDA/OpenACC and reason about the SIMT execution model.
Combine MPI and OpenMP into a hybrid program for a real CFD simulation.

6
Distributed and shared memory concepts

The core principles of parallel architectures: distributed- vs shared-memory systems, their advantages, challenges and suitable application scenarios.
- Shared memory — one address space, threads communicate by reading/writing memory; needs synchronisation.
- Distributed memory — each process has private memory; communication is explicit message passing.
- Programming implications — how the memory model dictates OpenMP vs MPI vs hybrid.
- Cache coherence & NUMA — why "shared" memory is not uniformly fast.
Core concept — the memory model dictates the programming model: shared memory enables cheap implicit communication but limits scale; distributed memory scales to thousands of nodes but forces you to move data explicitly.

Key idea Almost every large HPC code is hybrid: distributed memory between nodes (MPI) and shared memory within a node (OpenMP/threads).

live Memory demo ↗ Robey ch. 2–3 — planning & the parallel landscape
7
Deep dive into MPI programming

The Message Passing Interface, the cornerstone of scalable distributed-memory computing: basic to intermediate concepts and commands for efficient parallel applications.
- The SPMD model — rank, size, MPI_Init/MPI_Finalize, communicators.
- Point-to-point — MPI_Send / MPI_Recv, blocking vs non-blocking.
- Collective operations — MPI_Bcast, MPI_Reduce, MPI_Scatter/MPI_Gather.
- Data distribution — domain decomposition and communication patterns.
Core concept — communication is the cost: in a distributed program the FLOPs are often cheap and the messages expensive, so a good MPI design minimises communication and overlaps what remains with computation.

Key idea A collective like a tree-based broadcast reaches $p$ processes in $\lceil\log_2 p\rceil$ steps instead of $p-1$ — choosing collectives over hand-rolled loops is the difference between scaling and stalling.
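The step counts quoted above can be checked with a small simulation. This is a sketch of a binomial-tree broadcast from rank 0, written in plain Python rather than MPI; the function names are illustrative and it models only which ranks hold the data after each round, not message sizes or latency.

```python
def tree_broadcast_rounds(p: int) -> int:
    """Count rounds until all p ranks hold the data in a binomial-tree broadcast."""
    has_data = [False] * p
    has_data[0] = True  # rank 0 is the root
    rounds = 0
    while not all(has_data):
        holders = [r for r in range(p) if has_data[r]]
        offset = len(holders)  # ranks 0..offset-1 hold the data at this point
        for r in holders:
            partner = r + offset  # every holder sends to one new rank per round
            if partner < p:
                has_data[partner] = True
        rounds += 1
    return rounds


def linear_broadcast_rounds(p: int) -> int:
    """Root sends to each of the other p-1 ranks one after another."""
    return p - 1


for p in (2, 8, 64, 1000):
    print(p, tree_broadcast_rounds(p), linear_broadcast_rounds(p))
```

Because the set of holders doubles every round, the tree needs $\lceil\log_2 p\rceil$ rounds, while the hand-rolled loop needs $p-1$.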

live MPI demo ↗ Robey ch. 8 — MPI: the parallel backbone
8
OpenMP for multithreading

Shared-memory parallelism with OpenMP: using directives to parallelise code for multicore processors, with work distribution, synchronisation and memory management.
- The fork–join model — #pragma omp parallel spawns a team; threads join at the end.
- Work-sharing — parallel for, schedules (static/dynamic/guided), reductions.
- Data scoping — private, shared, firstprivate and avoiding races.
- Synchronisation — critical, atomic, barrier.
Core concept — incremental parallelism: OpenMP lets you add a directive above a hot loop and parallelise it without rewriting the program — but the burden of correctly scoping shared data and avoiding races is on you.

Key idea OpenMP shines inside a single node where threads share memory; beyond the node boundary you need MPI, which is exactly why hybrid programming exists.

live OpenMP demo ↗ Robey ch. 7 — OpenMP that performs
9
Introduction to GPU Computing · OpenACC / CUDA basics

GPUs as massively parallel accelerators: GPU architecture and programming with CUDA and OpenACC, and how to develop and optimise algorithms that exploit them.
- GPU architecture — streaming multiprocessors, thousands of lightweight threads, the SIMT model.
- The CUDA model — host vs device, kernels, threads/blocks/grids, warps.
- Memory hierarchy on GPU — global, shared and register memory; coalesced access.
- OpenACC — directive-based offload as a gentler on-ramp than raw CUDA.
Core concept — SIMT & warps: a GPU executes threads in lock-step groups (warps); divergent branches within a warp serialise, so GPUs reward regular, data-parallel work and punish irregular control flow.
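To make the cost of divergence concrete, here is a toy model in Python (not CUDA). It assumes a warp size of 32, as on NVIDIA GPUs, and counts one serialized pass per distinct branch taken inside a warp; real hardware behaviour is more nuanced, so treat the numbers as qualitative.

```python
WARP_SIZE = 32  # assumption: NVIDIA-style warps of 32 threads


def warp_passes(branch_of_thread, n_threads, warp_size=WARP_SIZE):
    """Total serialized passes: each warp runs once per distinct branch its lanes take."""
    total = 0
    for start in range(0, n_threads, warp_size):
        lanes = range(start, min(start + warp_size, n_threads))
        distinct_paths = {branch_of_thread(t) for t in lanes}
        total += len(distinct_paths)
    return total


n = 256
# Divergent: odd and even lanes take different branches inside every warp.
interleaved = warp_passes(lambda t: t % 2, n)
# Coherent: the branch depends on the warp index, so each warp takes one path.
by_warp = warp_passes(lambda t: (t // WARP_SIZE) % 2, n)

print("interleaved branches:", interleaved, "passes")
print("per-warp branches:   ", by_warp, "passes")
```

Both variants do the same amount of useful work, but the interleaved layout needs twice as many passes in this model, which is why branch conditions are often arranged to be uniform within a warp.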

Key idea GPUs trade per-thread speed for sheer thread count — they win when a problem has thousands of independent, arithmetic-heavy work items and data movement to/from the device is amortised.

live GPU/SIMT demo ↗ Robey ch. 9–12 — GPU architecture & programming
10
Computational Fluid Dynamics (CFD) with HPC · MPI + OpenMP integration

Asynchronous application lab: CFD simulations using MPI and OpenMP together, providing practical experience in hybrid parallel programming.
- CFD & stencils — grid-based PDE solvers with neighbour communication (halo exchange).
- Hybrid decomposition — MPI across nodes, OpenMP threads within each node.
- Halo / ghost cells — exchanging boundary data between subdomains each timestep.
- Building & running — compiling and launching a hybrid job on the cluster.
Key idea Hybrid MPI+OpenMP matches the machine: one MPI rank per node with OpenMP threads on its cores reduces the message count and the memory footprint while keeping all cores busy.
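The halo exchange described in this session can be sketched without MPI. The example below is a serial Python stand-in: each "rank" owns a slice of a 1D grid, copies one ghost value from each neighbour, and applies a three-point averaging stencil. The stencil and function names are assumptions chosen for illustration, and every subdomain is assumed to own at least one cell.

```python
def stencil_step(u):
    """Average each interior point with its two neighbours; end points stay fixed."""
    if len(u) < 3:
        return list(u)
    interior = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def decomposed_step(u, n_parts):
    """Same update, computed per subdomain after a halo exchange."""
    n = len(u)
    bounds = [(k * n // n_parts, (k + 1) * n // n_parts) for k in range(n_parts)]
    owned = [u[lo:hi] for lo, hi in bounds]  # each "rank" owns one slice

    result = []
    for k, cells in enumerate(owned):
        has_left = k > 0
        has_right = k < n_parts - 1
        # Halo exchange: copy one boundary value from each neighbour.
        left_ghost = [owned[k - 1][-1]] if has_left else []
        right_ghost = [owned[k + 1][0]] if has_right else []
        padded = left_ghost + cells + right_ghost
        updated = stencil_step(padded)
        # Drop the ghost cells; keep only the cells this rank owns.
        start = 1 if has_left else 0
        stop = len(updated) - (1 if has_right else 0)
        result.extend(updated[start:stop])
    return result


grid = [0.0] * 8 + [1.0] * 8
print(decomposed_step(grid, 4) == stencil_step(grid))
```

In a real hybrid code the ghost copies would be MPI messages between ranks and the inner loop would be an OpenMP work-sharing loop; the data flow is the same.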

async Robey ch. 8 + Hager ch. 11 — hybrid MPI/OpenMP

Module 3/6

Performance Optimization in HPC

sessions 11–15 · profiling, memory, I/O, tuning

From "it runs in parallel" to "it runs fast". Students learn to optimise for specific architectures, profile to find bottlenecks, exploit the memory hierarchy and data locality, lean on high-performance libraries, and scale I/O with parallel filesystems — applied in an asynchronous lab on molecular and materials simulation. This is where the roofline model and Amdahl's law become daily tools, and the mid-course exam (~modules 1–3) lands.

You will be able to

Apply Amdahl's and Gustafson's laws to predict and bound the speedup of an optimisation.
Profile CPU and memory behaviour, find hotspots, and add checkpointing for fault tolerance.
Optimise for the memory hierarchy and data locality, and classify a kernel as compute- or memory-bound with the roofline model.
Use high-performance scientific libraries (BLAS/LAPACK, FFTW) and parallel I/O on parallel filesystems.

11
Optimizing code for HPC architectures

Strategies and techniques to fully exploit HPC hardware: code profiling, algorithmic choices, and the optimisation of data structures and access patterns.
- Loop transformations — unrolling, blocking/tiling, fusion for locality.
- Vectorization (SIMD) — helping the compiler use vector units; aligned, unit-stride access.
- Memory access patterns — sequential vs strided; cache-friendly data layout.
- Amdahl's law & scalability limits — the serial fraction caps achievable speedup.
Core concept — Amdahl's law: if a fraction $p$ of the work is parallelisable, speedup on $N$ processors is $S(N)=\dfrac{1}{(1-p)+p/N}$, bounded above by $\dfrac{1}{1-p}$ as $N\to\infty$ — the serial part is the wall you eventually hit.

Key idea Optimise the part that dominates: a 2× speedup on 90% of the runtime beats a 100× speedup on the remaining 10%.
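The arithmetic behind this comparison follows directly from Amdahl's formula. The sketch below reuses $S(N)=1/((1-p)+p/N)$, reading $N$ as the factor by which the affected fraction $p$ is accelerated; the helper name is illustrative.

```python
def amdahl_speedup(p: float, n: float) -> float:
    """Overall speedup when a fraction p of the runtime is made n times faster."""
    return 1.0 / ((1.0 - p) + p / n)


# 2x faster on the dominant 90% of the runtime.
big_part = amdahl_speedup(0.90, 2)
# 100x faster on the remaining 10%.
small_part = amdahl_speedup(0.10, 100)

print(f"2x on 90%   -> {big_part:.2f}x overall")
print(f"100x on 10% -> {small_part:.2f}x overall")

# Upper bound as n grows without limit: 1 / (1 - p).
print(f"limit for p=0.90: {1.0 / (1.0 - 0.90):.1f}x")
```

The first case gives roughly 1.8x overall and the second roughly 1.1x, so the modest gain on the dominant part wins.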

live Amdahl demo ↗ Roofline demo ↗ Scaling demo ↗ Hager ch. 2 + Barba — code optimization & profiling
12
Profiling and performance analysis tools · Checkpointing

Tools and methodologies for performance profiling and analysis (CPU and memory), plus checkpointing techniques to make long-running computations fault-tolerant.
- Profilers — sampling vs instrumentation; hotspot, call-graph and hardware-counter analysis.
- Finding bottlenecks — compute-bound vs memory-bound vs communication-bound.
- Checkpoint/restart — periodically saving state so a job can resume after failure.
- Optimal checkpoint interval — balancing checkpoint cost against expected rework on failure.
Core concept — measure, don't guess: intuition about where time goes is usually wrong; a profiler shows the real hotspots so effort lands where it pays off.

Key idea At scale, failure is normal: a run across thousands of nodes will hit a fault, so checkpointing turns a lost week into a lost hour.
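One common first-order answer to the checkpoint-interval question is Young's approximation, $\tau \approx \sqrt{2\,C\,M}$, where $C$ is the time to write one checkpoint and $M$ is the mean time between failures. The formula is not stated in this outline; it is brought in here as an assumption, it only holds when $C \ll M$, and the numbers below are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the best time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order wasted fraction: checkpoint writes plus expected rework after a failure."""
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


# Hypothetical job: 60 s per checkpoint, one failure per day on average.
cost, mtbf = 60.0, 24 * 3600.0
best = young_interval(cost, mtbf)

for interval in (best / 4, best, best * 4):
    waste = overhead_fraction(interval, cost, mtbf)
    print(f"interval={interval:8.0f} s  wasted fraction={waste:.3f}")
```

Checkpointing much more often than the estimate wastes time writing state; checkpointing much less often wastes time redoing lost work.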

live Hager ch. 2 + Barba — profiling tools
13
Memory hierarchy and data locality · High-performance libraries

The memory hierarchy from registers to disk, strategies to maximise data locality, and the high-performance scientific libraries that provide optimised core operations.
- The hierarchy — registers → L1/L2/L3 cache → DRAM → disk; each level is larger but slower than the one above, typically by an order of magnitude or more.
- Locality — temporal & spatial reuse; cache lines, blocking for cache.
- Roofline model — arithmetic intensity vs bandwidth/peak FLOP/s sets the ceiling.
- HP libraries — BLAS/LAPACK, FFTW, vendor math kernels — don't reinvent the wheel.
Core concept — arithmetic intensity: a kernel's intensity $I$ (FLOPs per byte moved) decides its fate: below the roofline's ridge point $I^* = \text{peak FLOP/s} / \text{bandwidth}$ it is memory-bound; above it, compute-bound.
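A minimal sketch of the roofline bound, assuming the simplest form of the model, attainable FLOP/s $= \min(\text{peak},\ I \times \text{bandwidth})$. The machine figures are hypothetical round numbers, and the example kernel is a double-precision `y = a*x + y` update counted as 2 FLOPs per 24 bytes moved.

```python
def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) where the two roofs meet."""
    return peak_flops / bandwidth


def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound: limited by memory traffic or by peak compute, whichever is lower."""
    return min(peak_flops, intensity * bandwidth)


def classify(intensity: float, peak_flops: float, bandwidth: float) -> str:
    if intensity < ridge_point(peak_flops, bandwidth):
        return "memory-bound"
    return "compute-bound"


# Hypothetical machine: 1 TFLOP/s peak, 100 GB/s memory bandwidth.
PEAK, BW = 1.0e12, 1.0e11

# y = a*x + y in double precision: 2 FLOPs, 3 x 8 bytes moved per element.
axpy_intensity = 2.0 / 24.0

print("ridge point:", ridge_point(PEAK, BW), "FLOP/byte")
print("axpy is", classify(axpy_intensity, PEAK, BW))
print("axpy bound:", attainable_flops(axpy_intensity, PEAK, BW), "FLOP/s")
```

On this imaginary machine the update can reach less than 1% of peak no matter how well the arithmetic is tuned, because the memory roof is the limit.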

Key idea Most scientific kernels are memory-bound: the win comes from moving fewer bytes (locality, reuse) rather than doing fewer FLOPs.

live Memory demo ↗ Roofline demo ↗ Hager ch. 1, 3 — memory hierarchy & data access
14
Parallel I/O and high-throughput data management · Parallel filesystems

As applications scale, I/O becomes critical: parallel I/O operations, the parallel filesystems that support high-throughput access, and techniques for managing large datasets.
- The I/O bottleneck — why naive per-process files cripple a large run.
- Parallel filesystems — Lustre, GPFS/Spectrum Scale: striping data across many storage targets.
- Parallel I/O APIs — MPI-IO, HDF5, NetCDF for coordinated collective writes.
- Data management — layout, chunking and staging for large datasets.
Core concept — collective, striped I/O: a parallel filesystem spreads one file across many storage targets, and collective I/O coordinates processes so the bandwidth aggregates instead of contending.
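The striping idea can be pictured with a simplified round-robin layout: the file is cut into fixed-size stripes, and stripe $k$ lands on storage target $k \bmod \text{stripe\_count}$. This is a sketch of the general scheme, not the exact layout logic of Lustre or GPFS, and the parameter names are chosen for illustration.

```python
def stripe_target(offset: int, stripe_size: int, stripe_count: int) -> int:
    """Index of the storage target that holds the byte at this file offset."""
    return (offset // stripe_size) % stripe_count


def bytes_per_target(file_size: int, stripe_size: int, stripe_count: int) -> list:
    """How many bytes of a contiguous file end up on each storage target."""
    totals = [0] * stripe_count
    offset = 0
    while offset < file_size:
        chunk = min(stripe_size, file_size - offset)
        totals[stripe_target(offset, stripe_size, stripe_count)] += chunk
        offset += chunk
    return totals


MIB = 1024 * 1024
# Hypothetical layout: an 8 MiB file, 1 MiB stripes, 4 storage targets.
print(bytes_per_target(8 * MIB, 1 * MIB, 4))
```

When every target holds an equal share, processes writing different stripes hit different servers, which is what lets the bandwidth add up.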

Key idea At scale, reading and writing data can cost more than computing on it — I/O is a first-class performance concern, not an afterthought.

live Sterling ch. 18 — file systems & parallel I/O
15
Molecular systems & material sciences · Performance tuning on HPC clusters

Asynchronous application lab: applying optimisation techniques to computational chemistry and physics, tuning the performance of simulation codes on HPC clusters.
- Molecular dynamics & DFT — the dominant kernels in materials simulation.
- Tuning workflow — profile → identify bound → optimise → re-measure.
- Scaling studies — measuring strong/weak scaling on the cluster.
Key idea Optimisation is iterative and evidence-driven: a change only counts when the profiler and a scaling run confirm it helped.

async Hager ch. 5–6 — performance tuning at scale

Module 4/6

Parallel Algorithms and Machine Learning

sessions 16–20 · big data, data structures, distributed ML

Where HPC meets data science. Students study scalable algorithms for big data (MapReduce), high-performance data structures and visualization, then the foundations of distributed machine learning and deep learning at scale — building to an asynchronous lab implementing a distributed ML model.

You will be able to

Design scalable parallel algorithms and express data-parallel workloads in the MapReduce model.
Choose and implement data structures that perform well in parallel/distributed settings, and visualise large results.
Distinguish data parallelism from model parallelism and reason about communication in distributed training.
Scale deep-learning training with batch-size scaling, gradient accumulation and accelerators (GPU/TPU).

16
Scalable parallel algorithms for Big Data · MapReduce

Designing algorithms that scale across many processing units, with the MapReduce programming model as a cornerstone of distributed data processing.
- Designing for scale — partitioning, load balancing and minimising data movement.
- The MapReduce model — map → shuffle → reduce; embarrassingly parallel maps, aggregated reduces.
- Data locality in big data — moving computation to the data, not the reverse.
- Beyond MapReduce — where DAG engines (Spark) extend the idea.
Core concept — map / shuffle / reduce: independent map tasks transform records in parallel, a shuffle groups them by key, and reduce aggregates each group — a pattern that scales precisely because the maps need no communication.
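The three phases can be written out in a few lines. This is a single-process word-count sketch of the pattern, not how Hadoop or Spark are implemented; in a real framework the map calls would run on different nodes and the shuffle would move data over the network.

```python
from collections import defaultdict


def map_phase(record: str):
    """Map: turn one input record into (key, value) pairs, independently of all others."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate the values of one key."""
    return key, sum(values)


def map_reduce(records):
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = map_reduce(["the map needs no communication", "the reduce needs the shuffle"])
print(counts)
```

Only `shuffle` needs to see data from more than one record, which is why the map side scales so easily.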

Key idea Scalability is designed in, not bolted on: express the computation so the parallel part needs little or no coordination and the framework handles distribution.

live MapReduce demo ↗ Robey ch. 14 — affinity & big-data scaling
17
Data structures for High Performance · Visualization

Designing data structures that maximise efficiency in parallel and distributed environments, and the principles and tools of scientific visualization.
- HP data structures — structure-of-arrays vs array-of-structures; cache- and SIMD-friendly layouts.
- Spatial & sparse structures — trees, grids, sparse matrices for big scientific data.
- Scientific visualization — turning terabyte results into images; in-situ visualization.
Core concept — layout is performance: when a loop touches only one or two fields, structure-of-arrays data streams through cache and vector units far faster than array-of-structures, because every byte loaded is used. The choice of data structure often matters more than the choice of algorithm.
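The two layouts look like this in code. The sketch uses Python's `array` module to stand in for contiguous typed storage; it shows the shape of the data only, since interpreted Python will not reproduce the cache and SIMD effects that a compiled language would. The particle fields are hypothetical.

```python
from array import array

FIELDS = ("x", "y", "mass")


def aos_to_soa(records, fields=FIELDS):
    """Convert a list of per-particle records into one contiguous typed array per field."""
    return {f: array("d", (rec[f] for rec in records)) for f in fields}


def soa_to_aos(columns, fields=FIELDS):
    """Inverse conversion, back to one record per particle."""
    n = len(columns[fields[0]])
    return [{f: columns[f][i] for f in fields} for i in range(n)]


# Array-of-structures: each particle's fields sit together.
particles_aos = [{"x": float(i), "y": 2.0 * i, "mass": 1.0} for i in range(4)]
# Structure-of-arrays: each field is its own contiguous array.
particles_soa = aos_to_soa(particles_aos)

# Summing one field: AoS visits every record, SoA reads a single array.
total_x_aos = sum(p["x"] for p in particles_aos)
total_x_soa = sum(particles_soa["x"])
print(total_x_aos, total_x_soa)
```

In a C struct of three doubles, consecutive `x` values would sit 24 bytes apart in the AoS layout and 8 bytes apart in the SoA layout, which is the difference a vector unit cares about.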

Key idea A simulation that produces data nobody can interpret has failed; visualization is how HPC results become scientific insight.

live Robey ch. 4 — data design & performance
18
Fundamentals of distributed machine learning

Foundational concepts of distributed ML: data parallelism, model parallelism, and strategies for scaling learning algorithms across many nodes.
- Data parallelism — replicate the model, split the data, synchronise gradients (all-reduce).
- Model parallelism — split a too-large model across devices.
- Parameter servers vs all-reduce — two architectures for aggregating updates.
- Frameworks — distributed training in PyTorch and TensorFlow, including Horovod-style all-reduce.
Core concept — data vs model parallelism: data parallelism splits the batch across replicas of one model and reduces gradients; model parallelism splits the model across devices when it can't fit on one — large systems combine both.

Key idea Distributed training is an HPC problem in disguise: the gradient all-reduce is a collective communication, and its cost is what limits how many GPUs you can usefully add.
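To see why averaging gradients across replicas works, consider a one-parameter model $\hat{y} = w x$ with a mean-squared-error loss. The sketch below is plain Python with a stand-in for the all-reduce; it assumes equally sized shards, and the data are synthetic.

```python
def local_gradient(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over one worker's shard."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up holding the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def data_parallel_gradient(w, xs, ys, n_workers):
    """Split the batch, compute local gradients, then average them."""
    shard = len(xs) // n_workers  # assumes len(xs) is divisible by n_workers
    local = [
        local_gradient(w, xs[k * shard:(k + 1) * shard], ys[k * shard:(k + 1) * shard])
        for k in range(n_workers)
    ]
    return all_reduce_mean(local)


xs = [float(i) for i in range(1, 9)]
ys = [3.0 * x + 1.0 for x in xs]  # synthetic targets
w = 0.5

print("full batch:", local_gradient(w, xs, ys))
print("4 workers: ", data_parallel_gradient(w, xs, ys, 4))
```

Every replica ends up with the same gradient as a single worker would compute on the full batch, so the replicas stay in sync after each update. The price is one collective per step.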

live Robey ch. 8 — collectives behind all-reduce
19
Deep learning at scale

Scaling deep-learning models on HPC: training large neural networks with gradient accumulation, batch-size scaling, and accelerators such as GPUs and TPUs.
- Large-batch training — scaling batch size and learning-rate warmup to use more devices.
- Gradient accumulation — simulating a large batch when memory is limited.
- Mixed precision — FP16/bfloat16 to boost throughput on tensor cores.
- Accelerators — GPU and TPU characteristics for training.
Core concept — strong scaling for training: adding GPUs shortens time-to-train only while communication stays small relative to compute — past that point, gradient synchronisation dominates and efficiency falls, an Amdahl's-law story in ML clothing.
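A toy cost model makes the trade-off visible. It assumes a fixed global batch, compute time that divides evenly across GPUs, a bandwidth-only ring all-reduce cost of $2\,\frac{n-1}{n}\cdot\frac{\text{bytes}}{\text{link bandwidth}}$, no latency term and no overlap of communication with compute. All numbers are hypothetical.

```python
def step_time(n_gpus, compute_s, grad_bytes, link_bytes_per_s):
    """Time per training step under the toy model described above."""
    if n_gpus == 1:
        comm = 0.0  # nothing to synchronise
    else:
        comm = 2.0 * (n_gpus - 1) / n_gpus * grad_bytes / link_bytes_per_s
    return compute_s / n_gpus + comm


def scaling_efficiency(n_gpus, compute_s, grad_bytes, link_bytes_per_s):
    """E(n) = T(1) / (n * T(n))."""
    t1 = step_time(1, compute_s, grad_bytes, link_bytes_per_s)
    tn = step_time(n_gpus, compute_s, grad_bytes, link_bytes_per_s)
    return t1 / (n_gpus * tn)


# Hypothetical job: 1 s of compute per step on one GPU,
# 400 MB of gradients, 10 GB/s effective link bandwidth.
COMPUTE, GRAD_BYTES, LINK = 1.0, 4.0e8, 1.0e10

for n in (1, 2, 8, 64):
    t = step_time(n, COMPUTE, GRAD_BYTES, LINK)
    e = scaling_efficiency(n, COMPUTE, GRAD_BYTES, LINK)
    print(f"n={n:3d}  step={t:.4f} s  efficiency={e:.2f}")
```

As the GPU count grows, the compute term shrinks towards zero while the communication term approaches a constant, so efficiency decays much like the serial fraction in Amdahl's law.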

Key idea Training a modern model is a supercomputing workload: the same scaling, memory-hierarchy and interconnect concerns from earlier modules decide whether it finishes in hours or weeks.

live GPU/SIMT demo ↗ Robey ch. 9–12 — accelerating with GPUs
20
Implementing a Distributed ML Model

Asynchronous application lab: applying distributed-ML and deep-learning-at-scale concepts to implement a distributed machine-learning model end to end.
- Project scaffold — dataset, model, and a multi-GPU/multi-node training loop.
- Distributed data loading — sharding the dataset across workers without overlap.
- Synchronisation & checkpoints — gradient all-reduce and saving model state.
- Measuring scaling efficiency — samples/sec vs number of devices.
Key idea This is the synthesis lab of the module: parallel algorithms, the memory hierarchy, collectives and accelerators all come together to train one model faster.
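Sharding without overlap can be as simple as giving each worker every `world_size`-th sample. The sketch below uses the names `rank` and `world_size` by analogy with common distributed-training frameworks, but it is standalone Python rather than any framework's API. It does not shuffle, and it does not pad or drop samples to equalise shard sizes.

```python
def shard_indices(n_samples: int, rank: int, world_size: int) -> list:
    """Indices of the samples assigned to one worker (strided, non-overlapping)."""
    return list(range(rank, n_samples, world_size))


n_samples, world_size = 10, 3
shards = [shard_indices(n_samples, rank, world_size) for rank in range(world_size)]

for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
```

Each sample appears in exactly one shard, so one pass over all shards is one pass over the dataset. When the sample count is not divisible by the worker count, shard sizes differ by at most one.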

async Robey + framework docs — distributed training

Module 5/6

Advanced Parallel Programming

sessions 21–25 · hybrid patterns, advanced MPI/OpenMP, FPGAs

The expert tier. Students study patterns for hybrid computing, advanced features of MPI (one-sided, dynamic processes) and OpenMP (tasks, SIMD, device offload), and accelerators such as FPGAs — consolidating it all in an asynchronous lab on climate and terrestrial-systems modelling with hybrid MPI/OpenMP.

You will be able to

Combine MPI, OpenMP and GPU computing using established hybrid-computing patterns.
Use advanced MPI: dynamic process management, one-sided (RMA) communication and persistent requests.
Use advanced OpenMP: task-based parallelism, SIMD directives and device (GPU) memory management.
Assess FPGAs and other specialised processors and pick the right accelerator for a workload.

21
Patterns in hybrid computing

The world of hybrid computing, where different parallel paradigms coexist within one application: common patterns combining MPI with OpenMP and GPU computing.
- MPI + OpenMP — ranks across nodes, threads within; mapping to the machine topology.
- MPI + GPU — one rank drives a device; GPU-aware MPI for direct transfers.
- Overlap patterns — hiding communication behind computation.
- Choosing a pattern — matching the paradigm mix to problem and hardware.
Core concept — match the model to the machine: modern nodes are hierarchical (many cores, one or more GPUs, a fast interconnect), so the fastest codes use a mix of paradigms, each at the level it fits best.

Key idea There is no single parallel model that wins everywhere — fluency means composing MPI, OpenMP and GPU code into one coherent program.

live Hager ch. 11 — hybrid parallelization patterns
22
Advanced MPI

Advanced MPI features and techniques: dynamic process management, one-sided communications, and persistent communication requests for optimising large-scale applications.
- One-sided / RMA — MPI_Put/MPI_Get into remote memory windows.
- Dynamic processes — MPI_Comm_spawn to launch additional processes at runtime.
- Persistent requests — pre-initialised sends/recvs that cut per-message overhead.
- Communicators & topologies — Cartesian/graph topologies that match the algorithm.
Core concept — one-sided communication: RMA lets one process read or write another's memory without the remote process explicitly participating, decoupling communication from synchronisation and exposing more overlap.

Key idea At extreme scale, shaving per-message overhead and overlapping communication with computation is where the last factors of performance hide.

live Robey ch. 8 — advanced MPI
23
Advanced OpenMP

Advanced OpenMP: task-based parallelism, SIMD directives, and managing device memory in the context of GPU accelerators.
- Task parallelism — #pragma omp task for irregular, recursive and unbalanced work.
- SIMD directives — #pragma omp simd to assert that a loop is safe to vectorize and request SIMD code from the compiler.
- Device offload — target regions and map clauses for GPUs.
- NUMA-aware threading — first-touch placement and thread affinity.
Core concept — tasks vs loops: work-sharing loops fit regular, countable iterations; OpenMP tasks express irregular or recursive parallelism (tree walks, producer/consumer) that loops cannot.

Key idea Modern OpenMP is no longer just "parallel for": tasks, SIMD and target offload let one directive set span CPU vectorization through GPU acceleration.

live Robey ch. 7 — advanced OpenMP
24
Accelerators: FPGAs and specialized processors

Field-Programmable Gate Arrays and other specialised processors as HPC accelerators: their architecture, programming models, and how to choose the right accelerator for a workload.
- FPGA architecture — configurable logic blocks wired into a custom datapath per problem.
- Programming FPGAs — HLS (C/OpenCL → hardware) vs traditional HDL.
- Other specialised processors — TPUs, DSPs, AI/inference ASICs.
- Choosing an accelerator — throughput, latency, power and flexibility trade-offs.
Core concept — spatial vs temporal computing: a CPU/GPU streams data through fixed units over time; an FPGA lays the computation out in space as a custom pipeline, trading flexibility for extreme efficiency on the right problem.

Key idea The future is heterogeneous: matching each kernel to the accelerator that suits it (GPU, FPGA, TPU) is becoming a core HPC skill.

live Sterling ch. 15 — accelerator architectures
25
Climate and terrestrial systems modelling · Hybrid MPI/OpenMP

Asynchronous application lab: a real climate/terrestrial-systems modelling problem, integrating MPI and OpenMP (and possibly GPUs or FPGAs) to enhance performance.
- Earth-system models — coupled atmosphere/ocean/land grids, a flagship HPC workload.
- Hybrid implementation — MPI domains + OpenMP threads, optional accelerator offload.
- Scaling & validation — measuring scaling and verifying physical correctness.
Key idea Climate modelling is among the most demanding HPC applications — getting it to scale ties together every advanced technique in the module and shows why exascale matters.

async Hager ch. 11 + Robey — large-scale hybrid codes

Module 6/6

Frontiers of HPC

sessions 26–30 · neuromorphic, quantum, post-exascale, exam

The horizon. The course closes by looking beyond conventional supercomputing to neuromorphic and quantum computing and the challenges of the post-exascale era, then consolidates everything in a pre-exam review before the final exam and group presentations.

You will be able to

Explain neuromorphic computing and where brain-inspired hardware is efficient.
State the basics of quantum computing (qubits, gates, entanglement) and where quantum algorithms could matter.
Discuss the technical and scientific challenges of moving beyond exascale.
Synthesise the whole course and demonstrate mastery in the final exam and group project.

26
Neuromorphic computing

Brain-inspired computing: the design and application of neuromorphic systems that mimic neural architectures for high efficiency on specific tasks such as sensory processing and ML.
- Spiking neural networks — event-driven computation instead of clocked dense math.
- Neuromorphic hardware — Loihi, SpiNNaker and similar architectures.
- Where it wins — ultra-low-power sensory and edge inference.
- Limitations — programmability and a still-maturing software stack.
Core concept — compute like a brain: neuromorphic chips co-locate memory and computation and fire only on events (spikes), sidestepping the von Neumann bottleneck that limits conventional processors on certain workloads.
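
To make the event-driven idea concrete, here is a minimal sketch of a leaky integrate-and-fire neuron, a common textbook abstraction of a spiking unit. It is not taken from the course materials: the function name `lif_spike_times` and the default `threshold`, `leak` and `reset` values are illustrative assumptions, and real neuromorphic chips implement richer dynamics directly in hardware.

```python
def lif_spike_times(input_current, threshold=1.0, leak=0.9, reset=0.0):
    """Simulate one leaky integrate-and-fire neuron (illustrative parameters).

    Each time step the membrane potential decays by `leak`, then adds the
    incoming current. When it reaches `threshold` the neuron emits a spike
    (an event) and the potential returns to `reset`.
    """
    potential = reset
    spikes = []
    for step, current in enumerate(input_current):
        potential = leak * potential + current
        if potential >= threshold:
            spikes.append(step)   # the only moment downstream work is triggered
            potential = reset
    return spikes


# Hypothetical inputs: a steady weak stimulus versus silence.
print(lif_spike_times([0.3] * 8))
print(lif_spike_times([0.0] * 8))
```

With silent input the neuron produces no events at all, which is the property the hardware exploits: computation, and therefore energy, is spent only when spikes occur.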

Key idea Energy, not just speed, is the next frontier — neuromorphic designs target orders-of-magnitude efficiency on the tasks they fit.

live Sterling ch. 20 — future & emerging architectures
27
Quantum computing fundamentals and applications

The basics of quantum computing — qubits, quantum gates and entanglement — and quantum algorithms with applications in cryptography, optimization and simulation.
- Qubits & superposition — a state $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$ with $|\alpha|^2+|\beta|^2=1$.
- Quantum gates — reversible unitary operations: single-qubit gates such as Hadamard are rotations of the Bloch sphere, while two-qubit gates such as CNOT can create entanglement.
- Entanglement — correlations with no classical analogue.
- Algorithms & applications — Shor (factoring), Grover (search), quantum simulation.
Core concept — exponential state space: $n$ qubits span a $2^n$-dimensional state, and gates act on all amplitudes at once — the source of potential quantum advantage for problems like factoring and simulation.
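
A single qubit can be simulated in a few lines of plain Python, which may help connect the amplitudes $\alpha$ and $\beta$ to measurement probabilities. The helper names (`apply_gate`, `probabilities`, `state_dimension`) are assumptions made for this sketch and do not belong to any quantum SDK.

```python
import math

# Hadamard gate as a 2x2 matrix stored as a list of rows.
INV_SQRT2 = 1.0 / math.sqrt(2.0)
HADAMARD = [[INV_SQRT2, INV_SQRT2],
            [INV_SQRT2, -INV_SQRT2]]


def apply_gate(gate, state):
    """Multiply a 2x2 gate by a 2-component state vector [alpha, beta]."""
    alpha, beta = state
    return [gate[0][0] * alpha + gate[0][1] * beta,
            gate[1][0] * alpha + gate[1][1] * beta]


def probabilities(state):
    """Measurement probabilities |alpha|^2 and |beta|^2."""
    return [abs(amplitude) ** 2 for amplitude in state]


def state_dimension(n_qubits):
    """Number of complex amplitudes needed to describe n qubits."""
    return 2 ** n_qubits


zero = [1 + 0j, 0j]                 # the basis state |0>
plus = apply_gate(HADAMARD, zero)   # equal superposition of |0> and |1>
print(probabilities(plus))          # both entries are approximately 0.5
print(state_dimension(50))          # amplitudes a classical simulator would store
```

The last line hints at why classical simulation runs out of road: 50 qubits already need roughly $10^{15}$ amplitudes.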

Key idea Quantum won't replace classical HPC; it is a complementary accelerator for a narrow class of problems — exactly the EuroHPC-style hybrid quantum/classical vision this course is anchored in.

live Qubit demo ↗ Sterling ch. 20 — quantum & beyond
28
Advanced computing: toward the post-exascale era

Looking past the exascale milestone: emerging architectures, software paradigms, and the integration of AI and big-data analytics into HPC workflows.
- What exascale means — $10^{18}$ FLOP/s, reached by Frontier-class systems; LUMI, Leonardo and MareNostrum V are EuroHPC pre-exascale machines.
- Post-exascale challenges — power, reliability, data movement and programmability walls.
- AI + HPC convergence — "HPC-AI" workflows blending simulation and learning.
- Future software paradigms — performance portability across diverse accelerators.
Core concept — the walls of scale: beyond exascale, progress is limited less by raw transistors than by power, resilience and the cost of moving data — so the next gains come from architecture and software, not clock speed.

Key idea The future of supercomputing is heterogeneous and AI-infused — anticipating that direction is exactly the competency this course is built to give.

live Sterling ch. 19–20 — exascale & beyond
29
Review session — pre-exam

A course-wide review consolidating the six modules: clarifying doubts, discussing challenging topics, and synthesising the material in preparation for the final exam and group assignments.
- Cross-module synthesis — how architecture, parallelism, optimisation and ML connect.
- Worked-example review — Amdahl/Gustafson, roofline, scaling, collectives, AMAT.
- Q&A and exam guidance — format, scope and study strategy.
Key idea The course is a single arc from "what is a supercomputer" to "where is computing going" — the review is where the through-line becomes clear.

live revision
30
Exam and Group Assignments

The final session: a comprehensive written exam plus the presentation of group projects, where students design, implement and optimise HPC solutions for real-world problems.

Assessment The final exam is the single largest component (30%) and the group presentation completes the 15% workgroups grade. The June/July re-sit, if needed, is one comprehensive exam graded out of a maximum of 8.0.

live final exam · 30% group presentation · 15%

Worked examples

Short, hand-computed examples for the quantitative core of the course — the speedup laws, parallel efficiency, the roofline model, the memory hierarchy and collective communication. These are exactly the calculations the demos animate and the exam asks for. (FLOP/s = floating-point operations per second; B = bytes.)

scaling Amdahl's law — a 95%-parallel code

A program spends 95% of its serial runtime in parallelisable work ($p = 0.95$, serial fraction $1-p = 0.05$). What speedup do 16 cores give, and what is the hard ceiling as the core count grows without bound? Amdahl's law: $$S(N) = \frac{1}{(1-p) + \dfrac{p}{N}}.$$

16 cores: $S(16) = \dfrac{1}{0.05 + 0.95/16} = \dfrac{1}{0.05 + 0.0594} = \dfrac{1}{0.1094} \approx \mathbf{9.14\times}$
limit: as $N\to\infty$, $p/N \to 0$, so $S_{\max} = \dfrac{1}{1-p} = \dfrac{1}{0.05} = \mathbf{20\times}$
reality check: 16 cores already reach $9.14/20 \approx 46\%$ of the absolute ceiling

Even with infinite hardware this code can never exceed 20×, because 5% of the work is stubbornly serial. Amdahl's law is the pessimist's view: for a fixed problem, the serial fraction dominates as $N$ grows. This is why session 11 stresses attacking the serial bottleneck before buying more cores.
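
The same calculation can be scripted to explore other core counts. This is a minimal sketch of the formula above; the function names `amdahl_speedup` and `amdahl_ceiling` are chosen here for illustration.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Fixed-size speedup S(N) = 1 / ((1 - p) + p / N)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def amdahl_ceiling(parallel_fraction):
    """Limit of S(N) as N grows without bound: 1 / (1 - p)."""
    if parallel_fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - parallel_fraction)


p = 0.95
for cores in (1, 16, 256, 4096):
    print(f"N={cores:5d}  S={amdahl_speedup(p, cores):6.2f}")
print(f"ceiling = {amdahl_ceiling(p):.1f}")
```

Running the loop shows the curve flattening: each further multiplication of the core count buys less speedup as $S(N)$ approaches the ceiling.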

scaling Gustafson's law — the optimist's contrast

Amdahl fixes the problem size; Gustafson asks instead how big a problem you can solve in fixed time if more processors let you scale the work up (weak scaling). With serial fraction $s = 1-p$ measured on the parallel run, the scaled speedup is $$S(N) = N - s\,(N - 1) = s + p\,N.$$

same 5% serial, 16 procs: $S(16) = 16 - 0.05\,(16 - 1) = 16 - 0.75 = \mathbf{15.25\times}$
compare Amdahl: fixed-size Amdahl gave only $9.14\times$ on the same 16 cores
at 1024 procs: $S = 1024 - 0.05(1023) = 1024 - 51.15 \approx \mathbf{972.9\times}$ — grows ~linearly

Because larger machines are usually used to solve larger problems (finer grids, more particles), Gustafson's near-linear scaled speedup is often the realistic story for HPC — the serial fraction stays a small constant rather than a growing share. Strong scaling (fixed problem) follows Amdahl; weak scaling (growing problem) follows Gustafson.
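
A short sketch of the scaled-speedup formula, using the serial fraction $s$ as defined above; the function name `gustafson_speedup` is an assumption of this example.

```python
def gustafson_speedup(serial_fraction, procs):
    """Scaled speedup S(N) = N - s * (N - 1), equivalently s + (1 - s) * N."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if procs < 1:
        raise ValueError("procs must be at least 1")
    return procs - serial_fraction * (procs - 1)


s = 0.05
for procs in (16, 1024):
    print(f"N={procs:5d}  scaled speedup={gustafson_speedup(s, procs):8.2f}")
```

Because the result is linear in $N$ with slope $1 - s$, doubling the machine roughly doubles the scaled speedup, in contrast to the saturating Amdahl curve.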

metrics Parallel efficiency & the Karp–Flatt serial fraction

A code is measured at speedup $S = 12$ on $p = 16$ processors. What is its parallel efficiency, and what serial fraction does that imply? The Karp–Flatt metric recovers the experimentally determined serial fraction $e$ from a measured speedup: $$E = \frac{S}{p}, \qquad e = \frac{1/S - 1/p}{1 - 1/p}.$$

efficiency: $E = 12/16 = 0.75 = \mathbf{75\%}$ — each processor does 75% useful work
Karp–Flatt: $e = \dfrac{1/12 - 1/16}{1 - 1/16} = \dfrac{0.08333 - 0.0625}{0.9375} = \dfrac{0.02083}{0.9375} \approx \mathbf{0.0222}$
interpret: ~2.2% of the work behaves as serial/overhead at this scale

Karp–Flatt is diagnostic: if $e$ rises as you add processors, the loss is growing overhead (communication, load imbalance), not just an irreducible serial section. A flat, small $e$ with falling efficiency points instead at Amdahl's fixed serial fraction. This is the first thing to compute after a disappointing scaling run (session 12).
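
The two metrics are easy to compute from a scaling run. The sketch below uses hypothetical function names (`parallel_efficiency`, `karp_flatt`) and, as a sanity check, feeds in synthetic speedups generated from Amdahl's law with a 5% serial part; these are not measured data.

```python
def parallel_efficiency(speedup, procs):
    """E = S / p."""
    return speedup / procs


def karp_flatt(speedup, procs):
    """Experimentally determined serial fraction e = (1/S - 1/p) / (1 - 1/p)."""
    if procs < 2:
        raise ValueError("the Karp-Flatt metric needs at least 2 processors")
    return (1.0 / speedup - 1.0 / procs) / (1.0 - 1.0 / procs)


# The worked example: S = 12 on 16 processors.
print(parallel_efficiency(12, 16))
print(round(karp_flatt(12, 16), 4))

# Synthetic check: pure Amdahl behaviour with a 5% serial part.
for procs in (2, 4, 8, 16, 64):
    amdahl_speedup = 1.0 / (0.05 + 0.95 / procs)
    print(procs, round(karp_flatt(amdahl_speedup, procs), 4))
```

For the synthetic Amdahl data the metric comes back at 0.05 for every processor count, which is the "flat $e$" signature described above; on real measurements a rising $e$ would point to growing overhead instead.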

roofline Roofline ridge point — memory- vs compute-bound

A node peaks at 2 TFLOP/s = $2\times10^{12}$ FLOP/s with 200 GB/s = $200\times10^{9}$ B/s of memory bandwidth. The roofline caps attainable performance at $\min(\text{peak FLOP/s},\; I \times \text{bandwidth})$, where arithmetic intensity $I$ is FLOPs per byte moved. The ridge point is where the two ceilings meet: $$I^{*} = \frac{\text{peak FLOP/s}}{\text{bandwidth}}.$$

ridge: $I^{*} = \dfrac{2\times10^{12}}{200\times10^{9}} = \mathbf{10}$ FLOP/byte
kernel A — DAXPY ($a x + y$): 2 FLOPs per ~24 B moved per element (two 8-byte loads, one 8-byte store), $I \approx 0.083$ → far left of ridge → memory-bound; ceiling $= \tfrac{2}{24} \times 200\,\text{GB/s} \approx \mathbf{16.7\ \text{GFLOP/s}}$ (≈0.8% of peak)
kernel B — dense GEMM: $I \approx 30$ FLOP/byte → right of ridge → compute-bound, can approach the 2 TFLOP/s roof

Below $I^{*}=10$ a kernel is starved by bandwidth no matter how fast the cores are — the fix is to move fewer bytes (blocking, reuse, better layout), not to add FLOPs. Above it, the cores are the limit. Most scientific kernels live to the left of the ridge, which is why sessions 13–14 focus relentlessly on the memory hierarchy and data locality.
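
The roofline bound is a one-line `min`, so it is straightforward to tabulate for the node in this example. The names `ridge_point` and `attainable_flops` are assumptions of this sketch, and the kernel intensities are the approximate values used above.

```python
PEAK_FLOPS = 2e12     # 2 TFLOP/s
BANDWIDTH = 200e9     # 200 GB/s


def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity (FLOP/byte) where the two ceilings meet."""
    return peak_flops / bandwidth


def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline bound: min(peak FLOP/s, intensity * bandwidth)."""
    return min(peak_flops, intensity * bandwidth)


kernels = {"DAXPY": 2 / 24, "dense GEMM": 30.0}
ridge = ridge_point(PEAK_FLOPS, BANDWIDTH)
print(f"ridge point = {ridge:.1f} FLOP/byte")
for name, intensity in kernels.items():
    bound = attainable_flops(intensity, PEAK_FLOPS, BANDWIDTH)
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{name:10s} I={intensity:6.3f}  bound={bound / 1e9:8.1f} GFLOP/s  {regime}")
```

Raising a kernel's intensity (for example through cache blocking) moves it to the right along the sloped part of the roof, which is the only way a memory-bound kernel can gain.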

memory Average memory access time for a 3-level cache

A core has L1 (1 cycle), L2 (12 cycles), L3 (40 cycles) and DRAM (200 cycles). Local miss rates are L1 10%, L2 5%, L3 2%. Average memory access time chains the hierarchy: $$\text{AMAT} = t_{L1} + m_{L1}\big(t_{L2} + m_{L2}\,(t_{L3} + m_{L3}\,t_{\text{DRAM}})\big).$$

innermost: $t_{L3} + m_{L3}\,t_{\text{DRAM}} = 40 + 0.02 \times 200 = 40 + 4 = 44$ cycles
L2 level: $t_{L2} + m_{L2}\times 44 = 12 + 0.05 \times 44 = 12 + 2.2 = 14.2$ cycles
L1 level: $\text{AMAT} = 1 + 0.10 \times 14.2 = 1 + 1.42 = \mathbf{2.42}$ cycles

Despite DRAM costing 200 cycles, good locality keeps the average access near 2.4 cycles because most accesses are caught in L1. Halving the L1 miss rate to 5% drops AMAT to $1 + 0.05\times14.2 = \mathbf{1.71}$ cycles, about 29% lower with no change to the hardware: concrete evidence that improving locality can pay off more than raw clock speed, the central lesson of session 13.
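
The nested formula folds naturally from the innermost level outward. A minimal sketch, assuming local miss rates as in the example; the function name `amat` and its argument layout are choices made here.

```python
def amat(hit_times, local_miss_rates, memory_time):
    """Average memory access time for a cache hierarchy.

    hit_times and local_miss_rates are listed from L1 outward; memory_time
    is the DRAM latency. The loop starts at the last cache level and works
    back toward L1, mirroring the nested formula.
    """
    if len(hit_times) != len(local_miss_rates):
        raise ValueError("need one miss rate per cache level")
    penalty = memory_time
    for hit, miss in zip(reversed(hit_times), reversed(local_miss_rates)):
        penalty = hit + miss * penalty
    return penalty


baseline = amat([1, 12, 40], [0.10, 0.05, 0.02], 200)
better_l1 = amat([1, 12, 40], [0.05, 0.05, 0.02], 200)
print(f"baseline AMAT      = {baseline:.2f} cycles")
print(f"halved L1 miss rate = {better_l1:.2f} cycles")
```

Changing one entry of `local_miss_rates` at a time is a quick way to see which level of the hierarchy dominates the average.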

MPI Collective broadcast — tree vs linear steps

A root must send the same message to all other processes. A naive linear broadcast sends it one process at a time; a tree (recursive-doubling) broadcast doubles the number of senders each step. For $p$ processes the step counts are $$T_{\text{linear}} = p - 1, \qquad T_{\text{tree}} = \lceil \log_2 p \rceil.$$

processes p	linear (p−1)	tree ⌈log₂p⌉	speedup
8	7	3	2.3×
64	63	6	10.5×
1024	1023	10	102×
1,048,576	1,048,575	20	~52,000×

The tree turns a linear $O(p)$ cost into a logarithmic $O(\log p)$ one: at a million ranks it is the difference between 20 steps and a million. This is precisely why session 7 insists on using the MPI collectives (MPI_Bcast, MPI_Reduce), whose library implementations already use tuned tree-based algorithms, rather than hand-coding a loop of point-to-point sends.
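
The step counts in the table can be reproduced by simulating who sends to whom. The sketch below assumes a binomial-tree schedule rooted at rank 0, which is one common way to realise the doubling pattern; it is a model for counting steps, not how any particular MPI library implements MPI_Bcast, and the function names are illustrative.

```python
def linear_steps(procs):
    """Root sends to every other process, one at a time."""
    return procs - 1


def tree_steps(procs):
    """ceil(log2(procs)) computed exactly with integer arithmetic."""
    return (procs - 1).bit_length()


def binomial_broadcast_schedule(procs):
    """Return one list of (sender, receiver) rank pairs per step.

    Ranks 0..have-1 already hold the message; in each step every holder
    sends to the rank `have` positions above it, if that rank exists.
    """
    schedule = []
    have = 1
    while have < procs:
        step = [(src, src + have) for src in range(have) if src + have < procs]
        schedule.append(step)
        have = min(procs, 2 * have)
    return schedule


for step, pairs in enumerate(binomial_broadcast_schedule(8), start=1):
    print(f"step {step}: {pairs}")

for procs in (8, 64, 1024, 1_048_576):
    print(procs, linear_steps(procs), tree_steps(procs))
```

Each rank other than the root appears exactly once as a receiver, so the tree delivers the same data as the linear loop while using far fewer sequential steps.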

Key concepts

A quick-reference glossary of the core terms used across the course, in roughly the order they appear. Useful for revision before the mid-course exam and the final.

Speedup S(p): Ratio of serial to parallel runtime, $S = T_1/T_p$; how many times faster $p$ processors finish the job.
Parallel efficiency: $E = S/p$ — the fraction of each processor's time spent on useful work; falls as overhead grows.
Amdahl's law: Fixed-size speedup $S(N)=1/((1-p)+p/N)$, capped at $1/(1-p)$ by the serial fraction.
Gustafson's law: Scaled (weak-scaling) speedup $S(N)=N - s(N-1)$; near-linear when the problem grows with $N$.
Karp–Flatt metric: Experimentally determined serial fraction $e$ recovered from a measured speedup; diagnoses overhead vs serial limits.
Strong vs weak scaling: Strong: fixed problem, more processors (Amdahl). Weak: problem grows with processors, fixed work each (Gustafson).
Flynn's taxonomy: Classifies machines by instruction/data streams: SISD, SIMD, MISD, MIMD.
SIMD / MIMD: One instruction over many data lanes (vector/GPU) vs many independent instruction streams (multicore/cluster).
Shared vs distributed memory: One address space with implicit communication vs private per-process memory with explicit message passing.
MPI: Message Passing Interface — the standard for distributed-memory parallelism via point-to-point and collective communication.
Collective operation: A communication involving a whole group — broadcast, reduce, scatter/gather, all-reduce.
OpenMP: A directive-based API for shared-memory multithreading on multicore CPUs.
Fork–join: OpenMP's model: a parallel region forks a team of threads that rejoin at the end.
CUDA / SIMT: NVIDIA's GPU programming model; under SIMT (single instruction, multiple threads), groups of threads execute the same instruction in lock-step.
Warp: The group of GPU threads (typically 32) scheduled together; branch divergence within a warp serialises the divergent paths.
Arithmetic intensity: FLOPs performed per byte of memory moved; determines whether a kernel is compute- or memory-bound.
Roofline model: Plots attainable FLOP/s vs intensity; the ridge point $I^{*}=\text{peak}/\text{bandwidth}$ separates memory- from compute-bound.
AMAT: Average memory access time — the hierarchy's effective latency, with each level's latency weighted by the miss rates of the levels above it.
Data locality: Reusing data already in cache (temporal) and accessing neighbours (spatial) to cut slow memory traffic.
Checkpointing: Periodically saving program state so a long run can restart after a node failure.
MapReduce: A big-data model: parallel map over records, a shuffle by key, then reduce aggregation.
Data vs model parallelism: Split the batch across model replicas (sync gradients) vs split one large model across devices.
FPGA: Field-Programmable Gate Array — reconfigurable logic forming a custom spatial datapath per problem.
Neuromorphic computing: Brain-inspired, event-driven hardware (spiking neural networks) targeting extreme energy efficiency.
Qubit: A quantum bit, $\alpha|0\rangle+\beta|1\rangle$; $n$ qubits span a $2^n$-dimensional state space.
Exascale: $10^{18}$ FLOP/s; the performance class of Frontier-era machines. LUMI, Leonardo and MareNostrum V are pre-exascale systems.

Bibliography

The syllabus reading list (all available digitally) — one compulsory text plus three recommended. Each entry notes which sessions it best supports.

High Performance Computing: Modern Systems and Practices compulsory

Thomas Sterling, Matthew Anderson, Maciej Brodowicz & Gordon Bell · Morgan Kaufmann, 2018 · ISBN 9780124201583 (Digital)

A comprehensive, accessible treatment of HPC covering fundamental concepts and essential skills — from system architecture through parallel I/O to the future of the field. The primary text and the spine of the architecture, systems and frontiers material.

supports sessions 1–2, 4, 14, 24, 26–28 (Modules 1, 3, 5, 6)
Parallel and High Performance Computing recommended

Robert Robey & Yuliana Zamora · Manning Publications, 2021 · ISBN 9781617296468 (Digital)

A hands-on guide to boosting code effectiveness: evaluating hardware, working with industry-standard tools (OpenMP, MPI), choosing high-performance data structures and algorithms, and running real GPU simulations. The go-to reference for the parallel-programming and ML modules.

supports sessions 5–10, 16–23 (Modules 2, 4, 5)
Introduction to High Performance Computing for Scientists and Engineers recommended

Georg Hager & Gerhard Wellein · Taylor & Francis Group, 2019 · ISBN 9780367221300 (Digital)

Written by HPC practitioners, a solid introduction to mainstream architecture, dominant parallel programming models and optimization strategies for scientific computing — strongest on the memory hierarchy, optimization and hybrid programming.

supports sessions 2–3, 11–15, 21, 25 (Modules 1, 3, 5)
High Performance Python: Practical Performant Programming for Humans recommended

Micha Gorelick & Ian Ozsvald (listed in the syllabus as Barba, L. A. & Forsyth, G.) · O'Reilly Media, 2021 · ISBN 9781492055020 (Digital)

Focuses on optimizing Python for performance, with emphasis on numerical algorithms and data-intensive applications — a practical resource for profiling and accelerating the kind of Python used in scientific and ML workflows.

supports sessions 11–12, 19–20 (profiling, optimization & distributed ML in Python)