Example project — parallelising a scientific workload, end to end
This page is a worked, start-to-finish HPC case study: it takes one concrete scientific kernel and carries it through every stage the syllabus's group project asks for — "design, implementation, and optimization of HPC solutions for real-world problems". We pick a problem that is relatable, physically meaningful, and structurally rich enough to exercise the whole course: a 2-D heat-diffusion simulation solved with a 5-point Jacobi stencil on a regular grid.
The kernel is ideal because it is embarrassingly parallel at the level of a single cell, yet has real communication structure the moment you split the grid across processors: each tile needs a one-cell-thick border of its neighbours' data (the halo) every step. That single fact is what turns a toy into a genuine HPC problem — it forces us through Amdahl's law, OpenMP threading, MPI domain decomposition and halo exchange, GPU offload, the roofline model, and even the distributed-ML module, because a gradient all-reduce is the same communication pattern as a halo exchange. Every stage below cross-links to the matching live demo on the demos page.
The nine stages of the project, each a section below — with its live demo anchor:
1 · Problem & baseline
Heat spreads through a plate over time. The governing PDE is the heat equation, which says the rate of change of temperature at a point is proportional to how curved the temperature field is there (its Laplacian):
$$ \frac{\partial u}{\partial t} = \alpha \nabla^2 u = \alpha\!\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) $$
We discretise space into an $N \times N$ grid with spacing $h$ and march in time with step $\Delta t$. Replacing each second derivative with the standard central difference and choosing the timestep so the scheme is the simplest stable explicit one ($\alpha\,\Delta t / h^2 = \tfrac14$) collapses the update to a clean average of the four orthogonal neighbours — the 5-point Jacobi stencil:
$$ u^{t+1}_{i,j} = \tfrac14\left(u^{t}_{i-1,j} + u^{t}_{i+1,j} + u^{t}_{i,j-1} + u^{t}_{i,j+1}\right) $$
Every cell at step $t{+}1$ reads only the previous step's values, so the entire grid can be updated independently — the source of all the parallelism we are about to exploit. We keep two buffers and swap them (ping-pong) so reads and writes never alias. Convergence is declared when the largest change between steps, the residual $r = \max_{i,j}\,|u^{t+1}_{i,j} - u^{t}_{i,j}|$, falls below a tolerance.
jacobi.py — serial NumPy baseline
import numpy as np def jacobi(u, tol=1e-4, max_iter=20000): # u: (N, N) float64 grid; fixed Dirichlet values on the border. new = u.copy() for it in range(max_iter): # vectorised 5-point average of the interior cells new[1:-1, 1:-1] = 0.25 * ( u[:-2, 1:-1] + u[2:, 1:-1] + # north + south u[1:-1, :-2] + u[1:-1, 2:]) # west + east res = np.max(np.abs(new[1:-1, 1:-1] - u[1:-1, 1:-1])) u, new = new, u # ping-pong the buffers if res < tol: break return u, it, res
Where the time goes. Before optimising anything, profile.
Under cProfile the whole runtime sits inside jacobi(); switching to
line_profiler (the @profile decorator + kernprof -l) attributes it to a
single line — the stencil average. That line is the hot loop: roughly 96–98% of wall-clock for
a large grid. Everything else (the residual reduction, the buffer swap, the convergence test) is the
serial remainder.
$ kernprof -l -v jacobi.py Line # % Time Line Contents ========================================= 7 0.4 new = u.copy() 9 96.8 new[1:-1,1:-1] = 0.25 * ( ... ) # <-- the hot loop 12 2.4 res = np.max(np.abs(new - u)) 13 0.3 u, new = new, u 14 0.1 if res < tol: break
The arithmetic per cell is tiny — 4 additions and 1 multiply — but each cell touches 4 neighbours plus itself. That ratio of flops to bytes moved is what will make this kernel memory-bound, a conclusion we will pin down precisely in the performance analysis via the roofline demo.
2 · Is it worth parallelising?
Never parallelise on faith — estimate the ceiling first. Of the total work, the fraction that can run in parallel is $p$; the rest, $s = 1 - p$, is stubbornly serial. From the profile the stencil sweep is the parallel part. Being conservative and folding the residual reduction's setup into the serial share, take $p = 0.97$, so $s = 0.03$. Amdahl's law bounds the speedup on $P$ processors when the problem size is fixed:
$$ S_{\text{Amdahl}}(P) = \frac{1}{s + \dfrac{p}{P}} = \frac{1}{(1-p) + \dfrac{p}{P}} \qquad\xrightarrow[P\to\infty]{}\qquad \frac{1}{s} $$
Plug $s = 0.03$, $p = 0.97$ into $S(P) = 1/(0.03 + 0.97/P)$ for a fixed $4096^2$ grid:
| P (cores) | denominator | speedup S | efficiency S/P |
|---|---|---|---|
| 8 | 0.03 + 0.121 = 0.151 | 6.6× | 83% |
| 16 | 0.03 + 0.061 = 0.091 | 11.0× | 69% |
| 64 | 0.03 + 0.015 = 0.045 | 22.1× | 35% |
| 256 | 0.03 + 0.0038 = 0.0338 | 29.6× | 12% |
| ∞ | 0.03 | 33.3× | → 0% |
Even with infinite cores a 3% serial fraction caps us at 33.3×. By 64 cores we are already at one-third efficiency — the serial residual test is becoming the wall. Takeaway: shaving the serial part (a parallel reduction instead of a global one) matters more than adding cores.
But the picture changes if the problem grows. In real HPC we rarely keep the grid fixed — we use more cores to solve a bigger problem in the same time. That is Gustafson's law, which measures scaled speedup and is far more optimistic:
$$ S_{\text{Gustafson}}(P) = s + p\cdot P = P - s\,(P - 1) $$
With $s = 0.03$, $p = 0.97$: $S(P) = 0.03 + 0.97P$. Each core handles its own tile, so the parallel work grows with $P$ while the serial part stays roughly constant.
| P (cores) | 0.03 + 0.97·P | scaled speedup | efficiency |
|---|---|---|---|
| 8 | 0.03 + 7.76 | 7.8× | 97% |
| 64 | 0.03 + 62.1 | 62.1× | 97% |
| 256 | 0.03 + 248.3 | 248.3× | 97% |
Under weak scaling the efficiency holds near 97% all the way up. Reconciliation: Amdahl answers "faster on the same grid?" (limited), Gustafson answers "bigger grid in the same time?" (scales). Our stencil is a Gustafson workload — we grow $N$ with the core count, so the project targets weak scaling.
3 · Shared-memory parallelism with OpenMP
The first real implementation step targets a single multi-core node. All threads share the same
grid in memory, so we do not move data — we just divide the rows of the sweep among threads. In C, one
#pragma omp parallel for over the outer loop expresses the whole thing; OpenMP forks a team of
threads, splits the iteration space, and joins at the end of the loop. This is the
SIMD/SPMD idea: many threads, one instruction stream, different data. Compare
shared-memory vs message-passing models in the Flynn's taxonomy demo.
jacobi_omp.c — threaded stencil sweep
#pragma omp parallel for collapse(2) schedule(static) \ reduction(max:res) for (int i = 1; i < N-1; i++) { for (int j = 1; j < N-1; j++) { double v = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]); double d = fabs(v - u[i][j]); if (d > res) res = d; /* combined by reduction(max) */ new[i][j] = v; } } /* swap u and new pointers, then test res < tol */
Scheduling. For a uniform stencil every row costs the
same, so schedule(static) — a fixed, contiguous block of rows per thread — is exactly right: it
has zero scheduling overhead and keeps each thread working on a contiguous, cache-friendly slab.
schedule(dynamic) would only pay off if cell costs were irregular (e.g. an adaptive grid); here
its per-chunk bookkeeping is pure overhead.
The sweep updates $N - 2 = 4094$ interior rows. schedule(static) hands each of 8
threads $\lfloor 4094/8 \rfloor = 511$ rows, with the first $4094 \bmod 8 = 6$ threads taking one extra.
- threads 0–5: 512 rows each → 6 × 512 = 3072 rows
- threads 6–7: 511 rows each → 2 × 511 = 1022 rows
- total: 3072 + 1022 = 4094 rows ✓, max imbalance = 1 row = 0.2%
Imbalance under 1% means the static split is essentially perfect for this regular grid — no reason to reach for dynamic scheduling. The slowest thread sets the step time, so 0.2% imbalance costs us at most 0.2% of the parallel speedup.
Two shared-memory hazards the project must address explicitly:
- False sharing. Two threads writing to different cells that happen to fall on the same 64-byte cache line force that line to bounce between cores. Because we partition by whole rows (large contiguous blocks), the shared boundary is just one cache line per thread pair — negligible. Splitting by column would have been a disaster.
- NUMA & first-touch. On a multi-socket node, physical pages are placed on the socket of
the thread that first writes them. So we initialise the grid with the same
parallel forpartitioning we use for the sweep — each thread first-touches the rows it will own, placing that memory on its local socket and avoiding slow cross-socket reads every step.
4 · Distributed-memory parallelism with MPI
One node runs out of cores and memory eventually. To use a whole cluster, each process owns a private slice of the grid in its own address space and the only way to share data is to send messages. We split the $N \times N$ grid by domain decomposition into a $\sqrt{P} \times \sqrt{P}$ array of tiles, one per MPI rank. A rank can update its interior alone — but the cells on a tile's edge need the neighbouring tile's edge values, which live in another process's memory.
Halo / ghost cells. Each tile is padded with a one-cell-thick border — the halo — that holds a copy of the neighbour's edge. Before every Jacobi step, neighbours exchange these borders. After the exchange, every cell has all four neighbours locally and the sweep proceeds exactly as in the serial code. This is the heartbeat of the program.
jacobi_mpi.c — one timestep on each rank
/* 1. exchange halos with the 4 cartesian neighbours (non-blocking) */ MPI_Irecv(halo_top, N_loc, MPI_DOUBLE, nbr_up, 0, comm, &r[0]); MPI_Irecv(halo_bottom, N_loc, MPI_DOUBLE, nbr_down, 1, comm, &r[1]); MPI_Isend(row_first, N_loc, MPI_DOUBLE, nbr_up, 1, comm, &r[2]); MPI_Isend(row_last, N_loc, MPI_DOUBLE, nbr_down, 0, comm, &r[3]); /* ... same for left/right neighbours ... */ MPI_Waitall(8, r, MPI_STATUSES_IGNORE); /* 2. local Jacobi sweep over this rank's tile (halos now valid) */ double res_local = stencil_sweep(u, new, n_loc, m_loc); /* 3. global convergence: combine every rank's residual */ MPI_Allreduce(&res_local, &res_global, 1, MPI_DOUBLE, MPI_MAX, comm); if (res_global < tol) break;
Two communication costs dominate, and both are worth modelling with the standard $\alpha$–$\beta$ model where $\alpha$ is per-message latency and $\beta$ is per-byte inverse bandwidth:
- Halo exchange — each rank sends 4 borders of $\sqrt{N^2/P}$ doubles to fixed neighbours. Cost per step $\approx 4(\alpha + \beta\cdot 8\sqrt{N^2/P})$. Crucially this is independent of $P$ for a fixed tile size — it is nearest-neighbour, so it scales.
- Allreduce — the global residual is one collective over all $P$ ranks. A naïve linear reduction costs $O(P)$ messages; a tree-based reduction costs $O(\log_2 P)$. MPI implementations use the tree, which is why the collective stays cheap at scale.
A scalar MPI_Allreduce over $P$ ranks. Count the communication rounds (each round
$\approx \alpha$ latency) for the linear and the tree algorithms.
| P | linear (P−1) | tree (2·⌈log₂P⌉) | speedup |
|---|---|---|---|
| 16 | 15 | 8 | 1.9× |
| 64 | 63 | 12 | 5.3× |
| 1024 | 1023 | 20 | 51× |
At 1024 ranks the tree does its reduce-then-broadcast in $2\lceil\log_2 1024\rceil = 20$ latency-bound rounds versus 1023 for the linear chain — a 51× reduction in latency cost. This is why the residual test, which Amdahl flagged as the serial bottleneck, must use a collective, never a hand-rolled gather-to-rank-0.
Overlapping communication and computation. The
non-blocking MPI_Isend/Irecv above let us start the halo exchange, immediately update
the tile's interior (which doesn't need halos) while messages are in flight, then update the thin
boundary ring once MPI_Waitall returns. Done well this hides almost all communication
latency behind useful work. Walk the exchange step-by-step in the
MPI halo-exchange demo; the message-passing model itself is the right column of
the Flynn taxonomy demo.
5 · GPU offload
A stencil is a near-perfect fit for a GPU: thousands of identical, independent cell updates map directly onto thousands of hardware threads. We launch one CUDA thread per interior cell, grouped into 2-D thread blocks (e.g. $16 \times 16 = 256$ threads) tiled across the grid. The GPU runs threads in lockstep warps of 32 under the SIMT (Single Instruction, Multiple Thread) model — the GPU corner of Flynn's taxonomy.
jacobi.cu — one thread per cell
__global__ void jacobi_step(const double* u, double* out, int N) { int j = blockIdx.x * blockDim.x + threadIdx.x; // column int i = blockIdx.y * blockDim.y + threadIdx.y; // row if (i > 0 && i < N-1 && j > 0 && j < N-1) { out[i*N + j] = 0.25 * ( u[(i-1)*N + j] + u[(i+1)*N + j] + u[i*N + (j-1)] + u[i*N + (j+1)] ); } } // launch: dim3 block(16,16); dim3 grid((N+15)/16, (N+15)/16); // jacobi_step<<<grid, block>>>(d_u, d_out, N);
Two properties decide whether the GPU actually goes fast:
- Coalesced memory access. Consecutive threads in a warp (consecutive
j) read consecutive addressesu[i*N + j], so the hardware merges 32 reads into a few wide memory transactions. Storing the grid row-major and mappingthreadIdx.xto the column is what makes the access coalesced — column-major mapping would scatter the reads and tank bandwidth. - Why it is still memory-bound. Each cell does ~5 flops but loads 5 doubles (40 bytes). The GPU has enormous compute throughput but the kernel can't feed it — performance is capped by memory bandwidth, not the ALUs. Using shared memory to stage each block's tile (so the 4 neighbour reads hit fast on-chip memory instead of global memory) is the standard optimisation, raising the effective arithmetic intensity.
6 · Performance analysis
Implementation is half the project; the other half is measuring and explaining what you built. We report strong scaling (fixed problem, more cores), weak scaling (problem grows with cores), the Karp–Flatt experimental serial fraction, and where the kernel lands on the roofline. The tables below are modelled from the $p = 0.97$ kernel and a representative node memory bandwidth.
Strong scaling — fixed 4096² grid
Same grid, more cores. Speedup $S(P) = T_1/T_P$, efficiency $E = S/P$. This is the Amdahl regime, so $E$ falls as the serial residual test starts to dominate.
| cores P | time T (s) | speedup S | efficiency E |
|---|---|---|---|
| 1 | 240.0 | 1.0× | 100% |
| 2 | 123.7 | 1.94× | 97% |
| 4 | 65.2 | 3.68× | 92% |
| 8 | 36.4 | 6.59× | 82% |
| 16 | 21.8 | 11.0× | 69% |
| 32 | 14.7 | 16.3× | 51% |
| 64 | 10.9 | 22.0× | 34% |
Weak scaling — 4096² cells per core
Grid grows with $P$ so each core always holds a $4096^2$ tile. Ideal weak scaling keeps step time flat; the slow creep here is the $O(\log P)$ Allreduce plus halo cost.
| cores P | grid | step time (ms) | weak efficiency |
|---|---|---|---|
| 1 | 4096² | 52.0 | 100% |
| 4 | 8192² | 53.1 | 98% |
| 16 | 16384² | 54.4 | 96% |
| 64 | 32768² | 56.2 | 93% |
| 256 | 65536² | 58.6 | 89% |
The Karp–Flatt metric recovers the real serial fraction $e$ from measured speedup, capturing overheads Amdahl's idealised $s$ ignores: $\displaystyle e = \frac{1/S - 1/P}{1 - 1/P}$. Use the strong-scaling row at $P = 8$, $S = 6.59$:
- numerator: $1/S - 1/P = 1/6.59 - 1/8 = 0.1517 - 0.1250 = 0.0267$
- denominator: $1 - 1/P = 1 - 0.125 = 0.875$
- result: $e = 0.0267 / 0.875 = \mathbf{0.0305}$ (≈ 3.0%)
- check at $P = 64$, $S = 22.0$: $e = (0.04545 - 0.015625)/0.9844 = \mathbf{0.0303}$
$e \approx 0.030$ stays flat as $P$ grows, which is the signature of a genuine serial fraction (our residual test) rather than parallel-overhead growth. If $e$ had risen with $P$, the lost efficiency would point to communication overhead instead — the diagnostic value of Karp–Flatt. Compare both tables live in the scaling-laws demo.
Arithmetic intensity $I$ = flops ÷ bytes moved from DRAM. Per interior cell the 5-point Jacobi does 4 adds + 1 multiply = 5 flops. With double precision and the unavoidable streaming traffic (read the cell's source values + write the result; with good caching the four neighbours are reused from adjacent cells, so the irreducible traffic is ~1 read + 1 write = 16 bytes):
- flops/cell: 5
- bytes/cell (best case, cached neighbours): 8 (read) + 8 (write) = 16 B
- intensity: $I = 5 / 16 = \mathbf{0.31}$ flop/byte (naïve, uncached: $5/40 = 0.125$)
- machine balance: a node with 200 GFLOP/s and 100 GB/s has ridge point $I^* = 200/100 = 2.0$ flop/byte
Since $I = 0.31 \ll I^* = 2.0$, the kernel sits firmly on the bandwidth-bound slope of the roofline, far left of the ridge. Attainable performance $= I \times \text{BW} = 0.31 \times 100 = \mathbf{31}$ GFLOP/s — only ~16% of peak, and no amount of faster arithmetic helps. The lever is bandwidth: cache blocking, shared-memory tiling on the GPU, or fusing timesteps to raise $I$. See the roofline demo and the memory-hierarchy demo.
7 · Robustness at scale
A simulation that runs for hours on hundreds of nodes will meet a node failure, a queue time limit, or a full disk. Production HPC code is judged as much on surviving these as on raw speed — two mechanisms make the project robust.
- Checkpoint / restart. Every $K$ steps each rank writes its tile and the current iteration number to disk. On restart the program detects the latest complete checkpoint and resumes from it instead of step zero. The cost is a tunable trade-off: checkpoint too often and I/O dominates; too rarely and a crash wastes hours of compute. Young's formula gives the optimal interval $\tau_{\text{opt}} = \sqrt{2\,\delta\,M}$, where $\delta$ is the time to write one checkpoint and $M$ is the mean time between failures.
- Parallel I/O. Having all ranks funnel their data to rank 0 to write a single file
re-introduces a serial bottleneck — the very thing Amdahl warned about. Instead every rank writes its own
region of one shared file concurrently with
MPI_File_write_at_all(MPI-IO), so write bandwidth scales with the parallel filesystem (e.g. Lustre) rather than a single stream.
A run on 256 nodes has a per-node MTBF of 5 years, so the system MTBF is $M = 5\text{ yr} / 256 \approx 6.4 \times 10^{5}$ s ≈ 178 h. A full checkpoint takes $\delta = 60$ s. Then $\tau_{\text{opt}} = \sqrt{2 \delta M}$:
- M in seconds: $5 \times 365 \times 24 \times 3600 / 256 \approx 6.16 \times 10^{5}$ s
- 2·δ·M: $2 \times 60 \times 6.16\times10^{5} = 7.39 \times 10^{7}$
- τ_opt: $\sqrt{7.39\times10^{7}} \approx 8597$ s ≈ 2.4 hours
So checkpoint roughly every 2.4 h, not every few minutes. As the job scales to more nodes the system MTBF shrinks ($M \propto 1/\text{nodes}$), so $\tau_{\text{opt}}$ shrinks too — at extreme scale checkpointing becomes a first-order design concern, not an afterthought.
8 · Scaling out to distributed ML
The syllabus extends HPC into distributed machine learning, and the bridge is tighter than it looks: training a large neural network is the same communication problem as our halo exchange, dressed in different vocabulary. Two parallel strategies mirror the two ways we split the stencil.
- Data parallelism ≈ domain decomposition. Every worker holds a full copy of the model but
processes a different shard of the training batch — exactly as every MPI rank ran the identical stencil on a
different tile. Each worker computes gradients locally, then all workers average them with a
gradient all-reduce before stepping. That all-reduce is the same
MPI_Allreduce(typically ring- or tree-based) we used for the global residual — only now the reduced quantity is a vector of gradients instead of one scalar. - Model parallelism ≈ halo exchange. When a model is too big for one device it is split across devices, and the activations at each split boundary must be sent to the next device every forward/backward pass — a nearest-neighbour exchange of boundary data, structurally identical to passing halo rows between neighbouring tiles.
The same analysis tools carry over wholesale. Amdahl/Gustafson bound how much faster training gets with more GPUs; the $\alpha$–$\beta$ model predicts whether the gradient all-reduce (which grows with model size) will dominate; and the roofline tells you whether a given layer is compute- or memory-bound. A student who understands the Jacobi stencil already understands the skeleton of distributed SGD.
9 · Results & reflection
A sample modelled run, end to end — serial baseline, then the same $4096^2$ grid on 8 OpenMP threads, then a $32768^2$ weak-scaled grid on 64 MPI ranks across 8 nodes:
$ python jacobi.py # serial baseline, 4096^2 converged in 1840 iters, residual 9.7e-05 wall 240.1 s $ OMP_NUM_THREADS=8 ./jacobi_omp # 1 node, 8 cores, 4096^2 converged in 1840 iters, residual 9.7e-05 wall 36.4 s (6.6x) $ mpirun -np 64 ./jacobi_mpi # 8 nodes, 32768^2 weak scaled grid 32768x32768 tiles 8x8 halo 8/step converged in 1841 iters, residual 9.9e-05 step 56.2 ms weak-eff 93% # strong-scaling on the fixed grid, summarised: $ cat scaling.txt P=1 240.0s 1.0x | P=8 36.4s 6.6x | P=64 10.9s 22.0x
What worked. The kernel parallelised cleanly: OpenMP delivered a near-linear 6.6× on one socket with a 0.2% load imbalance, and MPI weak-scaling held above 89% to 256 cores because the communication is nearest-neighbour (halo) plus a logarithmic collective (Allreduce). Tree reductions and non-blocking, overlapped halo exchange kept the serial fraction pinned at the ~3% Karp–Flatt floor.
The bottleneck. The roofline analysis was decisive: at $I = 0.31$ flop/byte the stencil is memory-bandwidth-bound, reaching only ~16% of compute peak. Faster arithmetic is wasted; the real wins come from cache blocking and on-chip tiling that cut DRAM traffic. The secondary limit is the global residual test — a true serial fraction that caps strong scaling at ~33× no matter the core count.
Mapping project stages to learning outcomes
How each stage of the build demonstrates a syllabus outcome and which live demo animates it:
| Project stage | Learning outcome demonstrated | Demo |
|---|---|---|
| Problem & baseline | Model a real problem; profile to find the hot loop | roofline, memory |
| Amdahl / Gustafson | Quantify the speedup ceiling before optimising | amdahl, scaling |
| OpenMP threading | Shared-memory parallelism, scheduling, NUMA, false sharing | openmp, flynn |
| MPI + halo exchange | Distributed memory, decomposition, collectives, comm cost | mpi |
| GPU offload | SIMT, coalescing, accelerator memory model | gpu |
| Performance analysis | Strong/weak scaling, Karp–Flatt, roofline placement | scaling, roofline |
| Robustness at scale | Checkpoint/restart, parallel I/O, fault tolerance | — |
| Scaling out to ML | Data vs model parallelism; gradient all-reduce | mapreduce |
References
-
Introduction to High Performance Computing for Scientists and EngineersThe roofline model, stencil performance, NUMA/first-touch and the memory-bound analysis used throughout sections 3, 5 and 6 trace directly to this text.
-
Using MPI: Portable Parallel Programming with the Message-Passing Interface, 3rd ed.Domain decomposition, ghost/halo exchange with non-blocking sends, and collective (Allreduce) cost models underpin section 4.
-
Programming Massively Parallel Processors, 4th ed.SIMT execution, memory coalescing and shared-memory tiling of stencils inform the GPU offload in section 5.
-
Roofline: An Insightful Visual Performance Model for Multicore ArchitecturesThe arithmetic-intensity / ridge-point analysis applied to the Jacobi kernel in section 6.
-
hpc-lab — interactive demos & course outlineThe interactive demos animate Amdahl/Gustafson, scaling, roofline, the memory hierarchy, Flynn's taxonomy, OpenMP, MPI, GPU and MapReduce; the course outline maps every session.