IE BCSAI · High-Performance Computing

Watching a stencil compute the weather.

This project is a 3D, 7-point stencil mini-weather solver designed to run on HPC clusters with MPI + OpenMP + CUDA. This page shows what that algorithm actually does: a 2D port runs live in your browser, alongside the real outputs produced by the cluster runs.

Live

Interactive 2D stencil

The cluster code applies a 7-point stencil on a 3D grid. The same kernel logic runs here on a 2D slice: every frame, each cell is replaced by a weighted average of itself and its neighbours, which amounts to one explicit diffusion step. Heat sources, sinks and an optional wind keep the field interesting.

Initial condition

Diffusion 0.20

Buoyancy 0.10

Wind (horizontal) 0.00

Resolution 128 × 96 (higher is sharper, but means more cells per step)

What does the stencil do?

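Each timestep computes u'_{i,j} = (1−4α)·u_{i,j} + α·(u_{i+1,j} + u_{i−1,j} + u_{i,j+1} + u_{i,j−1}), the discrete heat equation written as a finite-difference stencil. In the cluster code there's a third axis, so the kernel touches 7 cells (centre + 6 neighbours) and the centre weight becomes (1−6α). The 2D form follows from the 3D one by setting both out-of-plane neighbours equal to the centre cell: (1−6α) + 2α = 1−4α. Buoyancy adds an upward pull on hot cells; wind adds horizontal advection.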
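A minimal C++ sketch of one 2D diffusion step in the five-point form, with centre weight (1−4α). The function name, the row-major layout and the copy-through boundary treatment are assumptions made for illustration. They are not taken from the browser port or the cluster code, and buoyancy and wind are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One explicit diffusion step on an nx-by-ny grid stored row-major.
// Boundary cells are copied through unchanged (an assumed boundary treatment).
std::vector<double> stencil_step_2d(const std::vector<double>& u,
                                    std::size_t nx, std::size_t ny,
                                    double alpha) {
    assert(u.size() == nx * ny);
    std::vector<double> next(u);  // start from a copy so boundaries carry over
    for (std::size_t j = 1; j + 1 < ny; ++j) {
        for (std::size_t i = 1; i + 1 < nx; ++i) {
            const std::size_t c = j * nx + i;  // flat index of cell (i, j)
            next[c] = (1.0 - 4.0 * alpha) * u[c]
                    + alpha * (u[c - 1] + u[c + 1]      // left, right
                             + u[c - nx] + u[c + nx]);  // up, down
        }
    }
    return next;
}
```

Reading from `u` and writing to a separate `next` keeps every cell update based on the previous timestep only, which is what makes the step trivially parallel. With a single hot cell and α = 0.2, the cell keeps 0.2 of its value and each of its four neighbours receives 0.2.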

On an HPC cluster the same operation is partitioned across ranks (MPI), each rank spreads its share of the grid across cores with threads (OpenMP), and every step exchanges thin "halo" layers between neighbouring ranks. See the actual source files →
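The halo idea can be sketched in a single process, without any MPI calls. The example below splits a 1D domain into two subdomains that stand in for two ranks. Each holds one extra "halo" cell per side, refreshed from its neighbour before every step. The `Subdomain` type and both function names are hypothetical, and the 1D update with centre weight (1−2α) is only the one-dimensional analogue of the stencil, not the project's halo.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A 1D subdomain: interior cells plus one halo cell on each side.
struct Subdomain {
    std::vector<double> cells;  // cells.front() and cells.back() are halo cells
    explicit Subdomain(std::size_t interior) : cells(interior + 2, 0.0) {}
    std::size_t interior() const { return cells.size() - 2; }
};

// Copy each side's edge interior cell into the neighbour's halo cell.
// In an MPI code this pair of copies would be a send/receive between ranks.
void exchange_halo(Subdomain& left, Subdomain& right) {
    right.cells.front() = left.cells[left.interior()];  // left's last interior cell
    left.cells.back() = right.cells[1];                 // right's first interior cell
}

// One 1D diffusion step on the interior; halo cells are read but never written.
void step(Subdomain& d, double alpha) {
    const std::vector<double> old = d.cells;  // snapshot of the previous timestep
    for (std::size_t i = 1; i + 1 < old.size(); ++i) {
        d.cells[i] = (1.0 - 2.0 * alpha) * old[i]
                   + alpha * (old[i - 1] + old[i + 1]);
    }
}
```

Calling `exchange_halo` before each `step` lets heat cross the cut between the two subdomains exactly as it would in an undivided grid. Skipping the exchange would make each subdomain see stale values at its edge.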

From the cluster

Real outputs from real runs

These images were produced by the C++ code running on the Magic Castle cluster. They are checked in under source/results/ and rendered here as-is. The browser simulation above is a live 2D analogue; the images below are the genuine 3D solver output.

Animation

Time evolution of the field, exported as an animated GIF from the 3D run.

Wave preset

Plane-wave initial condition, 2D slices through the volume.

Sphere preset

Spherical Gaussian, diffused outward by the 7-point operator.

Default

Initial state of the default scenario.

Why this matters

The four backends

The same stencil is implemented four ways, each in its own source file under source/src/. The point of the project is to see how much faster the same physics gets as you move from a textbook implementation to one that uses every resource on a node.

Serial

Single thread, straight triple loop. Baseline — every other variant is measured against this.

src/stencil_cpu_serial.cpp
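A straight triple loop of this kind might look like the sketch below. It is not the contents of stencil_cpu_serial.cpp: the function name, the `idx` helper, the i-fastest memory layout and the copy-through boundaries are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flat index of cell (i, j, k); i varies fastest (assumed memory layout).
inline std::size_t idx(std::size_t i, std::size_t j, std::size_t k,
                       std::size_t nx, std::size_t ny) {
    return (k * ny + j) * nx + i;
}

// One 7-point stencil step over the interior of an nx*ny*nz grid.
// `in` and `out` must be different vectors; boundary cells are copied
// through unchanged (an assumed boundary treatment).
void stencil_serial(const std::vector<double>& in, std::vector<double>& out,
                    std::size_t nx, std::size_t ny, std::size_t nz,
                    double alpha) {
    assert(in.size() == nx * ny * nz);
    assert(&in != &out);
    out = in;
    const std::size_t sj = nx;       // stride between neighbouring rows
    const std::size_t sk = nx * ny;  // stride between neighbouring planes
    for (std::size_t k = 1; k + 1 < nz; ++k) {
        for (std::size_t j = 1; j + 1 < ny; ++j) {
            for (std::size_t i = 1; i + 1 < nx; ++i) {
                const std::size_t c = idx(i, j, k, nx, ny);
                out[c] = (1.0 - 6.0 * alpha) * in[c]
                       + alpha * (in[c - 1] + in[c + 1]
                                + in[c - sj] + in[c + sj]
                                + in[c - sk] + in[c + sk]);
            }
        }
    }
}
```

Keeping `i` in the innermost loop matches the assumed layout, so consecutive iterations touch consecutive memory addresses.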

Cache-blocked

Same algorithm, but the loops are tiled so that each block fits in L1/L2 cache. Same FLOPs, fewer cache misses.

src/stencil_cpu_blocked.cpp
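Loop tiling can be sketched as three outer loops that walk over tiles and three inner loops that sweep the cells inside one tile. The function name, the single cubic `tile` parameter and the copy-through boundaries are assumptions for illustration. The real stencil_cpu_blocked.cpp may choose tile shapes differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One 7-point stencil step, visiting the interior tile by tile.
// `in` and `out` must be different vectors; i varies fastest in memory.
void stencil_blocked(const std::vector<double>& in, std::vector<double>& out,
                     std::size_t nx, std::size_t ny, std::size_t nz,
                     double alpha, std::size_t tile) {
    assert(in.size() == nx * ny * nz);
    assert(&in != &out && tile > 0);
    out = in;                                // boundaries are copied through
    if (nx < 3 || ny < 3 || nz < 3) return;  // no interior cells to update
    const std::size_t sj = nx, sk = nx * ny;
    for (std::size_t kk = 1; kk < nz - 1; kk += tile)
    for (std::size_t jj = 1; jj < ny - 1; jj += tile)
    for (std::size_t ii = 1; ii < nx - 1; ii += tile) {
        // Clamp the tile so the last, partial tile stops at the interior edge.
        const std::size_t k_end = std::min(kk + tile, nz - 1);
        const std::size_t j_end = std::min(jj + tile, ny - 1);
        const std::size_t i_end = std::min(ii + tile, nx - 1);
        for (std::size_t k = kk; k < k_end; ++k)
        for (std::size_t j = jj; j < j_end; ++j)
        for (std::size_t i = ii; i < i_end; ++i) {
            const std::size_t c = (k * ny + j) * nx + i;
            out[c] = (1.0 - 6.0 * alpha) * in[c]
                   + alpha * (in[c - 1] + in[c + 1]
                            + in[c - sj] + in[c + sj]
                            + in[c - sk] + in[c + sk]);
        }
    }
}
```

The arithmetic per cell is identical to the untiled loop. Only the visiting order changes, so neighbouring planes and rows are reused while they are still likely to be in cache. A good `tile` value depends on the cache sizes of the target CPU and is best found by measurement.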

CPU parallel (OpenMP / MPI)

Threads across cores, ranks across nodes; halo exchange keeps the stencil consistent at boundaries.

src/stencil_cpu_parallel.cpp · halo.cpp
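The threading half of this variant can be sketched with a single OpenMP directive on the outer loops. This is an illustrative sketch under assumed names and layout, not the project's stencil_cpu_parallel.cpp, and it leaves out the MPI decomposition and halo exchange entirely. The `_OPENMP` guard lets the same file compile serially when OpenMP is not enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One 7-point stencil step with the k and j loops shared among threads.
// `in` and `out` must be different vectors; i varies fastest in memory.
void stencil_openmp(const std::vector<double>& in, std::vector<double>& out,
                    std::ptrdiff_t nx, std::ptrdiff_t ny, std::ptrdiff_t nz,
                    double alpha) {
    assert(nx > 0 && ny > 0 && nz > 0);
    assert(in.size() == static_cast<std::size_t>(nx * ny * nz));
    assert(&in != &out);
    out = in;  // boundaries are copied through (assumed boundary treatment)
    const double* src = in.data();
    double* dst = out.data();
    const std::ptrdiff_t sj = nx, sk = nx * ny;
    // Each (k, j) row is written by exactly one thread and `src` is only
    // read, so the loop needs no locks or atomics.
#ifdef _OPENMP
#pragma omp parallel for collapse(2)
#endif
    for (std::ptrdiff_t k = 1; k < nz - 1; ++k) {
        for (std::ptrdiff_t j = 1; j < ny - 1; ++j) {
            for (std::ptrdiff_t i = 1; i < nx - 1; ++i) {
                const std::ptrdiff_t c = (k * ny + j) * nx + i;
                dst[c] = (1.0 - 6.0 * alpha) * src[c]
                       + alpha * (src[c - 1] + src[c + 1]
                                + src[c - sj] + src[c + sj]
                                + src[c - sk] + src[c + sk]);
            }
        }
    }
}
```

Collapsing the two outer loops gives the scheduler more iterations to distribute than the `k` loop alone, which helps when the grid has few planes. The innermost `i` loop stays contiguous so the compiler can still vectorise it.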

GPU (CUDA)

Same operator, but each cell is updated by its own CUDA thread, with coalesced loads and shared-memory tiling.

src/stencil_gpu.cpp · stencil_gpu.cu