IE BCSAI · High-Performance Computing

Watching a stencil compute the weather.

This project is a 3D, 7-point stencil mini-weather solver designed to run on HPC clusters with MPI + OpenMP + CUDA. This page shows what that algorithm actually does: a 2D port runs live in your browser, alongside the real outputs produced by the cluster runs.

Live

Interactive 2D stencil

The cluster code applies a 7-point stencil on a 3D grid. The same kernel logic runs here on a 2D slice — every frame, each cell is replaced by a weighted average of its neighbors plus a small diffusion term. Heat sources, sinks and an optional wind keep the field interesting.
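
As a rough illustration of the per-frame update, the sketch below applies a 5-point stencil (the 2D analogue of the 7-point operator) to a row-major grid. It is not the code that runs on this page or on the cluster: the function name step2d, the weight alpha and the fixed-value boundary are assumptions made for the example, and heat sources, sinks and wind are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One explicit diffusion step on an nx-by-ny grid stored row-major.
// alpha is a hypothetical diffusion weight; values up to 0.25 keep this
// explicit update stable.
// Boundary cells are copied through unchanged (an assumed fixed-value boundary).
std::vector<double> step2d(const std::vector<double>& in,
                           std::size_t nx, std::size_t ny, double alpha) {
    assert(in.size() == nx * ny);
    std::vector<double> out(in);  // start from a copy so the boundary is preserved
    for (std::size_t y = 1; y + 1 < ny; ++y) {
        for (std::size_t x = 1; x + 1 < nx; ++x) {
            const std::size_t i = y * nx + x;
            const double c = in[i];
            // West, east, north and south neighbours minus four times the centre.
            const double lap = in[i - 1] + in[i + 1] + in[i - nx] + in[i + nx] - 4.0 * c;
            out[i] = c + alpha * lap;
        }
    }
    return out;
}
```

Reading from one buffer and writing to another matters here: updating in place would mix old and new values within the same step. With alpha = 0.25, a single hot cell away from the boundary is spread entirely onto its four neighbors in one step; smaller values diffuse more slowly.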

From the cluster

Real outputs from real runs

These images are produced by the C++ code running on the Magic Castle cluster — checked in under source/results/ and rendered as-is here. The browser simulation above is a live 2D analogue; the images below are the genuine 3D solver output.

Animation

Time evolution of the field, exported as an animated GIF from the 3D run.

Animated GIF showing time evolution of the 3D stencil field

Wave preset

Plane-wave initial condition, 2D slices through the volume.

2D slices of the wave preset · 3D isosurface of the wave preset

Sphere preset

Spherical Gaussian, diffused outward by the 7-point operator.

2D slices of the sphere preset · 3D isosurface of the sphere preset

Default

Initial state of the default scenario.

2D slices of the default scenario · 3D isosurface of the default scenario
Why this matters

The four backends

The same stencil is implemented four ways, each in source files committed under source/src/. The point of the project is to see how much faster the same physics runs as you move from a textbook implementation to one that uses every resource on a node.

Serial

Single thread, straight triple loop. Baseline — every other variant is measured against this.

src/stencil_cpu_serial.cpp
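
For readers who have not seen one, a straight triple loop for a 7-point stencil can be sketched as follows. This is an illustration, not the contents of stencil_cpu_serial.cpp: the function name stencil_serial, the weights c0 and c1, and the choice to skip boundary cells are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One serial sweep of a 7-point stencil over the interior of an
// nx * ny * nz grid stored with x varying fastest.
// c0 (centre weight) and c1 (neighbour weight) are hypothetical coefficients.
void stencil_serial(const std::vector<double>& in, std::vector<double>& out,
                    std::size_t nx, std::size_t ny, std::size_t nz,
                    double c0, double c1) {
    assert(in.size() == nx * ny * nz);
    assert(out.size() == in.size());
    const std::size_t sy = nx;       // stride between neighbouring rows
    const std::size_t sz = nx * ny;  // stride between neighbouring planes
    for (std::size_t z = 1; z + 1 < nz; ++z) {
        for (std::size_t y = 1; y + 1 < ny; ++y) {
            for (std::size_t x = 1; x + 1 < nx; ++x) {
                const std::size_t i = z * sz + y * sy + x;
                // Centre cell plus its six face neighbours.
                out[i] = c0 * in[i]
                       + c1 * (in[i - 1] + in[i + 1]
                             + in[i - sy] + in[i + sy]
                             + in[i - sz] + in[i + sz]);
            }
        }
    }
}
```

Keeping x in the innermost loop means consecutive iterations touch consecutive memory, which is the access pattern the other variants build on.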
Cache-blocked

Same algorithm, but the loops are tiled so each block's working set fits in L1/L2 cache. Same FLOPs, fewer cache misses.

src/stencil_cpu_blocked.cpp
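
Loop tiling can be sketched as below. This is an assumed layout rather than the contents of stencil_cpu_blocked.cpp: the name stencil_blocked, the single tile parameter and the decision to tile only y and z (leaving x as a long unit-stride inner loop) are choices made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// 7-point sweep with the y and z loops split into tiles of `tile` cells.
// The arithmetic per cell is identical to an untiled triple loop; only the
// visiting order changes, so neighbouring planes are reused while still cached.
// c0, c1 and tile are hypothetical parameters.
void stencil_blocked(const std::vector<double>& in, std::vector<double>& out,
                     std::size_t nx, std::size_t ny, std::size_t nz,
                     double c0, double c1, std::size_t tile) {
    assert(tile > 0);
    assert(in.size() == nx * ny * nz);
    assert(out.size() == in.size());
    const std::size_t sy = nx;
    const std::size_t sz = nx * ny;
    for (std::size_t zz = 1; zz + 1 < nz; zz += tile) {
        const std::size_t zend = std::min(zz + tile, nz - 1);  // clip last tile
        for (std::size_t yy = 1; yy + 1 < ny; yy += tile) {
            const std::size_t yend = std::min(yy + tile, ny - 1);
            for (std::size_t z = zz; z < zend; ++z) {
                for (std::size_t y = yy; y < yend; ++y) {
                    for (std::size_t x = 1; x + 1 < nx; ++x) {
                        const std::size_t i = z * sz + y * sy + x;
                        out[i] = c0 * in[i]
                               + c1 * (in[i - 1] + in[i + 1]
                                     + in[i - sy] + in[i + sy]
                                     + in[i - sz] + in[i + sz]);
                    }
                }
            }
        }
    }
}
```

A good tile size depends on the cache sizes of the target CPU and is usually found by measurement rather than derived up front.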
CPU parallel (OpenMP / MPI)

OpenMP threads across cores, MPI ranks across nodes; halo exchange keeps the stencil consistent at subdomain boundaries.

src/stencil_cpu_parallel.cpp · halo.cpp
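
The idea behind halo exchange can be shown without MPI. In the sketch below, two slabs of the grid live in one process and trade boundary planes with plain copies; in a real MPI code each copy would be a message between ranks (for example via MPI_Sendrecv). The Slab type, exchange_halo and the one-plane ghost layout are hypothetical and are not taken from halo.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// A slab of the global grid owned by one hypothetical rank: local_nz interior
// z-planes, plus one ghost plane below (plane 0) and one above (plane local_nz + 1).
struct Slab {
    std::size_t plane;     // cells per z-plane (nx * ny)
    std::size_t local_nz;  // interior planes owned by this slab
    std::vector<double> data;

    Slab(std::size_t plane_cells, std::size_t nz, double fill)
        : plane(plane_cells), local_nz(nz), data((nz + 2) * plane_cells, fill) {}

    // Pointer to the first cell of plane k (ghost planes included).
    double* plane_ptr(std::size_t k) { return data.data() + k * plane; }
};

// Exchange halos between two slabs that are neighbours along z,
// with `lower` sitting directly below `upper`.
void exchange_halo(Slab& lower, Slab& upper) {
    assert(lower.plane == upper.plane);
    const std::size_t n = lower.plane;
    // lower's top interior plane fills upper's bottom ghost plane.
    double* top = lower.plane_ptr(lower.local_nz);
    std::copy(top, top + n, upper.plane_ptr(0));
    // upper's bottom interior plane fills lower's top ghost plane.
    double* bottom = upper.plane_ptr(1);
    std::copy(bottom, bottom + n, lower.plane_ptr(lower.local_nz + 1));
}
```

After the exchange, each slab can run the stencil over its own interior planes and read valid neighbor values across the cut, which is what keeps the decomposed result identical to a single-domain sweep.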
GPU (CUDA)

Same operator, but each cell is computed by its own CUDA thread, with coalesced loads and shared-memory tiling.

src/stencil_gpu.cpp · stencil_gpu.cu