From-Scratch Build · High-Performance Computing

Mandelbrot, parallelised and measured

Take one genuinely compute-heavy workload — escape-time rendering of the Mandelbrot set — and make it fly: profile the hotspot, vectorise it with NumPy, spread it across CPU cores with multiprocessing, and then prove the speedup with a real strong-scaling study. All four implementations return the same pixels; every speedup is a measured win, not a changed answer.

Python 3.12NumPymultiprocessing numba (opt)cProfileStrong scaling

The output

What it computes

For every pixel c of a complex-plane grid, iterate z → z² + c until |z| > 2 and record how many steps it took. That iteration count is the image. The workload is embarrassingly parallel — pixels are independent — but the per-pixel cost is wildly uneven (points inside the set run the full budget), which is what makes the scaling honest rather than trivially linear.

Mandelbrot render — a zoom into the seahorse-valley boundary, rendered with the parallel kernel
800×800, max_iter=1000, a zoom into the seahorse-valley boundary — rendered with the parallel kernel, identical pixel-for-pixel to the naive reference.

The implementations

Four ways to render the same image

Same input, same output array (verified), increasing speed.

reference

naive

A pure-Python double loop over pixels. Correct and readable, and the baseline everything is timed and checked against.

vectorised

numpy

One NumPy pass over the whole grid — Python-interpreter overhead paid once per iteration, not once per pixel.

parallel

multiprocessing

Split the rows into bands, run the NumPy kernel per band in its own process, stitch the bands back together.

optional JIT

numba

An @njit(parallel=True) kernel, enabled automatically if numba is installed. In-process threads, no spawn cost.

Profile first

The hotspot is exactly where you'd hope

cProfile on the naive kernel is unambiguous: essentially 100% of the time is the per-pixel escape loop in mandelbrot_naive — which is precisely the function the other three kernels replace.

   ncalls  tottime  cumtime  filename:lineno(function)
        1    0.588    0.588  hpc/mandelbrot.py:80(mandelbrot_naive)
        2    0.000    0.000  numpy/function_base.py:24(linspace)
>>> Hotspot (most self-time): mandelbrot_naive

The evidence

Strong-scaling study — real numbers

Fix the problem size (800×800, max_iter=1000), increase the worker processes, take the best of 3 runs after a warmup. Measured on a 16-core machine. speedup = T(1)/T(p), efficiency = speedup/p.

workerstime (s)speedupefficiency
123.8671.00x100.0%
215.2561.56x78.2%
413.8381.72x43.1%
810.5072.27x28.4%
168.6522.76x17.2%
Strong-scaling speedup curve: measured speedup well below the ideal linear line
Speedup vs. ideal linear.
Parallel efficiency curve dropping from 100% toward 17% as cores increase
Efficiency falls as cores grow.

Read honestly. The speedup is real (2.76x on 16 cores) but plainly sub-linear, and efficiency collapses as cores are added. That's Amdahl's law: the fixed overhead of this setup — spawning processes, pickling each result band back, the final concatenation — is work a fast NumPy kernel finishes in seconds, so more processes hit diminishing returns fast. On Windows the only portable start method is spawn, whose per-process cost is high, which is why the knee comes early.

Head to head

Where the speed actually comes from

Same 400×400 view, max_iter=400, all four kernels.

kerneltime (s)vs naive
naive (pure Python)8.0741.0x
numpy (vectorised)2.2203.6x
parallel (16 cores)2.5293.2x
numba (JIT, threads)0.0081021x

The honest punchline: vectorisation is the big win. At this size the multiprocessing version barely matches single-process NumPy — spawn overhead eats the gain — and only pulls ahead on the larger scaling instance above. Numba's in-process JIT avoids all process overhead and wins by three orders of magnitude: the right tool when you can use it.

Scope

Single node, honestly — and how it reaches MPI

This is a single-node, multi-core study. A real cluster with many machines and MPI is out of scope for one laptop — so the page doesn't claim it. But the decomposition used here is the distributed one, and the jump is mechanical:

  1. Domain decomposition

    Give MPI rank r of n its slice of rows — the same split this code already does.

  2. No halo exchange

    Pixels are independent, so unlike a stencil or PDE solver there is zero communication during compute — only the final gather.

  3. Gather

    Replace the process-pool map and concatenate with an MPI Gatherv onto rank 0.

  4. The real ceiling

    Communication is cheap (one gather); load imbalance isn't — an equal-rows split hands some ranks mostly-interior, expensive rows. A serious version would deal out smaller chunks dynamically.