From-Scratch Build · 07 · Distributed Computing
You have a workflow of tasks and a fistful of supercomputer allocations. How many "pilots" do you launch, and where? This Execution Manager answers that automatically, and it offers a window into how distributed computing systems schedule work across many machines at once.
What it is
EManager ("Execution Manager") takes a workflow description and runs it. You hand it a set of tasks — each needing so many cores for so long — and it executes them using a pilot framework across a dynamically chosen set of HPC resources. Crucially, it decides how many pilots to launch based on the work's size and what it learns about each target machine.
It's a proof-of-concept, deliberately small, and that's exactly why it's worth building: it shows the moving parts of distributed execution stripped down to essentials. It also pairs naturally with my Slack Command Bot build — two ends of the same "long-running coordinator" idea.
The core idea I wanted to learn: the pilot job abstraction. Instead of submitting each task to a cluster queue, you queue a few big placeholder jobs ("pilots"), then stream your real tasks into them. Only the pilots wait in the batch queue; the tasks skip it and pack onto the cores the pilots already hold.
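To make the pilot idea concrete, here is a toy simulation of one pilot. It is a minimal sketch, not the engine's real scheduler: the `Task` class, the `run_in_pilot` function, the FIFO policy, and the fixed `queue_wait` are all assumptions made up for illustration. The pilot pays the queue wait once, then tasks start as soon as enough of its cores are free.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    cores: int
    duration: float  # expected runtime in seconds


def run_in_pilot(tasks, pilot_cores, queue_wait):
    """Return the time at which the last task finishes inside one pilot.

    Toy model (an assumption, not the real engine): tasks start in FIFO
    order, and a task starts as soon as enough pilot cores are free.
    """
    for task in tasks:
        if task.cores > pilot_cores:
            raise ValueError(f"{task.name} needs more cores than the pilot has")

    now = queue_wait   # the pilot waits in the batch queue exactly once
    free = pilot_cores
    running = []       # min-heap of (finish_time, cores) for running tasks
    makespan = now

    for task in tasks:
        # Wait for running tasks to finish until this task fits.
        while task.cores > free:
            finish, cores = heapq.heappop(running)
            now = max(now, finish)
            free += cores
        heapq.heappush(running, (now + task.duration, task.cores))
        free -= task.cores
        makespan = max(makespan, now + task.duration)

    return makespan


tasks = [Task("a", 2, 10), Task("b", 2, 10), Task("c", 4, 5)]
finish_time = run_in_pilot(tasks, pilot_cores=4, queue_wait=100)
```

Here three tasks share a single batch submission. Submitting them as three separate jobs would mean three separate queue waits, each of unknown length.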
The stack
The pilot-job engine. Acquires resources as placeholders, then schedules your tasks onto them.
A synthetic workflow descriptor — lets you generate realistic task graphs to test the manager.
Knowledge about resource properties, so the manager can make informed placement decisions.
A single uniform API over heterogeneous HPC schedulers — one interface, many supercomputers.
Python remote objects — how the distributed components call each other across the network.
Allocations on large shared HPC systems — the machines the tasks actually ran on.
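The "one interface, many supercomputers" piece of the stack can be pictured as an adapter layer. The sketch below is not the access library's actual API. `SchedulerAdapter`, `SlurmAdapter`, `TorqueAdapter`, and `pilot_command` are hypothetical names, and the code only builds a command line rather than submitting anything. The Torque variant assumes the whole pilot fits on a single node.

```python
from abc import ABC, abstractmethod


def hhmmss(seconds: int) -> str:
    """Format a walltime in seconds as HH:MM:SS."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


class SchedulerAdapter(ABC):
    """Hypothetical common interface: one method, many batch systems."""

    @abstractmethod
    def submit_command(self, script: str, cores: int, walltime_s: int) -> list:
        """Return the argv that would submit `script` as a batch job."""


class SlurmAdapter(SchedulerAdapter):
    def submit_command(self, script, cores, walltime_s):
        return ["sbatch", f"--ntasks={cores}",
                f"--time={hhmmss(walltime_s)}", script]


class TorqueAdapter(SchedulerAdapter):
    def submit_command(self, script, cores, walltime_s):
        # Assumption: the pilot fits on one node, so request nodes=1:ppn=N.
        return ["qsub", "-l", f"nodes=1:ppn={cores}",
                "-l", f"walltime={hhmmss(walltime_s)}", script]


ADAPTERS = {"slurm": SlurmAdapter(), "torque": TorqueAdapter()}


def pilot_command(scheduler: str, script: str, cores: int, walltime_s: int) -> list:
    """Caller-facing entry point: same arguments whatever the scheduler."""
    return ADAPTERS[scheduler].submit_command(script, cores, walltime_s)


slurm_argv = pilot_command("slurm", "pilot.sh", 16, 3600)
torque_argv = pilot_command("torque", "pilot.sh", 16, 3600)
```

The manager only ever calls `pilot_command`; supporting another batch system means adding one adapter class.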
How it works
A skeleton defines the tasks, each with a core count and expected duration.
Bundles supply each target machine's properties — what's available, how big, how busy.
The manager computes how many pilots to launch, and on which machines, from the task demand and the resource information.
Pilots are submitted (through the unified access API) to the chosen machines as resource placeholders.
Real tasks flow into the running pilots and execute — no per-task queue wait.
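The planning step in the middle of that flow can be sketched as a small heuristic. This is an illustration under stated assumptions, not the manager's actual algorithm, which is not spelled out here. `Resource`, `plan_pilots`, `max_pilot_cores`, `est_queue_wait`, and `target_makespan` are hypothetical names. The heuristic sizes the total core count so the work could finish within the target time if packing were perfect, then fills that demand from the machines with the shortest estimated queue wait.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    cores: int
    duration: float  # expected runtime in seconds


@dataclass(frozen=True)
class Resource:
    name: str
    max_pilot_cores: int   # largest pilot we assume we may request there
    est_queue_wait: float  # seconds; hypothetical value from resource info


def plan_pilots(tasks, resources, target_makespan):
    """Return {resource name: pilot size in cores}, one pilot per machine.

    Hypothetical heuristic: cover the aggregate core demand using the
    machines with the shortest estimated queue wait first.
    """
    core_seconds = sum(t.cores * t.duration for t in tasks)
    widest = max(t.cores for t in tasks)
    # Cores needed to finish within the target, assuming perfect packing.
    needed = max(widest, math.ceil(core_seconds / target_makespan))

    plan = {}
    for res in sorted(resources, key=lambda r: r.est_queue_wait):
        if needed <= 0:
            break
        if res.max_pilot_cores < widest:
            continue  # this machine could not host the widest task
        # Every pilot is at least as wide as the widest task.
        grant = max(widest, min(needed, res.max_pilot_cores))
        plan[res.name] = grant
        needed -= grant

    if needed > 0:
        raise RuntimeError("not enough capacity on the candidate machines")
    return plan


tasks = [Task(cores=2, duration=100.0)] * 10
machines = [
    Resource("A", max_pilot_cores=16, est_queue_wait=300.0),
    Resource("B", max_pilot_cores=32, est_queue_wait=60.0),
    Resource("C", max_pilot_cores=1, est_queue_wait=10.0),
]
relaxed = plan_pilots(tasks, machines, target_makespan=100.0)
tight = plan_pilots(tasks, machines, target_makespan=50.0)
```

With a relaxed target, one pilot on the least busy suitable machine is enough. Tightening the target pushes the plan onto a second machine.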
Reflection