From-Scratch Build · 07 · Distributed Computing

HPC Execution Manager

You have a workflow of tasks and a fistful of supercomputer allocations. How many "pilots" do you launch, and where? This Execution Manager answers that automatically — a lens into how serious distributed computing schedules work across many machines at once.

Pilot framework · Pilot jobs · Workflows · Supercomputers · Proof-of-concept

What it is

A broker between work and machines

EManager ("Execution Manager") takes a workflow description and runs it. You hand it a set of tasks — each needing so many cores for so long — and it executes them using a pilot framework across a dynamically chosen set of HPC resources. Crucially, it decides how many pilots to launch based on the work's size and what it learns about each target machine.

It's a proof-of-concept, deliberately small, and that's exactly why it's worth building: it shows the moving parts of distributed execution stripped down to essentials. It also pairs naturally with my Slack Command Bot build — two ends of the same "long-running coordinator" idea.

The core idea I wanted to learn: the pilot job abstraction. Instead of submitting each task to a cluster queue, you queue a few big placeholder jobs ("pilots"), then stream your real tasks into them once they start. The pilots wait in the batch queue once; the tasks never wait there individually, and they can be packed tightly onto the cores each pilot already holds.
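To make the pilot idea concrete, here is a minimal sketch of one pilot working through a batch of tasks. It is a toy model, not the framework's code: the function name `finish_time_in_pilot`, the fixed 30-minute queue wait, and the restriction to single-core tasks are all assumptions made for illustration.

```python
import heapq
from typing import List


def finish_time_in_pilot(task_minutes: List[int], pilot_cores: int, queue_wait: int) -> int:
    """Minute at which the last single-core task finishes inside one pilot."""
    # One heap entry per core: the minute at which that core is next free.
    # Every core becomes usable the moment the pilot leaves the batch queue.
    free_at = [queue_wait] * pilot_cores
    heapq.heapify(free_at)
    finish = queue_wait
    for minutes in task_minutes:
        start = heapq.heappop(free_at)   # earliest-free core takes the next task
        end = start + minutes
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


task_minutes = [10] * 100      # 100 single-core tasks, 10 minutes each
QUEUE_WAIT = 30                # assumed wait per batch submission, in minutes

batch_submissions_without_pilots = len(task_minutes)   # one queue entry per task
batch_submissions_with_one_pilot = 1                    # one queue entry in total

print(batch_submissions_without_pilots, batch_submissions_with_one_pilot)  # 100 1
print(finish_time_in_pilot(task_minutes, pilot_cores=8, queue_wait=QUEUE_WAIT))  # 160
```

The point of the toy is the submission count: 100 queue entries collapse into one, and the scheduling of individual tasks moves from the cluster's batch system into the pilot itself.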

The stack

The distributed toolkit

pilots

Pilot framework

The pilot-job engine. Acquires resources as placeholders, then schedules your tasks onto them.

workflow

Workflow skeleton

A synthetic workflow descriptor — lets you generate realistic task graphs to test the manager.

info system

Resource bundles

Knowledge about resource properties, so the manager can make informed placement decisions.

access layer

Unified access API

A single uniform API over heterogeneous HPC schedulers — one interface, many supercomputers.

coordination

Remote objects

Python remote objects — how the distributed components call each other across the network.

targets

Supercomputers

Allocations on large shared HPC systems — the machines the tasks actually ran on.
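Two of these pieces are easiest to picture as data shapes: a resource bundle is a record of what the manager knows about a machine, and the access layer is one `submit` call that every backend implements. The sketch below is my own illustration of that shape, not the API of the libraries involved; `ResourceBundle`, `PilotRequest`, `AccessAdapter`, `InMemoryAdapter`, `launch`, and the machine names are all hypothetical, and the adapter only records requests instead of talking to a real scheduler.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class ResourceBundle:
    """What the manager knows about one machine (hypothetical fields)."""
    name: str
    total_cores: int
    free_cores: int
    est_queue_wait_min: float


@dataclass(frozen=True)
class PilotRequest:
    """A placeholder job: how many cores, for how long."""
    cores: int
    walltime_min: int


class AccessAdapter(Protocol):
    """The single interface every scheduler backend has to satisfy."""
    def submit(self, request: PilotRequest) -> str: ...


class InMemoryAdapter:
    """Stand-in backend: records requests instead of submitting them."""

    def __init__(self, id_prefix: str) -> None:
        self.id_prefix = id_prefix
        self.submitted: List[PilotRequest] = []

    def submit(self, request: PilotRequest) -> str:
        self.submitted.append(request)
        return f"{self.id_prefix}-{len(self.submitted)}"


def launch(adapters: Dict[str, AccessAdapter], bundle: ResourceBundle,
           request: PilotRequest) -> str:
    """Submit a pilot to the machine a bundle describes, whatever its scheduler."""
    if request.cores > bundle.total_cores:
        raise ValueError(f"{bundle.name} has only {bundle.total_cores} cores")
    return adapters[bundle.name].submit(request)


# Hypothetical machines, each behind a different (fake) backend.
adapters: Dict[str, AccessAdapter] = {
    "alpha": InMemoryAdapter("A"),
    "beta": InMemoryAdapter("B"),
}
alpha = ResourceBundle("alpha", total_cores=128, free_cores=64, est_queue_wait_min=20.0)
beta = ResourceBundle("beta", total_cores=256, free_cores=32, est_queue_wait_min=5.0)

print(launch(adapters, alpha, PilotRequest(cores=64, walltime_min=120)))  # A-1
print(launch(adapters, beta, PilotRequest(cores=32, walltime_min=120)))   # B-1
```

The calling code never branches on which scheduler sits behind a machine; that knowledge lives entirely inside each adapter.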

How it works

From workflow to placement

  1. Describe the workflow

    A skeleton defines the tasks, each with a core count and expected duration.

  2. Read the resources

    Bundles supply each target machine's properties — what's available, how big, how busy.

  3. Size the pilots

    The manager computes how many pilots to launch, and on which machines, from the task demand and the resource information.

  4. Launch the pilots

    Pilots are submitted (through the unified access API) to the chosen machines as resource placeholders.

  5. Stream & execute tasks

    Real tasks flow into the running pilots and execute — no per-task queue wait.
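The five steps above can be sketched end to end in a few dozen lines. This is a simplified model under stated assumptions, not the manager's actual sizing rule: the names (`make_skeleton`, `size_pilots`, `stream_tasks`, `Machine`) are hypothetical, demand is measured in core-minutes, machines are preferred by shortest estimated queue wait, and launching is reduced to returning a plan rather than submitting anything.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Task:
    cores: int
    minutes: int


@dataclass(frozen=True)
class Machine:
    name: str
    free_cores: int
    est_queue_wait_min: float


def make_skeleton(n_tasks: int, seed: int = 0) -> List[Task]:
    """Step 1: a synthetic workflow, each task with a core count and duration."""
    rng = random.Random(seed)
    return [Task(rng.choice([1, 2, 4]), rng.randint(5, 30)) for _ in range(n_tasks)]


def size_pilots(tasks: List[Task], machines: List[Machine],
                walltime_min: int) -> Dict[str, int]:
    """Steps 2-4: decide pilot size per machine. Returns {machine name: cores}."""
    demand = sum(t.cores * t.minutes for t in tasks)      # total core-minutes
    widest = max(t.cores for t in tasks)                   # largest single task
    # Lower bound on cores needed to fit the demand into the walltime.
    cores_needed = max(widest, math.ceil(demand / walltime_min))
    plan: Dict[str, int] = {}
    for m in sorted(machines, key=lambda m: m.est_queue_wait_min):
        if cores_needed <= 0:
            break
        # Never ask for a pilot too small to hold the widest task.
        grant = min(max(cores_needed, widest), m.free_cores)
        if grant >= widest:
            plan[m.name] = grant
            cores_needed -= grant
    if not plan:
        raise ValueError("no machine can hold the widest task")
    return plan


def stream_tasks(tasks: List[Task], plan: Dict[str, int]) -> Dict[str, List[Task]]:
    """Step 5: hand each task to the pilot with the least work per core so far."""
    load = {name: 0.0 for name in plan}
    placement: Dict[str, List[Task]] = {name: [] for name in plan}
    for task in sorted(tasks, key=lambda t: t.cores * t.minutes, reverse=True):
        target = min(plan, key=lambda name: load[name])
        placement[target].append(task)
        load[target] += task.cores * task.minutes / plan[target]
    return placement


tasks = [Task(cores=4, minutes=30)] * 8                    # 960 core-minutes
machines = [Machine("alpha", 64, 20.0), Machine("beta", 10, 5.0)]
plan = size_pilots(tasks, machines, walltime_min=60)
print(plan)                                                # {'beta': 10, 'alpha': 6}
placement = stream_tasks(tasks, plan)
print({name: len(ts) for name, ts in placement.items()})
```

The core-minute estimate is only a lower bound: it ignores packing losses and queue wait, so a real sizing rule would want some headroom on top of it.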

Reflection

What rebuilding it taught me