HPC Execution Manager — Built From Scratch

What it is

A broker between work and machines

EManager ("Execution Manager") takes a workflow description and runs it. You hand it a set of tasks — each needing so many cores for so long — and it executes them using a pilot framework across a dynamically chosen set of HPC resources. Crucially, it decides how many pilots to launch based on the work's size and what it learns about each target machine.

It's a proof-of-concept, deliberately small, and that's exactly why it's worth building: it shows the moving parts of distributed execution stripped down to essentials. It also pairs naturally with my Slack Command Bot build — two ends of the same "long-running coordinator" idea.

The core idea I wanted to learn: the pilot job abstraction. Instead of submitting each task to a cluster queue, you queue a few big placeholder jobs ("pilots"), then stream your real tasks into them — dodging the queue and packing resources efficiently.
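To make the pilot idea concrete, here is a minimal toy model, not the framework's actual scheduler. It streams tasks greedily into one pilot of fixed size and reports when the last task finishes. The names `Task` and `run_in_pilot` are hypothetical and exist only for this illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    cores: int
    minutes: int


def run_in_pilot(tasks, pilot_cores):
    """Stream tasks, in order, into one pilot; return the finish time in minutes.

    Toy model: a task starts as soon as enough cores are free inside the pilot.
    """
    free = pilot_cores
    clock = 0
    running = []  # min-heap of (finish_time, cores_held)
    for task in tasks:
        if task.cores > pilot_cores:
            raise ValueError(f"{task.name} does not fit in the pilot")
        # Advance the clock until enough cores have been released.
        while free < task.cores:
            finish, cores = heapq.heappop(running)
            clock = max(clock, finish)
            free += cores
        heapq.heappush(running, (clock + task.minutes, task.cores))
        free -= task.cores
    return max([clock] + [finish for finish, _ in running])


tasks = [Task(f"t{i}", cores=2, minutes=10) for i in range(4)]
makespan = run_in_pilot(tasks, pilot_cores=4)

# Submitting each task directly costs one queue wait per task;
# the pilot costs one queue wait in total.
queue_waits_direct = len(tasks)
queue_waits_pilot = 1
```

With four 2-core tasks and a 4-core pilot, two tasks run at a time, so the work finishes in two waves after a single trip through the queue.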

The stack

The distributed toolkit

pilots

Pilot framework

The pilot-job engine. Acquires resources as placeholders, then schedules your tasks onto them.

workflow

Workflow skeleton

A synthetic workflow descriptor — lets you generate realistic task graphs to test the manager.

info system

Resource bundles

Knowledge about resource properties, so the manager can make informed placement decisions.

access layer

Unified access API

A single uniform API over heterogeneous HPC schedulers — one interface, many supercomputers.

coordination

Remote objects

Python remote objects — how the distributed components call each other across the network.

targets

Supercomputers

Allocations on large shared HPC systems — the machines the tasks actually ran on.

How it works

From workflow to placement

Describe the workflow
A skeleton defines the tasks, each with a core count and expected duration.
Read the resources
Bundles supply each target machine's properties — what's available, how big, how busy.
Size the pilots
The manager computes how many pilots to launch, and on which machines, from the task demand and the resource information.
Launch the pilots
Pilots are submitted (through the unified access API) to the chosen machines as resource placeholders.
Stream & execute tasks
Real tasks flow into the running pilots and execute — no per-task queue wait.
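The first three steps above can be sketched in a few lines. This is one plausible sizing heuristic under stated assumptions, not the manager's actual algorithm: total demand in core-minutes is divided by a target completion time to get the cores needed, and the least busy machines are filled first. `Bundle`, its fields, `make_skeleton`, and `size_pilots` are hypothetical names invented for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    cores: int
    minutes: int


@dataclass(frozen=True)
class Bundle:
    """Assumed resource properties: how big, what's available, how busy."""
    name: str
    pilot_cores: int   # cores in one pilot on this machine
    max_pilots: int    # pilots we are allowed to queue there
    queue_load: float  # 0.0 (idle) to 1.0 (saturated)


def make_skeleton(n_tasks, cores, minutes):
    """Synthetic workflow: n identical, independent tasks."""
    return [Task(cores, minutes) for _ in range(n_tasks)]


def size_pilots(tasks, bundles, target_minutes):
    """Return {machine name: number of pilots} covering the task demand."""
    demand = sum(t.cores * t.minutes for t in tasks)  # core-minutes
    widest = max(t.cores for t in tasks)
    cores_needed = max(widest, math.ceil(demand / target_minutes))
    plan = {}
    for bundle in sorted(bundles, key=lambda b: b.queue_load):
        if cores_needed <= 0:
            break
        if bundle.pilot_cores < widest:
            continue  # no pilot here could hold the widest task
        n = min(bundle.max_pilots, math.ceil(cores_needed / bundle.pilot_cores))
        if n > 0:
            plan[bundle.name] = n
            cores_needed -= n * bundle.pilot_cores
    if cores_needed > 0:
        raise RuntimeError("not enough capacity across the given machines")
    return plan


workflow = make_skeleton(n_tasks=64, cores=4, minutes=30)
bundles = [
    Bundle("beta", pilot_cores=32, max_pilots=4, queue_load=0.7),
    Bundle("alpha", pilot_cores=48, max_pilots=2, queue_load=0.2),
]
plan = size_pilots(workflow, bundles, target_minutes=60)
```

Here the workflow needs 7680 core-minutes, so finishing in about an hour takes 128 cores: two 48-core pilots on the quieter machine and one 32-core pilot on the busier one. Nothing in the plan is hardcoded; change the task list or the bundle data and the pilot count changes with it.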

Reflection

What rebuilding it taught me

Pilot jobs are a scheduling cheat-code. Queue a placeholder once, then run many tasks inside it. It reframes how I think about cluster queues entirely.
Abstraction over schedulers is half the battle. A unified access API exists because every supercomputer has its own quirks; uniformity is what makes "many machines" tractable.
"Dynamic" means data-driven. The number of pilots isn't hardcoded — it's derived from task demand and live resource info. That's the actual intelligence.
A proof-of-concept is honest about its limits. This build stays a non-portable demo on purpose — a healthy reminder that demos and products are different things.
