Robot Manipulator Datasets — Built From Scratch

What it is

Data from a robot that copies itself

The subject is a small six-axis robot manipulator — an arm with six joints — paired with a digital twin, a live software copy of that arm running on a server. As the real arm moves, the twin mirrors it, and the system records what's happening: how loaded the machines are, how the network behaves, how long each instance takes to respond. Those recordings are the datasets.

I built this to learn the unglamorous but essential half of robotics and machine learning: the data. Not the model, not the robot, but the columns in a CSV file, the sampling intervals, the labels, and the choices behind them, which determine whether anything you train later is any good.

The core idea I wanted to learn: a dataset is a designed artefact, not a byproduct. Choosing what to sample, how often, and how to label it is the real work — and it's what turns raw robot telemetry into something a classifier can actually learn from.

The stack

What the data is made of

Two datasets, captured two different ways. Here is what each piece actually is.

subject

Six-axis arm

A compact desktop manipulator with six rotational joints. Its geometry is described by a robot model file, so the twin and any viewer know exactly how the arm is shaped.

source

Digital twin

A software replica of the arm running on edge infrastructure. The datasets are telemetry from this twin and the system hosting it, not raw video of the hardware.

dataset A

Scalability traces

The system is pushed harder over time by adding a new virtual robot instance at a fixed interval. The data records how resource use and timing respond as load climbs.

dataset B

Motion traces

The arm performs four distinct movements, each repeated twenty times. One CSV per movement captures the resulting time-series — clean, repeatable, comparable.

format

CSV time-series

Everything is plain comma-separated values: rows over time, columns of measurements. Portable, diff-able, and openable in anything from a spreadsheet to pandas.

labels

Labelled variant

The big scalability set (a new instance every 3600 seconds) also ships in a labelled version, with each row tagged so it can directly train a supervised classifier such as a random forest.
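As a minimal sketch of what "rows over time, columns of measurements" looks like in practice, the snippet below parses a tiny inline CSV with Python's standard csv module. The column names (timestamp_s, cpu_percent, latency_ms, label) and the values are made up for illustration; the real files may use a different schema.

```python
import csv
import io

# Hypothetical sample: these column names and values are illustrative
# assumptions, not the real dataset's schema.
SAMPLE_CSV = """timestamp_s,cpu_percent,latency_ms,label
0,12.5,4.1,normal
60,18.0,4.6,normal
120,71.3,19.8,overloaded
"""

def load_trace(text):
    """Parse CSV text into a list of dict rows, converting numeric cells to float."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {}
        for name, value in raw.items():
            try:
                row[name] = float(value)
            except ValueError:
                row[name] = value  # keep non-numeric cells (e.g. labels) as text
        rows.append(row)
    return rows

trace = load_trace(SAMPLE_CSV)
print(len(trace), "rows; columns:", list(trace[0]))
```

The same loader works for a labelled or unlabelled file, since the label column simply stays as text.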

Architecture

How the data is organised

The two datasets are kept separate because they answer different questions. The scalability set comes in three resolutions; the motion set is split by movement.

Micro sampling
Scalability dataset where a new robot instance is added every 60 seconds: the fine-grained view, lots of detail over a short window.
Small sampling
Same experiment, a new instance every 300 seconds: a middle resolution that trades detail for a longer, calmer trace.
Big sampling
A new instance every 3600 seconds: the coarse, long-horizon view of how the system scales over hours.
Labelled big set
The big dataset with per-row labels added, ready to train and evaluate a supervised classifier.
Motion CSVs
Four files, one per movement, each capturing the arm repeating that motion twenty times. They ship alongside a reference clip and the robot model.
Generation script
A small Python script that produced the scalability traces. It is the reproducible recipe behind the numbers, not a hand-edited file.
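The three resolutions differ only in the ramp interval, so the load at any moment can be worked out directly. Here is a small sketch, assuming each run starts with one instance and adds one per full interval. The starting count and the function names are my assumptions, not something taken from the generation script.

```python
# Interval between new robot instances for each resolution, in seconds
# (values taken from the descriptions above).
RAMP_INTERVALS_S = {"micro": 60, "small": 300, "big": 3600}

def instances_at(elapsed_s, interval_s, initial=1):
    """Number of instances running after elapsed_s seconds.

    Assumes the run starts with `initial` instances and adds one at every
    full interval; the real starting count may differ.
    """
    return initial + int(elapsed_s // interval_s)

def time_to_reach(target, interval_s, initial=1):
    """Seconds until `target` instances are running, under the same assumption."""
    return max(0, target - initial) * interval_s

for name, interval in RAMP_INTERVALS_S.items():
    print(name, instances_at(3600, interval), time_to_reach(10, interval))
```

Under that assumption, one hour into a run the micro trace is already at 61 instances while the big trace has only just reached 2, which is why the same experiment reads so differently at each resolution.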

How it's used

From CSV to a trained model

A dataset is only as useful as what you can do with it. These two were shaped with concrete uses in mind:

Scaling decisions: the scalability traces show how a digital-twin service behaves as demand grows, which is exactly what you need to decide when to add or remove capacity.
Supervised classification: the labelled big set drops straight into a training pipeline — split it, fit a random-forest classifier, and predict the system's state from its telemetry.
Motion comparison: four repeated movements give a clean baseline for telling actions apart, or for measuring how consistently the twin reproduces the real arm.
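The "split it, fit a random forest, predict" path can be sketched with scikit-learn. This is not the project's actual training code: the data below is synthetic, and the two features (CPU load, response time) and the binary "overloaded" state are hypothetical stand-ins for whatever the labelled big set really contains.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the labelled telemetry: two hypothetical features
# (CPU load in percent, response time in ms) and a made-up binary state.
n = 400
cpu = rng.uniform(0, 100, n)
latency = 5 + 0.3 * cpu + rng.normal(0, 2, n)
X = np.column_stack([cpu, latency])
y = (cpu > 70).astype(int)  # 1 = "overloaded" in this toy labelling

# Hold out a quarter of the rows, keeping the class balance the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.2f}")
```

One design caveat: a random split ignores the order of rows. For a time-series trace, holding out the last part of the run instead may give a more honest estimate, because neighbouring rows tend to look alike.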

In my rebuild I focused on the data-handling path: load the CSVs, understand each column, line the three sampling resolutions up against each other, and confirm the labelled set really is ready to train on without further cleaning.
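The "ready to train on without further cleaning" check can be made concrete. Below is a minimal sketch using only the standard library: it flags empty cells, non-numeric or non-finite feature values, and rows with more cells than the header. The function name check_ready, the label column name and the sample data are all hypothetical.

```python
import csv
import io
import math

def check_ready(text, label_column):
    """Return a list of problems that would block training; empty means clean."""
    reader = csv.DictReader(io.StringIO(text))
    if label_column not in (reader.fieldnames or []):
        return [f"missing label column {label_column!r}"]
    problems = []
    for row in reader:
        line_no = reader.line_num  # line of the file this row was read from
        for name, value in row.items():
            if name is None:
                # DictReader stores surplus cells under the key None.
                problems.append(f"line {line_no}: more cells than header columns")
            elif value is None or value.strip() == "":
                problems.append(f"line {line_no}: empty {name}")
            elif name != label_column:
                try:
                    if not math.isfinite(float(value)):
                        problems.append(f"line {line_no}: non-finite {name}")
                except ValueError:
                    problems.append(f"line {line_no}: non-numeric {name}")
    return problems

# Made-up samples, one clean and one deliberately broken.
CLEAN = "t_s,cpu_percent,label\n0,12.5,normal\n60,80.1,overloaded\n"
DIRTY = "t_s,cpu_percent,label\n0,,normal\n60,nan,overloaded\n120,high,\n"

print(check_ready(CLEAN, "label"))
for problem in check_ready(DIRTY, "label"):
    print(problem)
```

An empty list is the signal I was looking for: every feature cell parses as a finite number and every row carries a label.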

Reflection

What rebuilding it taught me

Sampling interval is a design knob. Micro, small and big aren't three datasets — they're the same experiment seen at three zoom levels, and the choice changes what you can learn from it.
Labels are the expensive part. The raw trace is easy to capture; deciding what each row means and tagging it is the work that makes supervised learning possible.
CSV is underrated. No exotic format, no binary blob — plain rows and columns travel everywhere and stay readable years later. Boring is a feature.
Reproducibility lives in the script. Shipping the generator alongside the data means the numbers aren't a mystery — anyone can see exactly how they were produced.
