From-Scratch Build · Robotics Data
Two datasets captured from a six-axis robot arm and its software twin — one watching the system scale up under load, one recording the arm repeating four motions. Rebuilt from scratch to understand how robotics data is collected, formatted and used.
What it is
The subject is a small six-axis robot manipulator — an arm with six joints — paired with a digital twin, a live software copy of that arm running on a server. As the real arm moves, the twin mirrors it, and the system records what's happening: how loaded the machines are, how the network behaves, how long each instance takes to respond. Those recordings are the datasets.
I built this to learn the unglamorous but essential half of robotics and machine learning: the data. Not the model, not the robot, but the columns in a CSV file, the sampling intervals, the labels, and the design choices behind them that determine whether anything you train later is any good.
The core idea I wanted to learn: a dataset is a designed artefact, not a byproduct. Choosing what to sample, how often, and how to label it is the real work — and it's what turns raw robot telemetry into something a classifier can actually learn from.
The stack
Two datasets, captured two different ways. Here is what each piece actually is.
A compact desktop manipulator with six rotational joints. Its geometry is described by a robot model file, so the twin and any viewer know exactly how the arm is shaped.
A software replica of the arm running on edge infrastructure. The datasets are telemetry from this twin and the system hosting it, not raw video of the hardware.
The system is pushed harder over time by adding a new virtual robot instance at a fixed interval. The data records how resource use and timing respond as load climbs.
The arm performs four distinct movements, each repeated twenty times. One CSV per movement captures the resulting time-series — clean, repeatable, comparable.
Everything is plain comma-separated values: rows over time, columns of measurements. Portable, diff-able, and openable in anything from a spreadsheet to pandas.
The largest scalability file ships in a labelled version, with each row tagged so it can directly train a supervised classifier such as a random forest.
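To make the "plain CSV plus labels" idea concrete, here is a minimal sketch using only Python's standard library. The column names (`timestamp_s`, `cpu_percent`, `latency_ms`, `label`) and the sample rows are hypothetical stand-ins, since the real headers are not listed here. The helper `load_rows` is likewise illustrative and not part of the original tooling.

```python
import csv
import io
from collections import Counter

# Hypothetical sample: the real column names and values are assumptions.
SAMPLE_CSV = """timestamp_s,cpu_percent,latency_ms,label
0,12.5,4.1,normal
60,31.0,6.8,normal
120,78.4,19.3,overloaded
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, converting numeric cells to float."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {}
        for key, value in raw.items():
            try:
                row[key] = float(value)
            except ValueError:
                row[key] = value  # keep non-numeric cells (e.g. the label) as text
        rows.append(row)
    return rows


rows = load_rows(SAMPLE_CSV)

# Class balance is worth checking before training any supervised classifier.
label_counts = Counter(row["label"] for row in rows)
print(label_counts)
```

Counting labels first is a cheap way to spot a heavily skewed class balance, which would affect how a classifier such as a random forest should be trained and evaluated.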
Architecture
The two datasets are kept separate because they answer different questions. The scalability set comes in three resolutions; the motion set is split by movement.
Scalability dataset where a new robot instance is added every 60 seconds — the fine-grained view, lots of detail over a short window.
Same experiment, a new instance every 300 seconds — a middle resolution that trades detail for a longer, calmer trace.
A new instance every 3600 seconds — the coarse, long-horizon view of how the system scales over hours.
The largest scalability file with per-row labels added, ready to train and evaluate a supervised classifier.
Four files, one per movement, each capturing the arm repeating that motion twenty times, shipped alongside a reference clip and the robot model.
A small Python script that produced the scalability traces — the reproducible recipe behind the numbers, not a hand-edited file.
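The three scalability files differ only in how often a new instance is added, so the load at any moment can be sketched with simple arithmetic. The snippet below is an illustration, not the original script: it assumes one instance is running at time zero and one more is added at every full interval, and the keys `fine`, `medium`, and `coarse` are hypothetical labels for the 60, 300, and 3600 second files.

```python
# Seconds between new instances for the three scalability files.
# The dictionary keys are hypothetical names, not the real filenames.
INTERVALS_S = {"fine": 60, "medium": 300, "coarse": 3600}


def instances_at(elapsed_s, interval_s, initial=1):
    """Number of running instances after elapsed_s seconds.

    Assumes `initial` instances at t=0 and one more at every full interval.
    """
    if elapsed_s < 0:
        raise ValueError("elapsed_s must be non-negative")
    return initial + int(elapsed_s // interval_s)


def time_to_reach(target, interval_s, initial=1):
    """Seconds until `target` instances are running, under the same assumption."""
    return max(0, target - initial) * interval_s


for name, interval in INTERVALS_S.items():
    print(name, instances_at(3600, interval), time_to_reach(10, interval))
```

Under these assumptions, one hour in, the 60-second trace is at 61 instances, the 300-second trace at 13, and the 3600-second trace at 2. That is why the same wall-clock timestamp means very different load levels across the three files.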
How it's used
A dataset is only as useful as what you can do with it. These two were shaped with concrete uses in mind.
In my rebuild I focused on the data-handling path: load the CSVs, understand each column, line the three sampling resolutions up against each other, and confirm the labelled set really is ready to train on without further cleaning.
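One way to check the "ready to train on without further cleaning" claim is to scan for missing labels and unusable numeric cells. The sketch below is one possible check, not the procedure used in the rebuild. The function `find_problem_rows`, the column names, and the sample rows are all hypothetical.

```python
import math


def find_problem_rows(rows, numeric_cols, label_col):
    """Return (row_index, reason) pairs for rows that would need cleaning."""
    problems = []
    for i, row in enumerate(rows):
        label = row.get(label_col)
        if label is None or str(label).strip() == "":
            problems.append((i, "missing label"))
        for col in numeric_cols:
            try:
                number = float(row.get(col))
            except (TypeError, ValueError):
                problems.append((i, f"non-numeric {col}"))
                continue
            if math.isnan(number) or math.isinf(number):
                problems.append((i, f"non-finite {col}"))
    return problems


# Hypothetical rows, shaped like the output of csv.DictReader.
sample = [
    {"cpu_percent": "12.5", "latency_ms": "4.1", "label": "normal"},
    {"cpu_percent": "", "latency_ms": "6.8", "label": "normal"},
    {"cpu_percent": "78.4", "latency_ms": "nan", "label": ""},
]

problems = find_problem_rows(sample, ["cpu_percent", "latency_ms"], "label")
for index, reason in problems:
    print(index, reason)
```

An empty result from a check like this is a reasonable minimum bar before calling a labelled file training-ready. It says nothing about whether the labels themselves are correct.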
Reflection