Marker Tracking Lab — Built From Scratch

What it is

Where is that thing, exactly?

The lab's job is pose estimation: given a video frame, find each printed marker and report not just where it appears on screen but its full 3D pose relative to the camera, meaning position in metres plus orientation. Those poses are published as transforms so they can be drawn in a 3D viewer and consumed by any other node on the robotics bus.

The second half is the part I most wanted to build: frame alignment. A camera sees the world from its own viewpoint; a motion-capture rig has its own origin. To make them cooperate, the camera frame has to be expressed in the tracker's coordinates. Once aligned, a marker detected by the camera and a robot tracked by motion capture live in the same map.

The core idea I wanted to learn: a fiducial marker is a cheap, reliable anchor. Because its size and pattern are known, one camera image is enough to recover a full 3D pose — and once you can express that pose in a shared world frame, separate sensing systems suddenly speak the same language.

The stack

Tools under the hood

This rebuild sits at the meeting point of optics, computer vision and robotics plumbing. Here is what each piece does.

capture

Industrial camera

A machine-vision camera driver delivers a steady, high-resolution image stream into the pipeline.

vision

OpenCV ArUco

Detects fiducial markers and, with the camera's intrinsics, solves each marker's 3D pose from a single frame.
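
To see why a known marker size is what gives a single frame its metric scale, here is a minimal sketch of the pinhole relationship for a marker that faces the camera squarely. It is not how OpenCV solves the full pose (that uses all four corners and recovers orientation too); the function name and the numbers are illustrative assumptions, not values from the lab.

```python
def distance_from_apparent_size(focal_px: float, marker_side_m: float, side_px: float) -> float:
    """Approximate camera-to-marker distance in metres.

    Pinhole model, marker assumed fronto-parallel: Z = f * S / s, where
    f is the focal length in pixels, S the printed side length in metres
    and s the side length measured in the image in pixels.
    """
    if side_px <= 0:
        raise ValueError("side_px must be positive")
    return focal_px * marker_side_m / side_px


# Hypothetical numbers: a 10 cm marker that appears 50 px wide
# through a lens with a 1000 px focal length.
estimated_distance_m = distance_from_apparent_size(1000.0, 0.10, 50.0)
print(f"{estimated_distance_m:.2f} m")
```

The same marker printed at the wrong size would shift every distance by the same ratio, which is why the physical side length has to be measured rather than assumed.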

calibration

Camera calibration

The camera matrix and distortion coefficients that turn raw pixels into accurate, undistorted measurements.
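
As a rough illustration of what those two ingredients do, the sketch below projects a 3D point in the camera frame to a pixel using a camera matrix (fx, fy, cx, cy) and two radial distortion terms (k1, k2). This is a simplified form of the usual pinhole-plus-radial model; tangential terms are left out, and every number is a made-up assumption rather than the lab's calibration.

```python
def project_point(point_cam, fx, fy, cx, cy, k1=0.0, k2=0.0):
    """Project a 3D point (metres, camera frame) to pixel coordinates.

    Radial distortion only: the normalised coordinates are scaled by
    1 + k1*r^2 + k2*r^4 before the camera matrix is applied.
    """
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    xn, yn = x / z, y / z              # normalised image coordinates
    r2 = xn * xn + yn * yn             # squared radius from the optical axis
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    u = fx * xn * scale + cx
    v = fy * yn * scale + cy
    return u, v


# Hypothetical intrinsics for a 1280x720 image.
ideal = project_point((0.1, 0.0, 1.0), fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
distorted = project_point((0.1, 0.0, 1.0), fx=1000.0, fy=1000.0, cx=640.0, cy=360.0, k1=-0.1)
print(ideal, distorted)
```

Even a mild negative k1 moves the pixel a fraction of a pixel toward the image centre here, and the shift grows quickly toward the image edges, which is where uncorrected distortion does the most damage to pose estimates.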

middleware

ROS + TF

Poses are published on the transform tree so every node shares one consistent picture of where things are.

ground truth

Motion capture

A tracker provides the authoritative world frame that the camera is aligned to.

visualisation

3D viewer

A robotics visualiser draws markers, frames and the camera live, so alignment errors are obvious at a glance.

Architecture

Three nodes, one map

The lab runs as a few independent ROS nodes you bring up in order. Keeping them separate means you can debug the camera without touching detection, or detection without the tracker.

Camera node (live): Starts the machine-vision camera and exposes its rectified image stream to the rest of the graph.
Calibration (live): Feeds the camera matrix and distortion coefficients in, so measurements are metrically accurate.
Marker detection (live): Finds allowed markers in the rectified feed and publishes their poses to the transform tree.
Tracker client (live): Brings in the motion-capture world frame and the poses it reports.
Frame alignment (live): A static transform that expresses the camera in the tracker's coordinates, fusing both worlds.
Web bridge (live): A bridge server so the live data can be reached from outside the robotics graph.
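
The frame-alignment step above boils down to chaining two transforms: tracker-to-camera (the static one) and camera-to-marker (from detection). Here is a minimal sketch of that chain using plain 4x4 homogeneous matrices. In the real system the transform tree does this for you; the helper names, the yaw-only rotation and all numbers are simplifying assumptions for illustration.

```python
import math


def make_transform(yaw_rad, tx, ty, tz):
    """4x4 homogeneous transform: rotation about Z by yaw, plus a translation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def compose(a, b):
    """Matrix product a @ b for 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(t):
    """Extract the (x, y, z) translation from a 4x4 transform."""
    return (t[0][3], t[1][3], t[2][3])


# Hypothetical static alignment: camera sits at (1.0, 2.0, 0.5) m in the
# tracker frame, rotated 90 degrees about the vertical axis.
tracker_T_camera = make_transform(math.pi / 2, 1.0, 2.0, 0.5)

# Hypothetical detection: marker at (0.3, 0.0, 1.0) m in the camera frame.
camera_T_marker = make_transform(0.0, 0.3, 0.0, 1.0)

# Chaining the two puts the marker in the tracker's coordinates.
tracker_T_marker = compose(tracker_T_camera, camera_T_marker)
print(translation(tracker_T_marker))
```

The order of the product matters: reading the names left to right, the inner "camera" frames cancel, which is a handy check that the chain is assembled correctly.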

How it runs

Calibrate, detect, align

Getting trustworthy poses is a discipline, not a one-liner. The order matters because every later step assumes the earlier ones are correct:

Calibrate on the rectified feed: the camera matrix must come from the same image the detector actually sees, or every pose carries a systematic error.
Restrict the markers: only marker IDs on an allow-list, at a known physical size, are accepted, which filters out most false detections.
Publish to TF: every pose goes onto the transform tree, so position is always expressed relative to a named frame.
Measure the offset: the camera-to-tracker transform is built from distances physically measured between the camera and the tracker's origin.
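
The "restrict the markers" step in the list above can be sketched as a simple filter applied to whatever the detector returns. The `Detection` record, the allowed IDs and the marker size are all hypothetical stand-ins; the actual detector output and configuration in the lab may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical record for one detected marker."""
    marker_id: int
    corners_px: tuple  # four (u, v) corner coordinates in pixels


# Assumed configuration: only these IDs were printed for the lab,
# all at the same physical side length.
ALLOWED_IDS = frozenset({3, 7, 12})
MARKER_SIDE_M = 0.10


def filter_detections(detections, allowed_ids=ALLOWED_IDS):
    """Keep only detections whose ID is on the allow-list, preserving order."""
    return [d for d in detections if d.marker_id in allowed_ids]


square = ((0, 0), (10, 0), (10, 10), (0, 10))
raw = [Detection(3, square), Detection(99, square), Detection(7, square)]
accepted = filter_detections(raw)
print([d.marker_id for d in accepted])
```

Rejecting unknown IDs before pose solving means a spurious pattern in the background never reaches the transform tree, and a single agreed side length keeps every accepted pose on the same metric scale.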

In my rebuild I treated calibration as the foundation — a beautiful detector on a badly calibrated camera just produces confident, wrong numbers.

Reflection

What rebuilding it taught me

Calibration is the whole ballgame. Pose accuracy lives or dies on the camera matrix and distortion model — the clever detection is the easy part.
Fiducials are an honest shortcut. A known pattern at a known size lets one frame yield a full 3D pose; no deep learning required.
Coordinate frames are a contract. The transform tree forces you to say relative to what, which is exactly the discipline that lets two sensing systems agree.
Alignment is measurement, not code. The hardest-won accuracy gains came from carefully measuring the real offset to the tracker's origin, not from the software.
