From-Scratch Build · Interactive Performance
A DJ setup where the hands do double duty: motion-capture gloves and the live audio both feed a projected fluid-simulation visual, so the same gestures that shape the music also paint the light. I built this from scratch to learn how gesture tracking, audio analysis and real-time graphics lock together.
What it is
This is a tangible interface for DJing built around one ambition: make the performer visible. A motion-capture system tracks marker gloves on the DJ's hands; a real-time audio analyser listens to the music; and a WebGL fluid simulation, projected onto the DJ's workspace, responds to both. Gestures and sound continuously reshape a living visual in full view of the audience.
It runs in four interaction modes — knobs change the music, music changes the visuals, gestures change the visuals, and two-handed gestures draw EQ curves back onto the music. I built it to understand how to fuse two real-time input streams, motion and audio, into a single coherent output.
The core idea I wanted to learn: expressive interfaces are about mapping, not sensing. Tracking a hand is easy; deciding how a hand's motion should bend a fluid simulation — and when the audio should override it — is the whole craft.
The stack
The point of this rebuild was the toolchain. Here is what each piece actually does in the system.
Motion capture: tracks marker gloves on the DJ's hands, reporting position and orientation so gestures become data.
ROS node: processes the capture stream and forwards hand coordinates to the visual engine over UDP.
Audio analyser: listens to the live music with low latency and extracts a signal used to trigger and modulate the visuals.
WebGL fluid simulation: a browser-based fluid simulation in JavaScript, the canvas the gestures and audio paint onto.
Input mapping: maps hand coordinates onto pointer positions in the simulation; an audio threshold decides when a "click" fires.
Projection: projects the finished visual onto the DJ's table, putting the performance in the same space as the performer.
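To make the hand-to-pointer mapping concrete, here is a minimal sketch of how a tracked hand position could be normalised into canvas pixels. The bounds, the axis convention and the names `CAPTURE_BOUNDS` and `handToPointer` are assumptions for illustration, not the values or code used in the build.

```javascript
// Hypothetical capture-volume bounds in metres; real values depend on the rig.
const CAPTURE_BOUNDS = { xMin: -0.5, xMax: 0.5, yMin: 0.0, yMax: 0.6 };

// Clamp a value into the range [lo, hi].
function clamp(value, lo, hi) {
  return Math.min(hi, Math.max(lo, value));
}

// Map a hand position in capture space to a pixel position on the canvas.
// Assumption: capture y grows away from the DJ while canvas y grows downward,
// so the vertical axis is flipped.
function handToPointer(hand, bounds, canvas) {
  const u = clamp((hand.x - bounds.xMin) / (bounds.xMax - bounds.xMin), 0, 1);
  const v = clamp((hand.y - bounds.yMin) / (bounds.yMax - bounds.yMin), 0, 1);
  return { x: u * canvas.width, y: (1 - v) * canvas.height };
}

// Example: a hand in the middle of the table lands in the middle of the canvas.
const centre = handToPointer({ x: 0, y: 0.3 }, CAPTURE_BOUNDS, { width: 800, height: 600 });
```

Clamping keeps the pointer on the canvas when a hand drifts outside the tracked area, which seems preferable to letting the visual jump or vanish mid-performance.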
Pipeline
Motion and audio travel separate paths and meet inside the visual engine, where they're fused into one image.
Motion capture reports the position of the gloved hands.
A ROS node streams the hand coordinates to the visual engine over UDP.
The audio analyser turns the live music into a triggering signal.
Hand coordinates become pointer positions; the audio signal decides when to click.
The fluid simulation reacts — following the hands and bursting on audio triggers.
The visual is projected onto the DJ's table for the audience to see.
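The audio-trigger step in this pipeline can be sketched as a level meter plus a gate. The version below is only one possible reading: it measures RMS level per frame and adds a second, lower threshold (hysteresis) so a sustained loud passage fires one click instead of many. The write-up itself only mentions a single threshold, so the hysteresis, the threshold values and the names `rms` and `createClickGate` are all assumptions.

```javascript
// Root-mean-square level of one frame of audio samples in [-1, 1].
function rms(samples) {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  return Math.sqrt(sumSquares / samples.length);
}

// Build a gate that fires once each time the level rises to `onLevel`,
// then re-arms only after the level drops below `offLevel`.
// Both thresholds are hypothetical tuning values, not the project's settings.
function createClickGate(onLevel = 0.3, offLevel = 0.2) {
  let armed = true;
  return function update(level) {
    if (armed && level >= onLevel) {
      armed = false;
      return true; // rising edge: fire one click
    }
    if (!armed && level < offLevel) armed = true; // quiet again: re-arm
    return false;
  };
}

// Example: feed successive frame levels through the gate.
const gate = createClickGate(0.3, 0.2);
const fired = [0.1, 0.35, 0.4, 0.25, 0.1, 0.5].map((level) => gate(level));
```

In a browser, the sample frames could plausibly come from the Web Audio `AnalyserNode.getFloatTimeDomainData` method, though the write-up does not say which analyser the build uses.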
Four modes
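What makes this more than a visualiser is that influence runs in both directions, with gesture and audio each touching both the music and the visuals:
Knobs change the music.
Music changes the visuals.
Gestures change the visuals.
Two-handed gestures draw EQ curves back onto the music.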
In my rebuild I focused first on the visual loop — mapping hands to pointers and gating clicks on the audio threshold — because getting that fusion right is what sells the whole performance.
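As a rough illustration of that visual loop, the sketch below fuses one frame of input: the latest pointer position and the current audio level go in, and a list of events for the simulation comes out. The event names, the state shape, the default threshold and the function name `fuseFrame` are hypothetical; the real simulation's input interface is not described here.

```javascript
// One step of the visual loop: turn the latest pointer position and audio
// level into events a fluid simulation might consume.
// Event names ("move", "click") and the default threshold are illustrative only.
function fuseFrame(state, pointer, audioLevel, threshold = 0.3) {
  const events = [];
  const previous = state.pointer;
  // Emit a move only when the hand actually changed position.
  if (!previous || previous.x !== pointer.x || previous.y !== pointer.y) {
    events.push({ type: "move", x: pointer.x, y: pointer.y });
  }
  // Fire a click only on the frame where the level first crosses the threshold.
  const above = audioLevel >= threshold;
  if (above && !state.above) {
    events.push({ type: "click", x: pointer.x, y: pointer.y });
  }
  // Return the new state instead of mutating the old one.
  return { state: { pointer, above }, events };
}

// Example: three consecutive frames.
let loop = fuseFrame({ pointer: null, above: false }, { x: 10, y: 20 }, 0.1);
const first = loop.events;  // hand appears: one move
loop = fuseFrame(loop.state, { x: 10, y: 20 }, 0.5);
const second = loop.events; // level crosses threshold: one click
loop = fuseFrame(loop.state, { x: 15, y: 20 }, 0.6);
const third = loop.events;  // still loud, hand moved: one move, no new click
```

Keeping the step a pure function of the previous state and the current inputs makes the fusion easy to test away from the capture rig and the projector.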
Reflection