From-Scratch Build · Interactive Robotics
A desk lamp that thinks.
A six-axis robot arm carries a tiny projector instead of a bulb. Point your finger at something on the table and it watches, understands what you want, and projects an answer back onto the surface. Rebuilt from scratch to learn how vision, language and motion stitch together.
What it is
Most desk robots are arms that grab things. This one is an arm that looks and shows. Its end effector isn't a gripper — it's a small projector, so the robot's job is to point a beam of useful information at exactly the right spot on the desk in front of you.
The interaction is deliberately physical. You don't type or click; you point. A camera watches the table, detects which object your finger is aimed at, and the system picks a mode — explain it, do the homework on it, generate a picture of it, or draw on the surface near it. The arm then orients the projector to display the result.
The core idea I wanted to learn: a robot becomes far more approachable when the interface is the room itself. No screen, no keyboard — just a pointing gesture and a projected reply. Building that meant wiring a perception pipeline straight into arm motion.
The stack
Each piece of this was new to me. Here is what each one actually does in the system.
The messaging bus. Vision nodes, the mode selector and the motion node all publish and subscribe to topics, so each stays a small independent program.
Computer vision. Locates the hand, follows the finger ray and decides which object on the table is being pointed at.
Motion planning. Turns "aim the projector here" into safe joint angles, and broadcasts markers so the arm and target stay in one shared coordinate frame.
The end effector. Instead of grabbing, the robot projects images, text and answers directly onto the desk surface.
The generative layer. Produces a picture or response from what was pointed at, ready to be projected back.
Custom ROS messages signal the active mode — think, do homework, generate image, draw, open links — so the right behaviour fires.
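To make the publish/subscribe idea concrete, here is a minimal sketch in plain Python. It is not ROS code: `Bus`, `ModeMsg`, the `target_label` field and the `/active_mode` topic name are all hypothetical stand-ins, chosen only to show how a mode message published by one node can trigger a behaviour in another without the two knowing about each other.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Callable, DefaultDict, List


class Mode(Enum):
    # The five modes named in the text; the string values are assumptions.
    THINK = "think"
    HOMEWORK = "do_homework"
    GENERATE_IMAGE = "generate_image"
    DRAW = "draw"
    OPEN_LINKS = "open_links"


@dataclass(frozen=True)
class ModeMsg:
    """Hypothetical stand-in for a custom ROS message carrying the active mode."""
    mode: Mode
    target_label: str  # which object was pointed at (assumed field)


class Bus:
    """Tiny in-process publish/subscribe bus, standing in for ROS topics."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[ModeMsg], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[ModeMsg], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, msg: ModeMsg) -> None:
        # Deliver the message to every callback registered on this topic.
        for callback in self._subscribers[topic]:
            callback(msg)


def make_dispatcher(log: List[str]) -> Callable[[ModeMsg], None]:
    """Return a callback that records which behaviour would fire."""
    def on_mode(msg: ModeMsg) -> None:
        log.append(f"{msg.mode.value}:{msg.target_label}")
    return on_mode
```

In a real ROS system the bus is provided by the middleware and each subscriber runs in its own process; the sketch only mirrors the decoupling, where the publisher never calls the behaviour directly.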
Architecture
The behaviour is a chain of small ROS nodes. Each does one job and hands off to the next over a topic.
A vision node reads the scene and decides which interaction mode the user is asking for.
Target selection. Follows the finger to work out which object on the table is the target.
Mode activation. Activates the chosen behaviour and gathers whatever input it needs.
Content generation. Generates or processes the visual that will be shown back to the user.
Frame placement. Places the target in the arm's coordinate frame so motion is aimed accurately.
Motion. Drives the arm to orient the projector and display the result on the desk.
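The "follow the finger" step in this chain can be sketched as a ray test against the table. The version below is a simplified illustration, not the project's actual detector: it assumes the fingertip position and pointing direction are already known in a frame where the table is the plane z = 0, and the function names, the object dictionary and the 10 cm tolerance are all hypothetical.

```python
import math
from typing import Dict, Optional, Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]


def ray_table_hit(origin: Vec3, direction: Vec3) -> Optional[Vec2]:
    """Intersect a finger ray with the table plane z = 0.

    Returns the (x, y) hit point, or None if the ray never reaches the table.
    """
    oz, dz = origin[2], direction[2]
    if abs(dz) < 1e-9:
        return None  # ray runs parallel to the table
    t = -oz / dz
    if t <= 0:
        return None  # the table lies behind the fingertip
    return (origin[0] + t * direction[0], origin[1] + t * direction[1])


def pick_target(
    origin: Vec3,
    direction: Vec3,
    objects: Dict[str, Vec2],
    max_dist: float = 0.10,  # assumed tolerance in metres
) -> Optional[str]:
    """Return the name of the object whose centre is nearest the hit point."""
    hit = ray_table_hit(origin, direction)
    if hit is None or not objects:
        return None
    name, centre = min(objects.items(), key=lambda kv: math.dist(hit, kv[1]))
    # Reject the match when even the nearest object is too far from the hit.
    return name if math.dist(hit, centre) <= max_dist else None
```

Choosing the nearest centre within a tolerance keeps the selection forgiving of a shaky finger while still returning nothing when the user points at empty desk.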
How it runs
The whole system comes up from a single launch description that starts every node at once. From there, the modes are interactive.
In my rebuild I leaned on the pointing-to-motion path: detect the gesture, place a marker in the arm frame, and let a single roslaunch file bring the perception and control nodes up together.
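The "place a marker in the arm frame" step comes down to a change of coordinates followed by an aiming calculation. The sketch below shows the arithmetic in plain Python under strong assumptions: the camera-to-arm transform is a known yaw rotation plus a translation, and the projector is treated as a simple pan/tilt head. In a ROS system the transform would normally come from the transform tree and the joint angles from a motion planner; all function names here are hypothetical.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Matrix = List[List[float]]


def make_transform(yaw_rad: float, translation: Vec3) -> Matrix:
    """Build a 4x4 homogeneous transform: rotate about z by yaw, then translate."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def apply_transform(T: Matrix, p: Vec3) -> Vec3:
    """Map a point from the source frame (camera) into the target frame (arm)."""
    h = (p[0], p[1], p[2], 1.0)
    out = [sum(T[row][col] * h[col] for col in range(4)) for row in range(3)]
    return (out[0], out[1], out[2])


def aim_angles(projector_pos: Vec3, target: Vec3) -> Tuple[float, float]:
    """Return (pan, tilt) in radians that point the projector at the target.

    Pan is measured about the z axis; tilt is positive when aiming downward.
    """
    dx = target[0] - projector_pos[0]
    dy = target[1] - projector_pos[1]
    dz = target[2] - projector_pos[2]
    pan = math.atan2(dy, dx)
    tilt = math.atan2(-dz, math.hypot(dx, dy))
    return pan, tilt
```

For example, a camera rotated 90 degrees and offset 0.5 m along x from the arm base maps the camera-frame point (0.1, 0, 0) to roughly (0.5, 0.1, 0) in the arm frame, and a projector 0.4 m above the desk aiming at a point 0.4 m ahead needs a tilt of 45 degrees.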
Reflection