From-Scratch Build · Reinforcement Learning
What if you never had to write a reward function and could just swipe instead? That is preference-based reinforcement learning: show a human two behaviours, let them pick the better one, and learn a reward model from the swipes. I rebuilt it from scratch because the same idea underpins reinforcement learning from human feedback (RLHF), the technique used to align modern AI systems with what people actually want.
What it is
Writing a reward function is deceptively hard: spell out what "good" means in numbers and the agent will exploit every loophole you left. Preference-based RL sidesteps the hand-written specification: instead of defining the reward, you reveal it through comparisons. The agent shows you pairs of behaviours, you swipe for the one you prefer, and a reward model is fit to your choices.
The swipe is the whole interface. Like or dislike, this over that: a stream of cheap binary judgements that, in aggregate, encode a reward signal that would be very hard to write down by hand. It's the human-in-the-loop core of RLHF, made tangible.
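One common way to make "a reward model is fit to your choices" precise is the Bradley-Terry model. The text does not name the model used in this build, so treat the choice as an assumption; the symbols $r_\theta$ (learned reward), $\sigma$ (logistic function) and $\mathcal{D}$ (the set of swipes, each stored as preferred-then-rejected) are introduced here for illustration.

```latex
% Probability that behaviour A is preferred over behaviour B
P(A \succ B)
  = \frac{\exp r_\theta(A)}{\exp r_\theta(A) + \exp r_\theta(B)}
  = \sigma\bigl(r_\theta(A) - r_\theta(B)\bigr),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}

% Fit the reward by minimising the negative log-likelihood of the swipes
\mathcal{L}(\theta)
  = -\sum_{(A,\,B) \in \mathcal{D}} \log \sigma\bigl(r_\theta(A) - r_\theta(B)\bigr)
```

Under this model only reward differences matter: if $r_\theta(A) - r_\theta(B) = 1$, the predicted chance that A wins the swipe is $\sigma(1) \approx 0.73$, and equal rewards give exactly 0.5.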
The stack
The interesting machinery is the loop between human judgement and the learned reward.
Surface two candidate behaviours side by side and ask only one thing: which is better?
Use a clean probabilistic model to turn "A beat B" comparisons into continuous reward scores.
Train a small network to predict the reward such that preferred behaviours score higher.
Optimise the agent against the learned reward — not a hand-written one — and re-query as it improves.
Ask about the comparisons the model is most unsure of, so each human swipe buys the most information.
Watch for the agent gaming the proxy reward — the failure mode that makes this problem genuinely hard.
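The reward-fitting and query-selection pieces of the stack above can be sketched in dependency-free Python. This is a minimal sketch under stated assumptions, not the code from this build: it assumes a Bradley-Terry style logistic likelihood (the text does not name its model), swaps the small network for a linear reward over two made-up features, and simulates the human with a hidden weight vector. The names `fit_reward`, `most_uncertain_pair` and `true_w` are hypothetical.

```python
import math
import random


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def reward(w, features):
    # Linear reward model: a stand-in for the "small network" in the text.
    return sum(wi * fi for wi, fi in zip(w, features))


def fit_reward(prefs, dim, lr=0.5, epochs=200):
    """Gradient descent on the pairwise logistic loss.

    prefs is a list of (winner_features, loser_features) pairs.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for win, lose in prefs:
            p = sigmoid(reward(w, win) - reward(w, lose))
            # Gradient of -log(p) with respect to w is -(1 - p) * (win - lose).
            for i in range(dim):
                grad[i] -= (1.0 - p) * (win[i] - lose[i])
        w = [wi - lr * g / len(prefs) for wi, g in zip(w, grad)]
    return w


def most_uncertain_pair(w, candidates):
    """Return indices (i, j) of the pair whose predicted preference is closest to 50/50."""
    best, best_gap = None, float("inf")
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            p = sigmoid(reward(w, candidates[i]) - reward(w, candidates[j]))
            gap = abs(p - 0.5)
            if gap < best_gap:
                best, best_gap = (i, j), gap
    return best


# Toy data: the simulated human swipes according to a hidden weight vector.
rng = random.Random(0)
true_w = [2.0, -1.0]  # hypothetical "taste", never shown to the learner
behaviours = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(30)]

prefs = []
for _ in range(200):
    a, b = rng.sample(behaviours, 2)
    prefs.append((a, b) if reward(true_w, a) > reward(true_w, b) else (b, a))

w = fit_reward(prefs, dim=2)
print("learned weights:", w)
print("next pair to ask about:", most_uncertain_pair(w, behaviours))
```

The learned weights are only identified up to scale, since the likelihood depends on reward differences, so compare their direction with `true_w` rather than their magnitude. Picking the pair nearest 50/50 is one simple uncertainty heuristic; ensemble disagreement is another option that this sketch does not cover.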
Architecture
Learning and asking alternate in a loop that tightens with every batch of swipes:
The current policy produces a set of candidate behaviours to be judged.
Pick informative pairs and present them to the human for a simple A-or-B swipe.
Update the reward model so it agrees with the collected preferences.
Optimise the agent to earn more of the learned reward.
Feed the improved behaviours back into new queries — the model and the agent co-evolve.
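The five steps above can be sketched as a single loop. This is a toy illustration under heavy assumptions rather than the architecture of this build: behaviours are just six discrete options, the reward model is one score per option, the policy is a softmax over options, and the human is simulated by a hidden `true_utility` list. Every name here is hypothetical.

```python
import math
import random

rng = random.Random(1)
N = 6                                        # number of discrete behaviours (toy)
true_utility = [float(i) for i in range(N)]  # simulated human taste, hidden from the learner
scores = [0.0] * N                           # learned reward: one score per behaviour
logits = [0.0] * N                           # agent policy: softmax over behaviours


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


for _ in range(200):
    probs = softmax(logits)

    # 1. The current policy produces candidate behaviours.
    candidates = sorted(set(rng.choices(range(N), weights=probs, k=8)))

    if len(candidates) >= 2:
        # 2. Query the pair whose learned scores are closest (most uncertain).
        pairs = [(a, b) for i, a in enumerate(candidates) for b in candidates[i + 1:]]
        a, b = min(pairs, key=lambda p: abs(scores[p[0]] - scores[p[1]]))
        win, lose = (a, b) if true_utility[a] > true_utility[b] else (b, a)

        # 3. Nudge the reward model toward agreeing with the swipe
        #    (one gradient step on the pairwise logistic loss).
        step = 0.5 * (1.0 - sigmoid(scores[win] - scores[lose]))
        scores[win] += step
        scores[lose] -= step

    # 4. Improve the policy on the learned reward: exact gradient of the
    #    expected learned score with respect to the softmax logits.
    baseline = sum(p * s for p, s in zip(probs, scores))
    for i in range(N):
        logits[i] += 0.5 * probs[i] * (scores[i] - baseline)

    # 5. The next iteration samples candidates from the updated policy.

final_probs = softmax(logits)
print("learned scores:", [round(s, 2) for s in scores])
print("policy:", [round(p, 2) for p in final_probs])
```

Note that the policy only ever sees `scores`, never `true_utility`. If the policy narrows onto a few behaviours early, the rest stop being sampled and their scores stop being corrected, which is a small-scale version of the proxy-gaming risk mentioned earlier.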
Reflection