From-Scratch Build · Reinforcement Learning

Swipe-to-Train

What if you never had to write a reward function and could just swipe instead? This is preference-based reinforcement learning: show a human two behaviours, let them pick the better one, and learn a reward model from the swipes. I rebuilt it from scratch because the same idea underpins RLHF, the technique used to align modern AI systems with what people actually want.

Python · Preference learning · Reward model · RLHF · Bradley–Terry · Human-in-the-loop

What it is

Teaching an agent by liking and disliking

Writing a reward function is deceptively hard: say exactly what "good" means in numbers and the agent will exploit every loophole you left. Preference-based RL sidesteps much of that problem. Instead of defining the reward, you reveal it through comparisons: the agent shows you pairs of behaviours, you swipe for the one you prefer, and a reward model is fit to your choices.

The swipe is the whole interface. Like / dislike, this-over-that: a stream of cheap binary judgements that, in aggregate, encode a reward signal that would be very hard to write down by hand. It's the human-in-the-loop core of RLHF, made tangible.

A → B
A single comparison ("I prefer A to B") is all a person ever gives, yet thousands of them shape a full reward model.

The stack

From swipes to a reward signal

The interesting machinery is the loop between human judgement and the learned reward.

interface

Pairwise queries

Surface two candidate behaviours side by side and ask only one thing: which is better?

model

Bradley–Terry

A clean probabilistic model that turns "A beat B" comparisons into continuous reward scores.
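
To make the model concrete: Bradley–Terry says the probability that A is preferred over B is exp(r_A) / (exp(r_A) + exp(r_B)), which is the logistic sigmoid of the score difference r_A - r_B. The sketch below is a minimal illustration of that formula and its log-loss, not the project's actual code; the function names are hypothetical.

```python
import math

def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry: P(A preferred over B) = sigmoid(r_A - r_B)."""
    diff = reward_a - reward_b
    # Branch on the sign so math.exp never receives a large positive argument.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

def preference_loss(reward_a: float, reward_b: float, a_preferred: bool) -> float:
    """Negative log-likelihood of one swipe under the Bradley-Terry model."""
    # Margin by which the model agrees with the human's choice.
    margin = reward_a - reward_b if a_preferred else reward_b - reward_a
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Equal scores give a 50/50 prediction, and the loss shrinks as the model scores the chosen behaviour further above the rejected one.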

learning

Reward network

Train a small network to score behaviours so that, in each compared pair, the preferred one scores higher.
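
As a rough sketch of what fitting looks like, the example below trains a reward model by gradient descent on the Bradley–Terry log-loss. Several things here are assumptions made for illustration: a linear model stands in for the small network, behaviours are plain feature vectors, and the "human" is simulated by a hidden weight vector (`hidden`) that always picks the higher-utility option.

```python
import math
import random

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def reward(weights, features):
    # Linear stand-in for the reward network: r(x) = w . x
    return sum(w * x for w, x in zip(weights, features))

def fit_reward_model(preferences, n_features, lr=0.1, epochs=200):
    """preferences: list of (winner_features, loser_features) pairs."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for winner, loser in preferences:
            diff = reward(weights, winner) - reward(weights, loser)
            # Gradient of -log(sigmoid(diff)) w.r.t. weights is
            # -(1 - sigmoid(diff)) * (winner - loser); step against it.
            miss = 1.0 - sigmoid(diff)
            for i in range(n_features):
                weights[i] += lr * miss * (winner[i] - loser[i])
    return weights

def simulate_swipes(hidden_weights, n_pairs, rng):
    # Hypothetical noise-free human: always prefers the higher hidden utility.
    prefs = []
    for _ in range(n_pairs):
        a = [rng.uniform(-1, 1) for _ in hidden_weights]
        b = [rng.uniform(-1, 1) for _ in hidden_weights]
        if reward(hidden_weights, a) >= reward(hidden_weights, b):
            prefs.append((a, b))
        else:
            prefs.append((b, a))
    return prefs

rng = random.Random(0)
hidden = [2.0, -1.0, 0.5]  # assumed "true taste", never shown to the learner
train = simulate_swipes(hidden, 200, rng)
learned = fit_reward_model(train, n_features=3)
held_out = simulate_swipes(hidden, 200, rng)
agreement = sum(reward(learned, w) > reward(learned, l) for w, l in held_out) / len(held_out)
print(f"held-out agreement: {agreement:.2f}")
```

Comparisons only pin down the ranking of behaviours, so the learned weights should be judged by how often they agree with fresh swipes rather than by whether they match the hidden values.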

policy

Policy optimisation

Optimise the agent against the learned reward — not a hand-written one — and re-query as it improves.

efficiency

Query selection

Ask about the comparisons the model is most unsure of, so each human swipe buys the most information.
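
One simple way to sketch this, assuming a single reward model and the Bradley–Terry link, is to rank candidate pairs by how close the predicted preference probability is to 0.5. The helper name `select_queries` and the example scores are hypothetical; disagreement across an ensemble of reward models is another common choice and is not shown here.

```python
import itertools
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def select_queries(scores, k):
    """scores: dict mapping behaviour id -> predicted reward.

    Returns the k pairs whose predicted preference is closest to a coin flip.
    """
    def distance_from_coin_flip(pair):
        a, b = pair
        return abs(sigmoid(scores[a] - scores[b]) - 0.5)

    pairs = itertools.combinations(scores, 2)
    return sorted(pairs, key=distance_from_coin_flip)[:k]

# Hypothetical predicted rewards for four candidate behaviours.
scores = {"a": 0.1, "b": 2.0, "c": 0.3, "d": -1.5}
queries = select_queries(scores, k=2)
```

Here the model already expects "b" to beat "d" comfortably, so asking about that pair would teach it little; the near-tie between "a" and "c" is queried first.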

guardrail

Reward hacking checks

Watch for the agent gaming the proxy reward — the failure mode that makes this problem genuinely hard.
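
One possible check, offered as a heuristic sketch rather than the method used in this build: keep some swipes out of training, and flag a run when the learned reward keeps climbing while agreement with those held-out swipes falls below a threshold. The function names and the `min_agreement` default are assumptions chosen for illustration.

```python
def preference_agreement(reward_fn, held_out_prefs):
    """Fraction of held-out (winner, loser) pairs the reward model ranks correctly."""
    if not held_out_prefs:
        raise ValueError("need at least one held-out preference")
    correct = sum(reward_fn(winner) > reward_fn(loser) for winner, loser in held_out_prefs)
    return correct / len(held_out_prefs)

def looks_like_reward_hacking(mean_reward_history, agreement_history, min_agreement=0.7):
    """Heuristic flag: proxy reward went up but the human no longer agrees with it."""
    reward_rising = mean_reward_history[-1] > mean_reward_history[0]
    agreement_low = agreement_history[-1] < min_agreement
    return reward_rising and agreement_low
```

For the check to mean anything, the held-out swipes should be collected on behaviours from the current policy, since that is where the proxy is most likely to have drifted from what the human wants.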

Architecture

The human-in-the-loop cycle

Learning and asking alternate in a loop that tightens with every batch of swipes:

  1. Generate

    The current policy produces a set of candidate behaviours to be judged.

  2. Query

    Pick informative pairs and present them to the human for a simple A-or-B swipe.

  3. Fit reward

    Update the reward model so it agrees with the collected preferences.

  4. Improve policy

    Optimise the agent to earn more of the learned reward.

  5. Loop

    Feed the improved behaviours back into new queries — the model and the agent co-evolve.
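
The five steps above can be sketched end to end in a toy setting. Everything specific here is an assumption made to keep the example self-contained: behaviours are 2-D vectors clipped to [-1, 1], the human is simulated by a hidden linear utility (`hidden`), the reward model is linear, and "policy improvement" is a crude move to the best-scoring candidate rather than a real RL algorithm.

```python
import math
import random

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def generate(policy_mean, n, rng, spread=0.3):
    # Step 1: sample candidate behaviours around the current policy.
    return [[max(-1.0, min(1.0, m + rng.gauss(0, spread))) for m in policy_mean]
            for _ in range(n)]

def query(candidates, weights, n_queries, human):
    # Step 2: ask about the pairs the reward model is least sure of.
    pairs = [(a, b) for i, a in enumerate(candidates) for b in candidates[i + 1:]]
    pairs.sort(key=lambda p: abs(dot(weights, p[0]) - dot(weights, p[1])))
    # Store each answer as (winner, loser).
    return [(a, b) if human(a, b) else (b, a) for a, b in pairs[:n_queries]]

def fit_reward(preferences, n_features, lr=0.1, epochs=100):
    # Step 3: gradient descent on the Bradley-Terry negative log-likelihood.
    weights = [0.0] * n_features
    for _ in range(epochs):
        for winner, loser in preferences:
            miss = 1.0 - sigmoid(dot(weights, winner) - dot(weights, loser))
            for i in range(n_features):
                weights[i] += lr * miss * (winner[i] - loser[i])
    return weights

def improve(candidates, weights):
    # Step 4: move the policy to the candidate the learned reward likes best.
    return max(candidates, key=lambda c: dot(weights, c))

def run(rounds=20, seed=0):
    rng = random.Random(seed)
    hidden = [1.0, 0.5]  # hypothetical stand-in for the human's taste
    human = lambda a, b: dot(hidden, a) >= dot(hidden, b)
    policy, weights, preferences = [0.0, 0.0], [0.0, 0.0], []
    for _ in range(rounds):
        candidates = generate(policy, 8, rng)
        preferences += query(candidates, weights, 6, human)
        weights = fit_reward(preferences, 2)
        policy = improve(candidates, weights)  # Step 5: loop from the new policy
    return policy, dot(hidden, policy)

final_policy, true_utility = run()
```

The policy never sees `hidden` directly; it only chases the learned weights, which in turn are shaped only by the simulated swipes. That separation is the point of the loop.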

Reflection

What rebuilding it taught me