Swipe-to-Train — Built From Scratch

What it is

Teaching an agent by liking and disliking

Writing a reward function is deceptively hard: say exactly what "good" means in numbers and the agent will exploit every loophole you left. Preference-based RL sidesteps much of that problem. Instead of writing the reward down, you reveal it through your choices. The agent shows you pairs of behaviours, you swipe for the one you prefer, and a reward model is fit to those choices.

The swipe is the whole interface. Like / dislike, this-over-that — a stream of cheap binary judgements that, in aggregate, encode a reward signal no human could have written down by hand. It's the human-in-the-loop core of RLHF, made tangible.

A → B

A single comparison, "I prefer A to B", is all a person ever gives, yet thousands of them shape a full reward model.

The stack

From swipes to a reward signal

The interesting machinery is the loop between human judgement and the learned reward.

interface

Pairwise queries

Surface two candidate behaviours side by side and ask only one thing: which is better?

model

Bradley–Terry

A clean probabilistic model that turns "A beat B" comparisons into continuous reward scores.
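As a minimal sketch of that idea: under Bradley–Terry, if behaviours A and B have scalar rewards r_A and r_B, the probability that A is preferred is exp(r_A) / (exp(r_A) + exp(r_B)), which is a sigmoid of the difference r_A - r_B. The function name below is my own, not something from a library.

```python
import math

def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that A is preferred to B."""
    diff = reward_a - reward_b
    # Sigmoid of the reward difference, written in a numerically stable way
    # so that large gaps do not overflow math.exp.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)
```

Equal rewards give a probability of 0.5, and only the difference between the two rewards matters, so the scores are defined up to an additive constant.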

learning

Reward network

Train a small network to output a scalar reward for each behaviour, so that preferred behaviours score higher.
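To make the training objective concrete, here is a rough sketch that fits a reward model by gradient descent on the Bradley–Terry negative log-likelihood. It assumes each behaviour is summarised as a fixed-length feature vector, and it uses a linear model as a stand-in for the small network to keep the example dependency-free. The names `fit_reward`, `lr`, and `epochs` are illustrative.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def reward(weights, features):
    # Linear reward: a stand-in for a small reward network.
    return sum(w * f for w, f in zip(weights, features))

def fit_reward(preferences, n_features, lr=0.5, epochs=200):
    """Fit weights from a list of (winner_features, loser_features) pairs."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for winner, loser in preferences:
            p_win = sigmoid(reward(weights, winner) - reward(weights, loser))
            # Gradient of -log(p_win) w.r.t. the weights:
            # -(1 - p_win) * (winner - loser)
            for i in range(n_features):
                grad[i] -= (1.0 - p_win) * (winner[i] - loser[i])
        # Average gradient step.
        weights = [w - lr * g / len(preferences) for w, g in zip(weights, grad)]
    return weights
```

Swapping the linear model for a neural network changes how the reward is computed, but the loss stays the same: push the winner's reward above the loser's.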

policy

Policy optimisation

Optimise the agent against the learned reward — not a hand-written one — and re-query as it improves.
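A toy illustration of optimising against a learned reward, assuming the agent simply chooses among a fixed set of discrete behaviours (a bandit-style simplification, not a full RL setup). The policy is a softmax over logits and the update uses the exact policy gradient for that case. `learned_rewards` stands for the reward model's scores, and the function names are my own.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def improve_policy(logits, learned_rewards, lr=0.5, steps=100):
    """Gradient ascent on expected learned reward for a softmax policy."""
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, learned_rewards))
        # Exact policy gradient for a softmax bandit:
        # dJ/dlogit_k = p_k * (r_k - J)
        logits = [l + lr * p * (r - expected)
                  for l, p, r in zip(logits, probs, learned_rewards)]
    return logits
```

In a real system this step would be a standard RL algorithm run on the learned reward. The point of the sketch is only that the reward being climbed comes from the model, not from hand-written rules.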

efficiency

Query selection

Ask about the comparisons the model is most unsure of, so each human swipe buys the most information.
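One simple way to approximate "most unsure", sketched under the assumption that the reward model's own scores are the only uncertainty signal available: under Bradley–Terry, a pair whose predicted rewards are nearly equal has a preference probability near 0.5, so the smallest reward gap marks the least certain comparison. Other heuristics, such as disagreement across an ensemble of reward models, are also possible. `select_queries` is a hypothetical helper name.

```python
from itertools import combinations

def select_queries(scores, k=1):
    """Return the k candidate pairs with the smallest predicted reward gap.

    scores: dict mapping candidate id -> predicted reward.
    """
    pairs = combinations(sorted(scores), 2)
    # A small |r_a - r_b| means P(A preferred to B) is close to 0.5.
    ranked = sorted(pairs, key=lambda pair: abs(scores[pair[0]] - scores[pair[1]]))
    return ranked[:k]
```

For example, with scores `{"a": 2.0, "b": 0.1, "c": 1.9, "d": -1.0}`, the first pair selected is `("a", "c")`, since those two are almost tied.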

guardrail

Reward hacking checks

Watch for the agent gaming the proxy reward — the failure mode that makes this problem genuinely hard.
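One possible check, sketched here as an assumption rather than a definitive method: keep a set of fresh human comparisons that the reward model never trained on, and raise a flag when the learned reward keeps climbing while the model's agreement with those held-out labels drops. The threshold `min_agreement` and both function names are illustrative choices.

```python
def holdout_agreement(reward_fn, holdout_preferences):
    """Fraction of held-out (winner, loser) pairs the reward model ranks
    the same way the human did."""
    correct = sum(1 for winner, loser in holdout_preferences
                  if reward_fn(winner) > reward_fn(loser))
    return correct / len(holdout_preferences)

def looks_like_reward_hacking(mean_reward_history, agreement_history,
                              min_agreement=0.7):
    """Flag when learned reward has risen but held-out agreement is low."""
    reward_rising = mean_reward_history[-1] > mean_reward_history[0]
    agreement_low = agreement_history[-1] < min_agreement
    return reward_rising and agreement_low
```

A rising proxy reward paired with falling agreement suggests the agent has moved into a region where the reward model no longer tracks human preferences.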

Architecture

The human-in-the-loop cycle

Learning and asking alternate in a loop that tightens with every batch of swipes:

Generate
The current policy produces a set of candidate behaviours to be judged.
Query
Pick informative pairs and present them to the human for a simple A-or-B swipe.
Fit reward
Update the reward model so it agrees with the collected preferences.
Improve policy
Optimise the agent to earn more of the learned reward.
Loop
Feed the improved behaviours back into new queries — the model and the agent co-evolve.
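The five steps above can be sketched end to end with a simulated human. Everything here is a toy assumption: behaviours are feature vectors, the "human" swipes according to hidden `true_weights`, the reward model is linear, and the "policy" is just a point in feature space that drifts towards the best-scoring candidate. It illustrates the shape of the loop, not a production setup.

```python
import math
import random

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def score(weights, features):
    # Linear reward: a stand-in for a reward network.
    return sum(w * f for w, f in zip(weights, features))

def run_loop(true_weights, rounds=20, n_candidates=6, lr=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(true_weights)
    learned = [0.0] * dim       # learned reward parameters
    policy_mean = [0.0] * dim   # toy "policy": a point in feature space
    preferences = []            # collected (winner, loser) pairs

    for _ in range(rounds):
        # Generate: sample candidate behaviours around the current policy.
        candidates = [[m + rng.gauss(0.0, 0.5) for m in policy_mean]
                      for _ in range(n_candidates)]
        # Query: pick the pair the current reward model separates least.
        pairs = [(a, b) for i, a in enumerate(candidates)
                 for b in candidates[i + 1:]]
        a, b = min(pairs, key=lambda p: abs(score(learned, p[0]) - score(learned, p[1])))
        # Simulated swipe: the hidden true reward decides the winner.
        if score(true_weights, a) >= score(true_weights, b):
            preferences.append((a, b))
        else:
            preferences.append((b, a))
        # Fit reward: gradient steps on the Bradley-Terry negative log-likelihood.
        for _ in range(50):
            grad = [0.0] * dim
            for winner, loser in preferences:
                p_win = sigmoid(score(learned, winner) - score(learned, loser))
                for i in range(dim):
                    grad[i] -= (1.0 - p_win) * (winner[i] - loser[i])
            learned = [w - lr * g / len(preferences) for w, g in zip(learned, grad)]
        # Improve policy: drift towards the candidate the learned reward likes best.
        best = max(candidates, key=lambda c: score(learned, c))
        policy_mean = [0.8 * m + 0.2 * c for m, c in zip(policy_mean, best)]
    return learned, preferences, policy_mean
```

Calling `run_loop([1.0, -1.0])` runs twenty rounds of generate, query, fit, and improve, with one simulated swipe per round.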

Reflection

What rebuilding it taught me

Preferences are easier than rewards. People tend to be unreliable at assigning absolute scores but far more consistent at picking the better of two options.
The reward model is the product. Once you can learn a reward from comparisons, the policy optimisation is the familiar part — the hard, valuable bit is the model of human taste.
Which question you ask matters. Querying the most uncertain comparisons makes a handful of human swipes go a very long way.
Proxies get gamed. An agent optimising a learned reward will find its blind spots; watching for reward hacking is half the work.
This is how alignment scales. The same loop (compare, learn a reward, optimise) is the backbone of RLHF-style training that teaches AI systems to follow human intent.
