RL Control Lab — Built From Scratch

What it is

A bench where RL methods compete

Reinforcement learning has a sprawling family of algorithms, and reading about them only gets you so far — they truly click when you run two side by side on the same problem and watch one learn faster, more stably, or not at all. This lab is exactly that: a shared harness where each method plugs into the same environments and reports the same metrics.

The point isn't a single clever agent — it's the comparison. Value-based vs policy-based, on-policy vs off-policy, tabular vs approximate. By holding the environment fixed and swapping only the algorithm, the trade-offs that textbooks describe in prose become curves you can actually see.

1 : N

one shared environment harness, many interchangeable algorithms — the design that makes honest comparison possible.

The stack

The methods on the bench

Each is implemented from first principles, not imported from a library.

planning

Value iteration

When the model is known, apply the Bellman optimality backup to every state until the values stop changing. This is the baseline every learner is measured against.
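A minimal sketch of that sweep, assuming the model is given as a table of transitions. The table layout, the `CHAIN` toy task, and the names `gamma` and `tol` are illustrative assumptions, not the lab's actual data structures.

```python
def value_iteration(transitions, gamma=0.9, tol=1e-8):
    """Sweep V(s) = max_a sum_s' p * (r + gamma * V(s')) until it converges.

    transitions[s][a] is a list of (prob, next_state, reward, done) tuples.
    A state with no actions is treated as terminal (hypothetical convention).
    """
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            if not actions:  # terminal state keeps value 0
                continue
            best = max(
                sum(p * (r + (0.0 if done else gamma * values[s2]))
                    for p, s2, r, done in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best  # in-place update, reused within the same sweep
        if delta < tol:
            return values


# Toy three-state chain: reaching state 2 from state 1 pays a reward of 1.
CHAIN = {
    0: {"left": [(1.0, 0, 0.0, False)], "right": [(1.0, 1, 0.0, False)]},
    1: {"left": [(1.0, 0, 0.0, False)], "right": [(1.0, 2, 1.0, True)]},
    2: {},
}

values = value_iteration(CHAIN, gamma=0.9)
```

On this chain the state next to the goal is worth the full reward and the state one step further back is worth that reward discounted once, which makes the result easy to check by hand.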

off-policy

Q-learning

Learn the optimal action-values directly, even while exploring — the workhorse of tabular RL.
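The core of the method is a single update, sketched here for a tabular agent. The function signature, the `(state, action)` dictionary keys, and the default `alpha` and `gamma` are assumptions made for illustration.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, done,
                      actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step.

    The target uses the best action in next_state, regardless of which
    action the exploring agent will actually take there (off-policy).
    """
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)       # unseen (state, action) pairs start at 0
q[(1, "right")] = 2.0        # pretend the next state already has a good action
updated = q_learning_update(q, state=0, action="right", reward=1.0,
                            next_state=1, done=False,
                            actions=("left", "right"), alpha=0.5, gamma=0.9)
```

Because the target takes a max over next actions, the table can converge towards optimal values even while the behaviour policy keeps exploring.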

on-policy

SARSA

Learn the value of the policy you're actually following, exploration included. The result is more cautious, and a clear illustration of how on-policy and off-policy learning differ.
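A sketch of the corresponding tabular update, under the same illustrative assumptions as any dictionary-backed agent: the signature, the key layout, and the `alpha` and `gamma` defaults are hypothetical.

```python
from collections import defaultdict


def sarsa_update(q, state, action, reward, next_state, next_action, done,
                 alpha=0.1, gamma=0.99):
    """One tabular SARSA step.

    The target uses the action the agent really chose in next_state,
    so exploratory moves feed back into the learned values (on-policy).
    """
    next_value = 0.0 if done else q[(next_state, next_action)]
    td_target = reward + gamma * next_value
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)
q[(1, "left")] = 0.5         # value of the action actually taken next
q[(1, "right")] = 2.0        # a better action exists, but SARSA ignores it here
updated = sarsa_update(q, state=0, action="right", reward=1.0,
                       next_state=1, next_action="left", done=False,
                       alpha=0.5, gamma=0.9)
```

The only change from a Q-learning update is the target: the value of the chosen next action replaces the max over next actions.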

policy-based

Policy gradients

Skip values; nudge the policy parameters directly in the direction of higher expected return.
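As a small illustration, here is a REINFORCE-style update on a two-armed bandit with a softmax policy. The bandit task, the running-average baseline, and every name in the block are assumptions chosen to keep the sketch short; the lab's own control tasks are multi-step.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into probabilities (shifted for stability)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    """Adjust softmax preferences along reward * grad log pi(action)."""
    rng = random.Random(seed)
    prefs = [0.0] * len(arm_rewards)   # the policy parameters
    baseline = 0.0                     # running mean reward, reduces variance
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        reward = arm_rewards[action]   # deterministic payout per arm
        advantage = reward - baseline
        for a in range(len(prefs)):
            # d/d prefs[a] of log pi(action) for a softmax policy
            grad_log = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += lr * advantage * grad_log
        baseline += (reward - baseline) / t
    return softmax(prefs)


probs = reinforce_bandit([0.0, 1.0])   # arm 1 always pays, arm 0 never does
```

No value table appears anywhere: the only learned quantities are the preferences that define the policy, and the probability of the paying arm rises as training proceeds.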

harness

Shared environment API

A common interface so any agent can run on any task without touching the algorithm code.
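One way such an interface could look, sketched with `typing.Protocol`. The method names (`reset`, `step`, `act`, `learn`), the `Corridor` task, and the `AlwaysRight` agent are hypothetical stand-ins, not the lab's real classes.

```python
from typing import Hashable, Protocol, Sequence, Tuple


class Environment(Protocol):
    """Hypothetical task interface: all an agent ever sees of a task."""
    actions: Sequence[Hashable]

    def reset(self, seed: int) -> Hashable: ...
    def step(self, action: Hashable) -> Tuple[Hashable, float, bool]: ...


class Agent(Protocol):
    """Hypothetical agent interface: all the harness ever sees of a method."""
    def act(self, state: Hashable) -> Hashable: ...
    def learn(self, state, action, reward, next_state, done) -> None: ...


def run_episode(env: Environment, agent: Agent, seed: int,
                max_steps: int = 1000) -> float:
    """Run one episode and return the total reward collected."""
    state = env.reset(seed)
    total = 0.0
    for _ in range(max_steps):
        action = agent.act(state)
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state, done)
        total += reward
        state = next_state
        if done:
            break
    return total


class Corridor:
    """Toy task: walk right along a corridor to reach the goal cell."""
    actions = (-1, +1)

    def __init__(self, length: int = 4):
        self.length = length
        self.position = 0

    def reset(self, seed: int) -> int:
        self.position = 0  # deterministic task, so the seed is unused
        return self.position

    def step(self, action: int):
        self.position = min(max(self.position + action, 0), self.length - 1)
        done = self.position == self.length - 1
        return self.position, (1.0 if done else 0.0), done


class AlwaysRight:
    """Trivial agent used only to exercise the interface."""
    def act(self, state):
        return +1

    def learn(self, state, action, reward, next_state, done):
        pass  # nothing to learn


total = run_episode(Corridor(length=4), AlwaysRight(), seed=0)
```

The episode loop knows nothing about either class beyond the two protocols, which is what lets any agent be paired with any task.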

analysis

Learning curves

Reward-over-time plots and stability metrics, logged identically for every method.
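A sketch of the kind of summary this implies, given a list of per-episode returns. The specific choices here (a trailing moving average for the curve, and the standard deviation over the last few episodes as the stability number) are assumptions; the page does not say which stability metrics the lab logs.

```python
from statistics import mean, pstdev


def moving_average(returns, window):
    """Trailing mean over up to `window` episodes, one value per episode."""
    smoothed = []
    running = 0.0
    for i, r in enumerate(returns):
        running += r
        if i >= window:
            running -= returns[i - window]  # drop the episode that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


def summarise(returns, window=10, tail=20):
    """Reduce one training run to a smoothed curve plus two scalars."""
    tail_returns = returns[-tail:]
    return {
        "curve": moving_average(returns, window),
        "final_performance": mean(tail_returns),
        "stability": pstdev(tail_returns),  # lower means steadier late training
    }


curve = moving_average([1.0, 2.0, 3.0, 4.0], window=2)
report = summarise([0.0, 0.0, 1.0, 1.0], window=2, tail=2)
```

Computing the summary in one function for every method keeps smoothing windows and tail lengths from drifting between runs.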

Architecture

How a comparison runs

Every experiment follows the same controlled procedure so results are actually comparable:

Fix the task
Choose a control environment and freeze its settings for the whole sweep.
Plug an agent
Drop in one algorithm through the shared interface — nothing else changes.
Train
Run the agent for a fixed budget of episodes, logging reward and stability throughout.
Repeat across methods
Swap the agent and rerun under identical conditions and seeds.
Compare
Overlay the learning curves to read off speed, stability and final performance.
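The procedure above can be sketched as a small sweep function. To stay self-contained, the example uses a two-armed bandit as a stand-in task and two deliberately simple methods; `ARM_PROBS`, `compare`, and both trainers are hypothetical names, not the lab's code.

```python
import random
from statistics import mean

ARM_PROBS = (0.2, 0.8)  # step 1: task settings, frozen for the whole sweep


def pull(arm, rng):
    """Bernoulli payout of one arm."""
    return 1.0 if rng.random() < ARM_PROBS[arm] else 0.0


def train_random(episodes, seed):
    rng = random.Random(seed)
    return [pull(rng.randrange(2), rng) for _ in range(episodes)]


def train_epsilon_greedy(episodes, seed, epsilon=0.1):
    rng = random.Random(seed)
    estimates, counts, returns = [0.0, 0.0], [0, 0], []
    for _ in range(episodes):
        if rng.random() < epsilon:
            arm = rng.randrange(2)                      # explore
        else:
            arm = max(range(2), key=lambda a: estimates[a])  # exploit
        reward = pull(arm, rng)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        returns.append(reward)
    return returns


def compare(methods, episodes, seeds):
    """Steps 2 to 4: same budget and same seeds for every method."""
    curves = {}
    for name, train in methods.items():
        runs = [train(episodes, seed) for seed in seeds]
        if any(len(run) != episodes for run in runs):
            raise ValueError(f"{name} did not use the full episode budget")
        # Average episode by episode across seeds to get one curve per method.
        curves[name] = [mean(step) for step in zip(*runs)]
    return curves


methods = {"random": train_random, "epsilon_greedy": train_epsilon_greedy}
curves = compare(methods, episodes=200, seeds=range(20))
```

Each entry in `curves` is a seed-averaged learning curve of equal length, ready to be overlaid for step 5. Rerunning the sweep with the same seeds reproduces the same curves.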

Reflection

What rebuilding it taught me

Implementing beats reading. The Bellman equation is one line of maths and a surprising amount of bookkeeping; writing it is what makes it real.
On-policy vs off-policy is a personality. Because SARSA accounts for its own exploratory moves and Q-learning does not, SARSA tends to learn a safer path and Q-learning a bolder one: the same task, two genuinely different agents.
A fair harness is the hard part. Most of the engineering is making sure the only thing that changes between runs is the algorithm.
Seeds and variance matter. One lucky run proves nothing; comparing methods honestly means averaging over many.
Policy gradients think differently. Optimising a policy directly, with no value table in sight, reframes the whole problem.
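On the point about seeds and variance, a minimal sketch of how per-seed results might be reduced to a mean and a standard error. The function name and the example scores are made up for illustration and are not results from the lab.

```python
from math import sqrt
from statistics import mean, stdev


def summarise_across_seeds(final_scores):
    """Return (mean, standard error) of one method's score over several seeds."""
    n = len(final_scores)
    if n < 2:
        raise ValueError("need at least two seeds to estimate variance")
    return mean(final_scores), stdev(final_scores) / sqrt(n)


# Illustrative numbers only: one score per seed for a single method.
avg, sem = summarise_across_seeds([1.0, 2.0, 3.0])
```

Reporting the standard error next to the mean makes it visible when the gap between two methods is smaller than the noise between seeds.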
