From-Scratch Build · Reinforcement Learning

RL Control Lab

One environment, many algorithms. This is a workbench for the classic reinforcement-learning methods — value iteration, Q-learning, SARSA, policy gradients — each implemented from scratch and pitted against the same control tasks so you can see exactly how they differ. I built it to turn a reading list of RL chapters into something I could run.

Python · Q-learning · SARSA · Policy gradients · Gymnasium · Benchmarking

What it is

A bench where RL methods compete

Reinforcement learning has a sprawling family of algorithms, and reading about them only gets you so far — they truly click when you run two side by side on the same problem and watch one learn faster, more stably, or not at all. This lab is exactly that: a shared harness where each method plugs into the same environments and reports the same metrics.

The point isn't a single clever agent — it's the comparison. Value-based vs policy-based, on-policy vs off-policy, tabular vs approximate. By holding the environment fixed and swapping only the algorithm, the trade-offs that textbooks describe in prose become curves you can actually see.

1 : N
one shared environment harness, many interchangeable algorithms — the design that makes honest comparison possible.

The stack

The methods on the bench

Each is implemented from first principles, not imported from a library.

planning

Value iteration

When the model is known, sweep the Bellman optimality backup over every state until the values stop changing. This is the baseline every learner is measured against.
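
As a rough illustration of that sweep, here is a minimal value-iteration sketch on a made-up three-state chain. The MDP, the `P` layout and the function name are assumptions for this example, not the lab's actual code.

```python
# Hypothetical 3-state chain: states 0 and 1 are live, state 2 is terminal.
# P[s][a] is a list of (probability, next_state, reward) outcomes.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 2, 1.0)]},
    2: {},  # terminal: no actions, so its value stays at 0
}


def value_iteration(P, gamma=0.9, tol=1e-8):
    """Repeat Bellman optimality backups until the largest change is below tol."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s, actions in P.items():
            if not actions:
                continue  # skip terminal states
            # Best expected one-step return over all actions in state s.
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


V = value_iteration(P)
```

In this toy chain the values settle at 1.0 for state 1 (one step from the reward) and 0.9 for state 0 (the same reward, discounted once).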

off-policy

Q-learning

Learn the optimal action-values directly, even while exploring — the workhorse of tabular RL.
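
The core of tabular Q-learning is a one-line update. The sketch below shows one way to write it; the function name, the `(state, action)` dictionary keys and the sample numbers are illustrative assumptions rather than the lab's implementation.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, done, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step.

    The target bootstraps from the best next action, regardless of which
    action the exploring policy takes next. That is what makes it off-policy.
    """
    if done:
        target = r
    else:
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


# Made-up table entries for a single transition 0 --right--> 1 with reward 1.
Q = defaultdict(float)
Q[(1, "left")] = 0.0
Q[(1, "right")] = 2.0
q_learning_update(Q, s=0, a="right", r=1.0, s_next=1,
                  actions=["left", "right"], done=False, alpha=0.5, gamma=0.9)
```

Here the target is 1 + 0.9 × 2 = 2.8, so with a step size of 0.5 the entry for `(0, "right")` moves from 0 to 1.4.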

on-policy

SARSA

Learn the value of the policy you're actually following, exploration included. It tends to be more cautious than Q-learning, which makes the on-policy vs off-policy difference visible.
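
A minimal sketch of the SARSA update, under the same kind of assumptions (hypothetical function name, dictionary-backed table, made-up numbers). The only change from a greedy bootstrap is that the target uses the action the agent will really take next.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    """One tabular SARSA step.

    The target bootstraps from Q(s_next, a_next), where a_next is the action
    actually chosen by the current (exploring) policy. That is on-policy.
    """
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


# Made-up table: in state 1, "right" looks good and "left" looks worthless.
Q = defaultdict(float)
Q[(1, "left")] = 0.0
Q[(1, "right")] = 2.0

# Suppose exploration picks "left" next. SARSA uses that choice in its target.
sarsa_update(Q, s=0, a="right", r=1.0, s_next=1, a_next="left",
             done=False, alpha=0.5, gamma=0.9)
```

With these numbers the target is 1 + 0.9 × 0 = 1, so the entry moves to 0.5. A max-based target would have used the value 2.0 for "right" instead, which is why SARSA's estimates account for the cost of exploratory moves.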

policy-based

Policy gradients

Skip values; nudge the policy parameters directly in the direction of higher expected return.
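
To make "nudging the parameters" concrete, here is a small REINFORCE-style sketch on a two-armed bandit with a softmax policy. The task, the fixed rewards and every name in it are assumptions chosen to keep the example tiny; it is not taken from the lab.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into probabilities (shifted for stability)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, steps=500, alpha=0.1, seed=0):
    """REINFORCE without a baseline on a bandit with fixed per-arm rewards."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)  # one preference per arm
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs)[0]
        G = rewards[a]  # the return of this one-step episode
        for k in range(len(theta)):
            # Gradient of log pi(a) w.r.t. theta[k] for a softmax policy.
            grad_log = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += alpha * G * grad_log
    return softmax(theta)


# Hypothetical task: arm 0 pays 0, arm 1 pays 1.
probs = reinforce_bandit(rewards=[0.0, 1.0])
```

No value table appears anywhere: each pull shifts probability toward the sampled arm in proportion to the reward it earned, so the policy drifts toward arm 1.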

harness

Shared environment API

A common interface so any agent can run on any task without touching the algorithm code.
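
One way such an interface could look is sketched below. The `Env` and `Agent` protocols, `run_episode`, and the toy `Corridor` task are all hypothetical names invented for this example; the lab's real interface may differ.

```python
from typing import Hashable, Protocol, Sequence, Tuple


class Env(Protocol):
    actions: Sequence[Hashable]

    def reset(self, seed: int) -> Hashable: ...
    def step(self, action: Hashable) -> Tuple[Hashable, float, bool]: ...


class Agent(Protocol):
    def act(self, state: Hashable) -> Hashable: ...
    def learn(self, s, a, r, s_next, done) -> None: ...


def run_episode(env: Env, agent: Agent, seed: int, max_steps: int = 100) -> float:
    """Run one episode; the loop never looks inside the agent or the task."""
    state = env.reset(seed)
    total = 0.0
    for _ in range(max_steps):
        action = agent.act(state)
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state, done)
        total += reward
        state = next_state
        if done:
            break
    return total


class Corridor:
    """Toy task: walk right to the end; each non-final step costs 0.1."""
    actions = ("left", "right")

    def __init__(self, length=4):
        self.length = length
        self.pos = 0

    def reset(self, seed):
        self.pos = 0  # deterministic task, so the seed is unused here
        return self.pos

    def step(self, action):
        self.pos = max(0, self.pos + (1 if action == "right" else -1))
        done = self.pos == self.length
        return self.pos, (1.0 if done else -0.1), done


class AlwaysRight:
    """Placeholder agent that never learns; stands in for a real algorithm."""
    def act(self, state):
        return "right"

    def learn(self, s, a, r, s_next, done):
        pass


episode_return = run_episode(Corridor(), AlwaysRight(), seed=0)
```

Because `run_episode` depends only on the two protocols, swapping in a different agent or task means passing a different object, not editing the loop.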

analysis

Learning curves

Reward-over-time plots and stability metrics, logged identically for every method.
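
As an illustration of what identical logging can mean in practice, the sketch below smooths a list of per-episode returns and summarizes the tail of a run. The choice of a trailing moving average and of standard deviation as the stability measure are assumptions for this example, as are the function names and the sample data.

```python
from statistics import mean, pstdev


def moving_average(returns, window):
    """Trailing average; early points use however many episodes exist so far."""
    out = []
    for i in range(len(returns)):
        lo = max(0, i - window + 1)
        out.append(mean(returns[lo:i + 1]))
    return out


def summarize(returns, window=3):
    """Final performance and its spread over the last `window` episodes."""
    tail = returns[-window:]
    return {"final_mean": mean(tail), "final_std": pstdev(tail)}


# Made-up returns from six episodes of a steadily improving agent.
returns = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = moving_average(returns, window=3)
summary = summarize(returns)
```

Applying the same two functions to every method's log keeps the curves and summary numbers directly comparable.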

Architecture

How a comparison runs

Every experiment follows the same controlled procedure so results are actually comparable:

  1. Fix the task

    Choose a control environment and freeze its settings for the whole sweep.

  2. Plug an agent

    Drop in one algorithm through the shared interface — nothing else changes.

  3. Train

    Run the agent for a fixed budget of episodes, logging reward and stability throughout.

  4. Repeat across methods

    Swap the agent and rerun under identical conditions and seeds.

  5. Compare

    Overlay the learning curves to read off speed, stability and final performance.
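
The five steps above can be sketched as a small loop. Everything here is a stand-in: the bandit task, the two placeholder agents and the `compare` function are hypothetical and only show the shape of a seed-matched sweep, not the lab's harness.

```python
import random
from statistics import mean


class TwoArmedBandit:
    """Hypothetical one-step task: arm 1 pays more on average."""
    def __init__(self, seed):
        self.rng = random.Random(seed)

    def pull(self, arm):
        return self.rng.gauss(1.0 if arm == 1 else 0.0, 1.0)


class RandomAgent:
    def __init__(self, seed):
        self.rng = random.Random(seed)

    def act(self):
        return self.rng.randrange(2)

    def learn(self, arm, reward):
        pass


class GreedyAverageAgent:
    """Epsilon-greedy over the running average reward of each arm."""
    def __init__(self, seed, epsilon=0.1):
        self.rng = random.Random(seed)
        self.epsilon = epsilon
        self.totals = [0.0, 0.0]
        self.counts = [0, 0]

    def act(self):
        if self.rng.random() < self.epsilon or 0 in self.counts:
            return self.rng.randrange(2)
        avgs = [t / c for t, c in zip(self.totals, self.counts)]
        return avgs.index(max(avgs))

    def learn(self, arm, reward):
        self.totals[arm] += reward
        self.counts[arm] += 1


def compare(agent_classes, seeds, episodes):
    """Same task, same seeds, same budget; only the agent class changes."""
    curves = {}
    for cls in agent_classes:                      # step 4: repeat across methods
        runs = []
        for seed in seeds:
            env = TwoArmedBandit(seed)             # step 1: fix the task
            agent = cls(seed=10_000 + seed)        # step 2: plug an agent
            rewards = []
            for _ in range(episodes):              # step 3: train on a fixed budget
                arm = agent.act()
                r = env.pull(arm)
                agent.learn(arm, r)
                rewards.append(r)
            runs.append(rewards)
        # Average across seeds at each episode index to get one curve per method.
        curves[cls.__name__] = [mean(step) for step in zip(*runs)]
    return curves                                  # step 5: overlay and compare


curves = compare([RandomAgent, GreedyAverageAgent], seeds=[0, 1, 2], episodes=200)
```

Seeding the task and the agent explicitly makes the whole sweep reproducible: running `compare` again with the same arguments returns the same curves, so any difference between methods comes from the algorithm rather than from luck.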

Reflection

What rebuilding it taught me