An interactive RL playground

Reinforcement Learning,
Made Tangible.

Every core concept from the course — from bandits to actor-critic — rendered as a live demo you can poke, tweak, and watch learn in real time.

The Foundations

An agent acts in an environment. The environment returns a state and a reward. The agent's job: learn a policy π(a|s) that maximises the expected discounted return Gₜ = Σₖ γᵏ rₜ₊ₖ₊₁, where γ ∈ [0, 1] is the discount factor. Every page below explores one slice of how that learning happens.
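As a small worked illustration of the return G, the sketch below sums the discounted rewards of one finite episode. The helper name `discountedReturn` and the sample rewards are assumptions made for this example, not code from the site.

```javascript
// Discounted return G = sum over k of gamma^k * r_k for one finite episode.
// `discountedReturn` is a hypothetical helper name used only for illustration.
function discountedReturn(rewards, gamma) {
  let g = 0;
  // Accumulate backwards so each step is G <- r + gamma * G.
  for (let k = rewards.length - 1; k >= 0; k--) {
    g = rewards[k] + gamma * g;
  }
  return g;
}

const g = discountedReturn([1, 0, 2], 0.5);
console.log(g); // 1 + 0.5 * 0 + 0.25 * 2 = 1.5
```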

Multi-Armed Bandits

The exploration–exploitation dilemma in its purest form. ε-greedy, UCB, Thompson sampling, gradient bandits — pull arms and watch regret accumulate.

exploration · regret
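To make the exploration and regret idea concrete, here is a minimal sketch of ε-greedy on a Bernoulli bandit. The function name `runEpsilonGreedy`, its parameters, and the arm probabilities are assumptions for illustration; the site's own implementation may differ.

```javascript
// Epsilon-greedy on a Bernoulli bandit, tracking cumulative expected regret.
// `runEpsilonGreedy` and the arm probabilities are illustrative assumptions.
function runEpsilonGreedy(armProbs, steps, epsilon, rng = Math.random) {
  const n = armProbs.length;
  const counts = new Array(n).fill(0);
  const estimates = new Array(n).fill(0);
  const best = Math.max(...armProbs);
  let regret = 0;

  for (let t = 0; t < steps; t++) {
    // Explore with probability epsilon, otherwise pull the greedy arm.
    let arm;
    if (rng() < epsilon) {
      arm = Math.floor(rng() * n);
    } else {
      arm = estimates.indexOf(Math.max(...estimates));
    }
    const reward = rng() < armProbs[arm] ? 1 : 0;

    // Incremental sample-average update: Q <- Q + (r - Q) / N.
    counts[arm] += 1;
    estimates[arm] += (reward - estimates[arm]) / counts[arm];

    // Expected regret of this pull: gap between the best arm and the chosen arm.
    regret += best - armProbs[arm];
  }
  return { estimates, counts, regret };
}

const result = runEpsilonGreedy([0.2, 0.5, 0.8], 2000, 0.1);
console.log(result.counts, result.regret.toFixed(1));
```

Because exploration never stops at a fixed ε, regret in this sketch keeps growing roughly linearly, which is the behaviour UCB and Thompson sampling are designed to improve on.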

MDPs & Gridworld

States, actions, transitions, rewards. Draw your own gridworld, place rewards and walls, and see the Bellman equation in action.

states · actions · γ
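One way to picture states, actions, transitions, and rewards is a tiny grid with a transition function and a single Bellman backup. The layout, reward values, and the names `GRID`, `step`, and `bellmanBackup` below are assumptions for this sketch; the goal is not treated as terminal here, to keep the example short.

```javascript
// A tiny deterministic gridworld: '#' wall, 'G' goal (+1 on entry), '.' free cell.
// The layout, rewards, and function names are illustrative assumptions.
const GRID = [
  "..G",
  ".#.",
  "...",
];
const ACTIONS = { up: [-1, 0], down: [1, 0], left: [0, -1], right: [0, 1] };

// Transition function: returns the next cell and the reward for (state, action).
function step([row, col], action) {
  const [dr, dc] = ACTIONS[action];
  const r = row + dr, c = col + dc;
  const blocked =
    r < 0 || r >= GRID.length || c < 0 || c >= GRID[0].length || GRID[r][c] === "#";
  const next = blocked ? [row, col] : [r, c]; // bumping a wall leaves you in place
  const reward = GRID[next[0]][next[1]] === "G" ? 1 : 0;
  return { next, reward };
}

// One Bellman optimality backup for a single state:
// v(s) <- max over a of [ r + gamma * v(s') ].
function bellmanBackup(state, values, gamma) {
  return Math.max(
    ...Object.keys(ACTIONS).map((a) => {
      const { next, reward } = step(state, a);
      return reward + gamma * values[next[0]][next[1]];
    })
  );
}

const zeros = GRID.map((row) => [...row].map(() => 0));
console.log(bellmanBackup([0, 1], zeros, 0.9)); // 1: moving right reaches the goal
```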

Dynamic Programming

Value Iteration and Policy Iteration. Sweep the Bellman optimality update across the grid and watch v* and π* converge.

bellman · planning
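The sweep described above can be sketched on a one-dimensional corridor instead of a full grid. Everything here (the corridor layout, the +1 terminal reward, the name `valueIteration`, and the stopping threshold `theta`) is an assumption chosen to keep the example short.

```javascript
// Value iteration on a 1-D corridor: states 0..n-1, the last state is terminal,
// and stepping into it pays +1. Layout and names are illustrative assumptions.
function valueIteration(n, gamma, theta = 1e-9) {
  const v = new Array(n).fill(0);
  const policy = new Array(n).fill(null);
  const moves = { left: -1, right: +1 };
  let delta;
  do {
    delta = 0;
    for (let s = 0; s < n - 1; s++) { // the terminal state keeps v = 0
      let bestValue = -Infinity, bestAction = null;
      for (const [name, d] of Object.entries(moves)) {
        const next = Math.min(n - 1, Math.max(0, s + d));
        const reward = next === n - 1 ? 1 : 0;
        const q = reward + gamma * v[next]; // Bellman optimality target
        if (q > bestValue) { bestValue = q; bestAction = name; }
      }
      delta = Math.max(delta, Math.abs(bestValue - v[s]));
      v[s] = bestValue;
      policy[s] = bestAction;
    }
  } while (delta > theta); // stop once a full sweep changes nothing noticeable
  return { v, policy };
}

const { v, policy } = valueIteration(5, 0.9);
console.log(v.map((x) => x.toFixed(3)), policy);
```

With γ = 0.9 the values fall off geometrically with distance from the goal (1, 0.9, 0.81, 0.729 reading from the goal backwards), and the greedy policy points right in every non-terminal state.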

Monte Carlo & TD Learning

Model-free: SARSA, Q-Learning, Expected SARSA, Monte Carlo control. The cliff-walking classic, side-by-side.

sarsa · q-learning · α
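The three TD control methods differ only in the target they bootstrap from. The sketch below puts those targets side by side for a single transition; the function names and sample numbers are assumptions for illustration.

```javascript
// TD targets for one transition (s, a, r, s'), given the Q-values of s'.
// Function names and the sample numbers are illustrative assumptions.
function sarsaTarget(r, gamma, qNext, nextAction) {
  return r + gamma * qNext[nextAction]; // uses the action actually taken in s'
}
function qLearningTarget(r, gamma, qNext) {
  return r + gamma * Math.max(...qNext); // uses the greedy action in s'
}
function expectedSarsaTarget(r, gamma, qNext, epsilon) {
  // Expectation over an epsilon-greedy policy in s'.
  const greedy = qNext.indexOf(Math.max(...qNext));
  const expected = qNext.reduce((sum, q, a) => {
    const p = epsilon / qNext.length + (a === greedy ? 1 - epsilon : 0);
    return sum + p * q;
  }, 0);
  return r + gamma * expected;
}
// Shared update: Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a)).
function tdUpdate(q, target, alpha) {
  return q + alpha * (target - q);
}

const qNext = [0, 4]; // Q-values of the two actions available in s'
console.log(sarsaTarget(-1, 1, qNext, 0));           // -1
console.log(qLearningTarget(-1, 1, qNext));          // 3
console.log(expectedSarsaTarget(-1, 1, qNext, 0.5)); // 2
```

The gap between the SARSA and Q-learning targets when the behaviour policy explores is what produces the different paths the two methods learn on cliff walking.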

Function Approximation

Tabular doesn't scale. Watch tile-coding, RBFs, and polynomial features approximate a value function with SGD.

features · sgd · generalisation
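As a rough sketch of approximating a value function with SGD, the example below fits a linear model over polynomial features. The target function, step size, sample states, and the names `features`, `predict`, and `sgdFit` are all assumptions for illustration; tile-coding or RBFs would only change `features`.

```javascript
// Linear value-function approximation with polynomial features, trained by SGD.
// Target function, step size, and names are illustrative assumptions.
const features = (s) => [1, s, s * s]; // phi(s)
const predict = (w, s) => features(s).reduce((sum, f, i) => sum + w[i] * f, 0);

function sgdFit(target, samples, alpha, epochs) {
  const w = [0, 0, 0];
  for (let e = 0; e < epochs; e++) {
    for (const s of samples) {
      const error = target(s) - predict(w, s);
      const phi = features(s);
      // w <- w + alpha * (target - v_hat(s)) * phi(s)
      for (let i = 0; i < w.length; i++) w[i] += alpha * error * phi[i];
    }
  }
  return w;
}

// Hypothetical "true" value function: 1 - (s - 0.5)^2 = 0.75 + s - s^2.
const trueValue = (s) => 1 - (s - 0.5) ** 2;
const states = [0, 0.25, 0.5, 0.75, 1];
const w = sgdFit(trueValue, states, 0.1, 5000);
console.log(w.map((x) => x.toFixed(2)));
```

Three weights stand in for the whole state space here, and the fitted model also predicts sensible values at states it never saw, which is the generalisation that tables cannot offer.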

Policy Gradients (REINFORCE)

The namesake algorithm. Sample full episodes, scale the gradient of each action's log-probability by the return that followed it, and steer the policy directly.

∇log π · variance
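A minimal sketch of the REINFORCE update, assuming a tabular softmax policy rather than a neural network. The names `softmax` and `reinforceUpdate`, the episode format, and the omission of the γᵗ weighting on each step are simplifying assumptions for this example.

```javascript
// REINFORCE update for a tabular softmax policy, theta[state][action].
// All names and the sample episode are illustrative assumptions.
function softmax(prefs) {
  const m = Math.max(...prefs); // subtract the max for numerical stability
  const exps = prefs.map((p) => Math.exp(p - m));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

function reinforceUpdate(theta, episode, alpha, gamma) {
  // Returns-to-go, computed backwards: G_t = r_t + gamma * G_{t+1}.
  const returns = new Array(episode.length);
  let g = 0;
  for (let t = episode.length - 1; t >= 0; t--) {
    g = episode[t].reward + gamma * g;
    returns[t] = g;
  }
  episode.forEach(({ state, action }, t) => {
    const probs = softmax(theta[state]);
    // For softmax, d log pi(a|s) / d theta[s][b] = 1{a = b} - pi(b|s).
    probs.forEach((p, b) => {
      const gradLogPi = (b === action ? 1 : 0) - p;
      theta[state][b] += alpha * returns[t] * gradLogPi;
    });
  });
  return theta;
}

const theta = [[0, 0]]; // one state, two actions
reinforceUpdate(theta, [{ state: 0, action: 1, reward: 1 }], 0.5, 1);
console.log(theta, softmax(theta[0])); // action 1 becomes more likely
```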

Actor-Critic (A2C)

Add a learned baseline. Two networks: one acts, one critiques. See how variance drops compared to vanilla REINFORCE.

baseline · advantage
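The actor and critic roles can be sketched with tables in place of the two networks. This is a one-step actor-critic update under that simplification; the names `actorCriticStep`, `alphaActor`, and `alphaCritic`, and the sample transition, are assumptions for illustration.

```javascript
// One-step actor-critic update with a tabular actor (theta) and critic (v).
// Names, step sizes, and the sample transition are illustrative assumptions.
function softmax(prefs) {
  const m = Math.max(...prefs);
  const exps = prefs.map((p) => Math.exp(p - m));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

function actorCriticStep(theta, v, { state, action, reward, next, done }, opts) {
  const { alphaActor, alphaCritic, gamma } = opts;
  // The TD error doubles as an advantage estimate:
  // delta = r + gamma * V(s') - V(s), with V(s') = 0 at episode end.
  const bootstrap = done ? 0 : v[next];
  const delta = reward + gamma * bootstrap - v[state];
  // Actor: move along delta * grad log pi(a|s).
  const probs = softmax(theta[state]);
  probs.forEach((p, b) => {
    theta[state][b] += alphaActor * delta * ((b === action ? 1 : 0) - p);
  });
  // Critic: move V(s) toward the TD target.
  v[state] += alphaCritic * delta;
  return delta;
}

const actor = [[0, 0], [0, 0]];
const critic = [0.5, 1.0];
const delta = actorCriticStep(
  actor, critic,
  { state: 0, action: 0, reward: 0, next: 1, done: false },
  { alphaActor: 0.1, alphaCritic: 0.5, gamma: 0.9 }
);
console.log(delta, critic, actor); // delta is about 0.4
```

Replacing the full Monte Carlo return with the critic's one-step estimate is what reduces variance, at the cost of some bias from the imperfect critic.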

Deep Q-Networks

Neural Q-function for continuous state spaces. Replay buffer, target network, ε-decay — the trio that beat Atari.

replay · target net
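The three ingredients can be sketched without any neural network code. In the example below, `ReplayBuffer`, `dqnTarget`, and `epsilonAt` are hypothetical names, `targetQ` is a stand-in for the target network (any function from a state to an array of Q-values), and the decay constants are assumed defaults.

```javascript
// Fixed-capacity replay buffer, the DQN bootstrap target, and epsilon decay.
// Class and function names are illustrative assumptions; `targetQ` stands in
// for the target network and is any function state -> array of Q-values.
class ReplayBuffer {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = [];
    this.cursor = 0;
  }
  push(transition) {
    // Overwrite the oldest entry once the buffer is full.
    if (this.items.length < this.capacity) this.items.push(transition);
    else this.items[this.cursor] = transition;
    this.cursor = (this.cursor + 1) % this.capacity;
  }
  sample(batchSize, rng = Math.random) {
    // Uniform sampling with replacement breaks up temporal correlation.
    return Array.from({ length: batchSize },
      () => this.items[Math.floor(rng() * this.items.length)]);
  }
}

// y = r                                     if the episode ended,
// y = r + gamma * max_a' Q_target(s', a')   otherwise.
function dqnTarget({ reward, next, done }, targetQ, gamma) {
  return done ? reward : reward + gamma * Math.max(...targetQ(next));
}

// Exponential epsilon decay toward a floor.
const epsilonAt = (step, start = 1.0, end = 0.05, decay = 0.995) =>
  Math.max(end, start * decay ** step);

const buffer = new ReplayBuffer(3);
for (let i = 0; i < 5; i++) buffer.push({ reward: i, next: [i], done: i === 4 });
console.log(buffer.items.map((t) => t.reward)); // [3, 4, 2]
```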

Evolutionary Methods

No gradients, no problem. A genetic algorithm evolves a population of policies to balance a pole.

selection · mutation
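A minimal sketch of the selection and mutation loop, assuming truncation selection, Gaussian mutation, and elitism. Simulating a pole is out of scope here, so a toy fitness function stands in for it; `evolve`, `gaussian`, `toyFitness`, and every hyperparameter are assumptions for this example.

```javascript
// Minimal genetic algorithm: truncation selection, Gaussian mutation, elitism.
// A toy fitness function replaces the pole-balancing simulation (assumption).
function gaussian(rng) {
  // Box-Muller transform; 1 - rng() avoids log(0).
  return Math.sqrt(-2 * Math.log(1 - rng())) * Math.cos(2 * Math.PI * rng());
}

function evolve(fitness, { genomeSize, popSize, generations, sigma, rng = Math.random }) {
  let population = Array.from({ length: popSize },
    () => Array.from({ length: genomeSize }, () => gaussian(rng)));
  for (let g = 0; g < generations; g++) {
    // Selection: rank by fitness and keep the top quarter as parents.
    const ranked = population
      .map((genome) => ({ genome, score: fitness(genome) }))
      .sort((a, b) => b.score - a.score);
    const parents = ranked.slice(0, Math.max(1, Math.floor(popSize / 4)));
    // Elitism: the best genome survives unchanged.
    const nextGen = [parents[0].genome];
    while (nextGen.length < popSize) {
      const parent = parents[Math.floor(rng() * parents.length)].genome;
      // Mutation: add Gaussian noise to every gene.
      nextGen.push(parent.map((gene) => gene + sigma * gaussian(rng)));
    }
    population = nextGen;
  }
  // Return the fittest genome of the final population.
  return population.reduce((best, g) => (fitness(g) > fitness(best) ? g : best));
}

// Toy fitness (assumption): closeness of the genome to a fixed target vector.
const goal = [1, -2, 0.5];
const toyFitness = (genome) => -genome.reduce((s, x, i) => s + (x - goal[i]) ** 2, 0);
const champion = evolve(toyFitness,
  { genomeSize: 3, popSize: 40, generations: 60, sigma: 0.2 });
console.log(champion.map((x) => x.toFixed(2)), toyFitness(champion).toFixed(4));
```

In a pole-balancing setting the genome would hold policy weights and `fitness` would return how long that policy keeps the pole upright; no gradient of the fitness is ever needed.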

Multi-Agent RL

The Axelrod tournament. Pit Tit-for-Tat against Always Defect, Grim, Pavlov, and friends in the iterated prisoner's dilemma.

game theory · cooperation
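A single match of such a tournament can be sketched as follows. The payoff values are the conventional T=5, R=3, P=1, S=0; whether the site uses the same numbers is an assumption, as are the names `PAYOFF`, `strategies`, and `playMatch`.

```javascript
// Iterated prisoner's dilemma with the conventional payoffs T=5, R=3, P=1, S=0.
// Strategies receive (myHistory, theirHistory) and return "C" or "D".
// Names and payoff values are illustrative assumptions.
const PAYOFF = { CC: [3, 3], CD: [0, 5], DC: [5, 0], DD: [1, 1] };

const strategies = {
  // Cooperate first, then copy the opponent's previous move.
  titForTat: (mine, theirs) => (theirs.length ? theirs[theirs.length - 1] : "C"),
  alwaysDefect: () => "D",
  // Cooperate until the opponent defects once, then defect forever.
  grim: (mine, theirs) => (theirs.includes("D") ? "D" : "C"),
  // Pavlov (win-stay, lose-shift): cooperate iff both made the same move last round.
  pavlov: (mine, theirs) =>
    !mine.length || mine[mine.length - 1] === theirs[theirs.length - 1] ? "C" : "D",
};

function playMatch(a, b, rounds) {
  const histA = [], histB = [];
  let scoreA = 0, scoreB = 0;
  for (let i = 0; i < rounds; i++) {
    // Both players choose simultaneously from the history so far.
    const moveA = a(histA, histB);
    const moveB = b(histB, histA);
    const [pa, pb] = PAYOFF[moveA + moveB];
    scoreA += pa;
    scoreB += pb;
    histA.push(moveA);
    histB.push(moveB);
  }
  return [scoreA, scoreB];
}

console.log(playMatch(strategies.titForTat, strategies.alwaysDefect, 10)); // [9, 14]
console.log(playMatch(strategies.titForTat, strategies.grim, 10));         // [30, 30]
```

Always Defect wins its head-to-head match, yet two cooperating strategies each earn 30 from the same ten rounds, which is why retaliatory but forgiving strategies tend to do well over a whole tournament.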

How to use this site

Each page is self-contained. Open it, hit Run, and play with the sliders. The maths panel on each demo shows the exact update rule being applied — change a parameter and watch its effect immediately.

Nothing is precomputed. Every chart on every page is a real algorithm running in your browser. Source: vanilla JS, no frameworks.
