Reinforcement Learning,
Made Tangible.
Every core concept from the course — from bandits to actor-critic — rendered as a live demo you can poke, tweak, and watch learn in real time.
The Foundations
An agent acts in an environment. The environment returns a state and a reward. The agent's job: learn a policy π(a|s) that maximises the expected discounted return Gₜ = Σₖ γᵏ rₜ₊ₖ₊₁, where γ ∈ [0, 1] is the discount factor. Every page below explores one slice of how that learning happens.
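As a rough illustration, the discounted return of one episode can be computed in a single backward pass over its rewards. The function name `discountedReturn` and the plain reward array are assumptions made for this sketch, not taken from the site's source.

```javascript
// Discounted return: sum over k of gamma^k * rewards[k],
// where rewards[k] is the reward received k steps after the current one.
function discountedReturn(rewards, gamma) {
  let g = 0;
  // Walk backwards so each step applies G_k = r_k + gamma * G_{k+1}.
  for (let k = rewards.length - 1; k >= 0; k--) {
    g = rewards[k] + gamma * g;
  }
  return g;
}

// Example: 1 + 0.5 * 0 + 0.25 * 2 = 1.5
const exampleReturn = discountedReturn([1, 0, 2], 0.5);
```

With γ close to 0 the agent is short-sighted; with γ close to 1, distant rewards count almost as much as immediate ones.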
01 Multi-Armed Bandits
The exploration–exploitation dilemma in its purest form. ε-greedy, UCB, Thompson sampling, gradient bandits — pull arms and watch regret accumulate.
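A minimal sketch of ε-greedy on a Bernoulli bandit, tracking expected regret, might look like the following. The names `mulberry32` and `runEpsilonGreedy`, the Bernoulli reward model, and the seeded generator are all assumptions for this example; the site's own demo may be structured differently.

```javascript
// Small seeded PRNG (mulberry32) so runs are repeatable. Assumed for this sketch.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// armProbs[i] is the (hidden) probability that arm i pays a reward of 1.
function runEpsilonGreedy(armProbs, epsilon, steps, rng) {
  const counts = new Array(armProbs.length).fill(0);
  const estimates = new Array(armProbs.length).fill(0);
  const bestProb = Math.max(...armProbs);
  let regret = 0;
  for (let t = 0; t < steps; t++) {
    let arm;
    if (rng() < epsilon) {
      // Explore: pick an arm uniformly at random.
      arm = Math.floor(rng() * armProbs.length);
    } else {
      // Exploit: pick the arm with the highest estimate (first one on ties).
      arm = estimates.indexOf(Math.max(...estimates));
    }
    const reward = rng() < armProbs[arm] ? 1 : 0;
    counts[arm] += 1;
    // Incremental sample-mean update of the arm's value estimate.
    estimates[arm] += (reward - estimates[arm]) / counts[arm];
    // Expected regret: gap between the best arm and the arm actually pulled.
    regret += bestProb - armProbs[arm];
  }
  return { counts, estimates, regret };
}

const result = runEpsilonGreedy([0.2, 0.5, 0.8], 0.1, 1000, mulberry32(42));
```

Setting `epsilon` to 0 gives a purely greedy agent, which can lock onto a poor arm early; that failure mode is what the exploration term is meant to avoid.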
exploration · regret
02 MDPs & Gridworld
States, actions, transitions, rewards. Draw your own gridworld, place rewards and walls, and see the Bellman equation in action.
states · actions · γ
03 Dynamic Programming
Value Iteration and Policy Iteration. Sweep the Bellman optimality update across the grid and watch v* and π* converge.
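To make the sweep concrete, here is a hedged sketch of value iteration on a small deterministic grid. The setup is assumed for illustration: four moves, a reward of -1 per step, a single terminal goal cell, and moves off the grid leaving the agent in place. The function name `valueIteration` is hypothetical.

```javascript
// Value iteration on a rows x cols grid with one terminal goal cell.
// Every move costs -1; moves that would leave the grid keep the agent in place.
function valueIteration(rows, cols, goal, gamma = 0.9, tol = 1e-9) {
  const moves = [[-1, 0], [1, 0], [0, -1], [0, 1]]; // up, down, left, right
  const v = Array.from({ length: rows }, () => new Array(cols).fill(0));
  let delta;
  do {
    delta = 0;
    for (let r = 0; r < rows; r++) {
      for (let c = 0; c < cols; c++) {
        if (r === goal[0] && c === goal[1]) continue; // terminal value stays 0
        let best = -Infinity;
        for (const [dr, dc] of moves) {
          const nr = Math.min(rows - 1, Math.max(0, r + dr));
          const nc = Math.min(cols - 1, Math.max(0, c + dc));
          // Bellman optimality backup: max over actions of r + gamma * v(s').
          best = Math.max(best, -1 + gamma * v[nr][nc]);
        }
        delta = Math.max(delta, Math.abs(best - v[r][c]));
        v[r][c] = best; // in-place update within the sweep
      }
    }
  } while (delta > tol);
  return v;
}

// With gamma = 1, each value is minus the number of steps to the goal.
const vStar = valueIteration(3, 3, [2, 2], 1.0);
```

A greedy policy π* then follows by picking, in each cell, the move whose successor has the highest backed-up value.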
bellman · planning
04 Monte Carlo & TD Learning
Model-free: SARSA, Q-Learning, Expected SARSA, Monte Carlo control. The cliff-walking classic, side-by-side.
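The difference between SARSA and Q-Learning comes down to the bootstrap target. The sketch below shows the two tabular updates side by side; the function names and the `Q[state][action]` array layout are assumptions for this example, and terminal states are assumed to hold all-zero rows.

```javascript
// Q is a table indexed as Q[state][action]; alpha is the step size.

// Q-Learning (off-policy): bootstrap from the best action in the next state.
function qLearningUpdate(Q, s, a, r, sNext, alpha, gamma) {
  const target = r + gamma * Math.max(...Q[sNext]);
  Q[s][a] += alpha * (target - Q[s][a]);
}

// SARSA (on-policy): bootstrap from the action the agent actually takes next.
function sarsaUpdate(Q, s, a, r, sNext, aNext, alpha, gamma) {
  const target = r + gamma * Q[sNext][aNext];
  Q[s][a] += alpha * (target - Q[s][a]);
}

// Same transition, different targets.
const Q = [[0, 0], [1, 3]];
qLearningUpdate(Q, 0, 0, 1, 1, 0.5, 1.0); // target 1 + max(1, 3) = 4
sarsaUpdate(Q, 0, 1, 1, 1, 0, 0.5, 1.0);  // target 1 + Q[1][0] = 2
```

On cliff walking, this is the reason the two methods tend to learn different paths: SARSA's target accounts for its own exploratory moves, while Q-Learning's target assumes greedy behaviour from the next state onward.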
sarsa · q-learning · α
05 Function Approximation
Tabular methods don't scale. Watch tile coding, RBFs, and polynomial features approximate a value function with SGD.
features · sgd · generalisation
06 Policy Gradients (REINFORCE)
The classic Monte Carlo policy-gradient method. Sample full episodes, weight the gradient of each action's log-probability by the return that followed it, and steer the policy directly.
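One REINFORCE update for a tabular softmax policy could be sketched as follows. Everything here is an assumption for illustration: the names `softmax` and `reinforceUpdate`, the `theta[state][action]` preference table, and the episode format `{ s, a, r }`. This variant also omits the γᵗ factor that some textbook presentations include.

```javascript
// Numerically stable softmax over a row of action preferences.
function softmax(prefs) {
  const m = Math.max(...prefs);
  const exps = prefs.map((p) => Math.exp(p - m));
  const z = exps.reduce((acc, e) => acc + e, 0);
  return exps.map((e) => e / z);
}

// episode: array of { s, a, r }, where r is the reward received after taking a in s.
// theta[s][b] is the preference for action b in state s and is updated in place.
function reinforceUpdate(theta, episode, alpha, gamma) {
  const grad = theta.map((row) => row.map(() => 0));
  let g = 0;
  for (let t = episode.length - 1; t >= 0; t--) {
    const { s, a, r } = episode[t];
    g = r + gamma * g; // return from step t onward
    const pi = softmax(theta[s]);
    for (let b = 0; b < pi.length; b++) {
      // For a softmax policy, d log pi(a|s) / d theta[s][b] = 1{b == a} - pi(b|s).
      grad[s][b] += g * ((b === a ? 1 : 0) - pi[b]);
    }
  }
  // Apply the accumulated gradient-ascent step once per episode.
  for (let s = 0; s < theta.length; s++) {
    for (let b = 0; b < theta[s].length; b++) {
      theta[s][b] += alpha * grad[s][b];
    }
  }
  return theta;
}

const theta = reinforceUpdate([[0, 0]], [{ s: 0, a: 1, r: 1 }], 1.0, 1.0);
```

Because each update is scaled by a full sampled return, a single lucky or unlucky episode can move the policy a long way, which is the variance problem the tags on this card refer to.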
∇log π · variance
07 Actor-Critic (A2C)
Add a learned baseline. Two networks: one acts, one critiques. See how variance drops compared to vanilla REINFORCE.
baseline · advantage
08 Deep Q-Networks
Neural Q-function for continuous state spaces. Replay buffer, target network, ε-decay — the trio that beat Atari.
replay · target net
09 Evolutionary Methods
No gradients, no problem. A genetic algorithm evolves a population of policies to balance a pole.
selection · mutation
10 Multi-Agent RL
The Axelrod tournament. Pit Tit-for-Tat against Always Defect, Grim, Pavlov, and friends in iterated prisoner's dilemma.
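A single match in such a tournament can be sketched in a few lines. The payoff values below (T=5, R=3, P=1, S=0) are the conventional ones and are assumed here, as are the names `PAYOFF`, `titForTat`, `alwaysDefect`, and `playMatch`; the site's demo may use different settings.

```javascript
// Payoff to the first player, keyed by "myMove + theirMove".
// Conventional values, assumed for this sketch: T=5, R=3, P=1, S=0.
const PAYOFF = { CC: 3, CD: 0, DC: 5, DD: 1 };

// A strategy maps the opponent's move history to "C" (cooperate) or "D" (defect).
const titForTat = (oppHistory) =>
  oppHistory.length === 0 ? "C" : oppHistory[oppHistory.length - 1];
const alwaysDefect = () => "D";

// Play a fixed number of rounds and return both players' total scores.
function playMatch(stratA, stratB, rounds) {
  const movesA = [];
  const movesB = [];
  let scoreA = 0;
  let scoreB = 0;
  for (let i = 0; i < rounds; i++) {
    // Both players choose simultaneously, seeing only past moves.
    const a = stratA(movesB);
    const b = stratB(movesA);
    scoreA += PAYOFF[a + b];
    scoreB += PAYOFF[b + a];
    movesA.push(a);
    movesB.push(b);
  }
  return [scoreA, scoreB];
}

const vsDefector = playMatch(titForTat, alwaysDefect, 5); // [4, 9]
const vsItself = playMatch(titForTat, titForTat, 5);      // [15, 15]
```

Tit-for-Tat loses the head-to-head against Always Defect but scores far higher against a cooperative partner, which is why totals across the whole tournament matter more than any single match.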
game theory · cooperation
How to use this site
Nothing is precomputed. Every chart on every page is a real algorithm running in your browser. Source: vanilla JS, no frameworks.