Worked example · Python

From Tabular Q-Learning to DQN
on a Control Task

One project, two algorithms, one clean comparison. We solve a discrete gridworld (FrozenLake) with tabular Q-learning using nothing but NumPy, then scale the very same control idea up to continuous state with a Deep Q-Network on CartPole using PyTorch — and measure what we actually gain (and lose) by going deep.

StackPython · NumPy · PyTorch
EnvironmentsFrozenLake · CartPole
Sessions exercised4–11, 13–15, 21
DifficultyBasic → Intermediate
Est. effort6–8 hours

1 · Overview

The single most important transition in a first RL course is the jump from tabular methods — where every state–action value lives in an array you can print — to function approximation, where a neural network estimates those values for states you have never even visited. This project makes that transition concrete on a single control problem.

Goal. Learn a control policy that maximises expected discounted return, first in a tiny discrete world and then in a continuous one, and answer two practical questions a practitioner always asks:

We deliberately keep the algorithmic core identical across both parts — a temporal-difference Q-update with ε-greedy exploration — so that every difference we observe is attributable to the representation (a table vs. a network), not to a different learning rule.

How this maps to the syllabus. Part 3 (tabular Q-learning) is the core of Sessions 9–10 (Temporal Differences; A First Practice). Part 4 (DQN) realises Sessions 15 (Approximation Methods), and 21 (Deep RL Practice), and leans on the MDP/Bellman machinery of Sessions 5–7. See §7 — mapping to learning outcomes.

Related interactive demos for each idea below:

2 · The RL setup

Both tasks are finite-horizon episodic control problems modelled as a Markov Decision Process (MDP) — exactly the object introduced in Session 5.

MDP:  ⟨ 𝒮, 𝒜, P, R, γ ⟩
P(s′ | s, a) = Pr{ St+1 = s′ | St = s, At = a }
R(s, a) = 𝔼[ Rt+1 | St = s, At = a ],   γ ∈ [0, 1)

The agent seeks a policy π(a | s) maximising the expected discounted return from each state:

Gt = Σk=0 γk Rt+k+1
vπ(s) = 𝔼π[ Gt | St = s ]     (state-value)
qπ(s, a) = 𝔼π[ Gt | St = s, At = a ]     (action-value)

The whole project is built around the action-value function q(s, a), because once we have a good estimate of it the greedy policy is trivial: π(s) = argmax_a q(s, a). The optimal q* satisfies the Bellman optimality equation (Session 9):

q*(s, a) = 𝔼[ Rt+1 + γ maxa′ q*(St+1, a′) | St = s, At = a ]

Exploration. A purely greedy agent that always picks argmax_a q(s,a) may lock onto a bad estimate and never discover a better action. We resolve this exploration–exploitation tension (Session 4, bandits) with an ε-greedy policy and an annealing schedule that explores early and exploits late:

π(a | s) = {  1 − ε + ε/|𝒜|  if a = argmaxa′ q(s, a′);   ε/|𝒜|  otherwise }
εt = εend + (εstart − εend) · e−t / τ     (exponential decay)

3 · Tabular Q-learning (FrozenLake)

FrozenLake-v1 is a 4×4 gridworld: the agent starts top-left, must reach the goal (bottom-right) without stepping into a hole. Reward is +1 on reaching the goal and 0 otherwise — a sparse-reward control task with 16 discrete states and 4 actions. We use the deterministic variant (is_slippery=False) so the learning signal is clean; flip it to True for a stochastic challenge.

Q-learning is an off-policy TD control method: it learns q* while behaving ε-greedily. The update bootstraps the current estimate toward the Bellman target (Session 9):

Q(St, At) ← Q(St, At) + α [ Rt+1 + γ maxa Q(St+1, a) − Q(St, At) ]
δt = Rt+1 + γ maxa Q(St+1, a) − Q(St, At)     (TD error)

Note the max over next-state actions: the target uses the best action regardless of what ε-greedy actually does next. That is precisely what makes it off-policy and what makes it learn q* rather than qπ.

Algorithm (pseudocode)

initialise Q(s, a) = 0 for all s, a
for each episode:
  s ← env.reset()
  repeat until terminal:
    a ← ε-greedy(Q, s)
    s′, r, done ← env.step(a)
    target ← r + γ · (0 if done else maxa′ Q(s′, a′))
    Q(s, a) ← Q(s, a) + α · (target − Q(s, a))
    s ← s′
  decay ε
q_learning.py — pure NumPy, no framework
import numpy as np
import gymnasium as gym


def epsilon_greedy(Q, state, eps, n_actions, rng):
    # With prob eps explore uniformly; otherwise act greedily.
    if rng.random() < eps:
        return rng.integers(n_actions)
    # argmax with random tie-breaking so ties don't bias one action
    q = Q[state]
    return rng.choice(np.flatnonzero(q == q.max()))


def train_q_learning(env, episodes=5000, alpha=0.1, gamma=0.99,
                     eps_start=1.0, eps_end=0.05, eps_decay=0.001, seed=0):
    rng = np.random.default_rng(seed)
    n_states  = env.observation_space.n
    n_actions = env.action_space.n
    Q = np.zeros((n_states, n_actions), dtype=np.float64)
    returns = []

    for ep in range(episodes):
        eps = eps_end + (eps_start - eps_end) * np.exp(-eps_decay * ep)
        state, _ = env.reset(seed=int(rng.integers(1 << 31)))
        done, G = False, 0.0

        while not done:
            action = epsilon_greedy(Q, state, eps, n_actions, rng)
            nxt, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Bellman target: no bootstrap past a terminal state.
            best_next = 0.0 if terminated else Q[nxt].max()
            td_target = reward + gamma * best_next
            td_error  = td_target - Q[state, action]
            Q[state, action] += alpha * td_error          # the Q-learning update

            state, G = nxt, G + reward
        returns.append(G)

    return Q, returns


if __name__ == "__main__":
    env = gym.make("FrozenLake-v1", is_slippery=False)
    Q, returns = train_q_learning(env)

    # Greedy policy and its success rate over 1000 fresh episodes.
    policy = Q.argmax(axis=1)
    wins = 0
    for _ in range(1000):
        s, _ = env.reset()
        done = False
        while not done:
            s, r, term, trunc, _ = env.step(int(policy[s]))
            done = term or trunc
        wins += int(r == 1.0)
    print(f"greedy success rate: {wins / 1000:.1%}")

The learning curve

Plotting a 100-episode moving average of return shows the classic tabular signature: a long, near-flat stretch while ε is high and the sparse reward is rarely hit, a sharp climb once the agent stumbles onto the goal and the value back-propagates along the path, then a clean plateau at the optimum. Because the table has one independent cell per (s,a), there is no interference — once a cell is correct it stays correct.

success rate 1.0 | ............________________ | ......./ 0.8 | .../ | ../ 0.6 | ../ | ./ 0.4 | ./ | ../ 0.2 | ../ | ......./ 0.0 |__________________________________________________________ 0 1k 2k 3k 4k 5k episodes (moving avg of episodic return, deterministic FrozenLake)

On the deterministic map the greedy policy reaches 100% success after roughly 1.5–2k episodes (≈ a few thousand environment steps). The full 4×16 Q-table can be printed and inspected by hand — the entire benefit, and the entire limitation, of tabular RL.

4 · Deep Q-Network (CartPole)

CartPole-v1 has a continuous 4-D state — cart position, cart velocity, pole angle, pole angular velocity — and 2 actions (push left / right). A table is impossible: there are infinitely many states. We replace Q(s, a) with a neural network Q(s, a; θ) that generalises across nearby states (Session 15), and train it to satisfy the same Bellman target.

Naively bootstrapping a network off itself diverges — the famous "deadly triad" of function approximation + bootstrapping + off-policy learning. DQN (Session 21) tames it with two ideas:

TrickWhat it fixes
Replay bufferConsecutive transitions are highly correlated; SGD assumes i.i.d. samples. Storing transitions and sampling random mini-batches decorrelates updates and reuses data (sample efficiency).
Target networkThe TD target itself depends on θ. Bootstrapping off the same weights you are updating chases a moving target. A frozen copy θ⁻, synced every C steps, holds the target still.

The loss is the mean-squared TD error over a mini-batch B sampled from the replay buffer D:

yi = ri + γ (1 − donei) maxa′ Q(s′i, a′; θ⁻)     (target, uses θ⁻)
L(θ) = (1/|B|) Σi∈B [ yi − Q(si, ai; θ) ]2
θ ← θ − α ∇θ L(θ)  ··  every C steps:  θ⁻ ← θ

Algorithm (pseudocode)

initialise θ, θ⁻ ← θ, empty replay buffer D
for each step:
  a ← ε-greedy(Q(·; θ), s);  s′, r, done ← env.step(a)
  store (s, a, r, s′, done) in D;  s ← s′
  sample mini-batch B ∼ D
  y ← r + γ (1 − done) maxa′ Q(s′, a′; θ⁻)
  θ ← θ − α ∇θ (y − Q(s, a; θ))2
  every C steps: θ⁻ ← θ;   decay ε
dqn.py — PyTorch + Gymnasium
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn
import gymnasium as gym


class QNet(nn.Module):
    """Maps a state to one Q-value per action: s -> [Q(s,a0), Q(s,a1)]."""
    def __init__(self, n_obs, n_act, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_obs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_act),
        )

    def forward(self, x):
        return self.net(x)


class ReplayBuffer:
    def __init__(self, capacity=50_000):
        self.buf = deque(maxlen=capacity)

    def push(self, *transition):
        self.buf.append(transition)            # (s, a, r, s', done)

    def sample(self, batch_size):
        batch = random.sample(self.buf, batch_size)
        s, a, r, s2, d = zip(*batch)
        return (torch.as_tensor(np.array(s), dtype=torch.float32),
                torch.as_tensor(a, dtype=torch.int64).unsqueeze(1),
                torch.as_tensor(r, dtype=torch.float32).unsqueeze(1),
                torch.as_tensor(np.array(s2), dtype=torch.float32),
                torch.as_tensor(d, dtype=torch.float32).unsqueeze(1))

    def __len__(self):
        return len(self.buf)


def train_dqn(env_id="CartPole-v1", total_steps=60_000, gamma=0.99,
              lr=1e-3, batch_size=64, warmup=1_000, target_sync=500,
              eps_start=1.0, eps_end=0.05, eps_decay=10_000, seed=0):
    torch.manual_seed(seed); random.seed(seed); np.random.seed(seed)
    env = gym.make(env_id)
    n_obs = env.observation_space.shape[0]
    n_act = env.action_space.n

    online = QNet(n_obs, n_act)
    target = QNet(n_obs, n_act)
    target.load_state_dict(online.state_dict())   # theta_minus <- theta
    target.eval()

    optim = torch.optim.Adam(online.parameters(), lr=lr)
    buffer = ReplayBuffer()
    returns, ep_return = [], 0.0

    state, _ = env.reset(seed=seed)
    for step in range(1, total_steps + 1):
        eps = eps_end + (eps_start - eps_end) * np.exp(-step / eps_decay)

        # --- act (epsilon-greedy on the online network) ---
        if random.random() < eps:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                q = online(torch.as_tensor(state, dtype=torch.float32))
                action = int(q.argmax().item())

        nxt, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        buffer.push(state, action, reward, nxt, float(terminated))
        state, ep_return = nxt, ep_return + reward

        if done:
            returns.append(ep_return)
            state, _ = env.reset()
            ep_return = 0.0

        # --- learn (one gradient step once the buffer has warmed up) ---
        if len(buffer) >= warmup:
            s, a, r, s2, d = buffer.sample(batch_size)

            q_sa = online(s).gather(1, a)                      # Q(s,a; theta)
            with torch.no_grad():
                max_q_next = target(s2).max(dim=1, keepdim=True).values
                y = r + gamma * (1.0 - d) * max_q_next          # TD target via theta_minus

            loss = nn.functional.smooth_l1_loss(q_sa, y)        # Huber: robust to outliers
            optim.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(online.parameters(), 10.0)
            optim.step()

        # --- periodically sync the target network ---
        if step % target_sync == 0:
            target.load_state_dict(online.state_dict())

    return online, returns


if __name__ == "__main__":
    net, returns = train_dqn()
    last100 = np.mean(returns[-100:])
    print(f"mean return over last 100 episodes: {last100:.1f} (solved = 475)")
One subtlety worth its own callout. The target uses (1 − terminated), not (1 − done). CartPole's truncated flag fires at the 500-step time limit — but the pole has not fallen, so the value of the next state is genuinely non-zero. Zeroing it out there teaches the agent the episode "ends badly" at 500 steps, which silently caps performance. Bootstrap past truncation, stop only at true termination.

5 · Results & comparison

CartPole's return is the number of steps the pole stays up (max 500); the task is considered "solved" at a 100-episode average ≥ 475. The DQN reward curve is far noisier than the tabular one — long stretches of improvement punctuated by sudden collapses where a stale target or an unlucky batch destabilises the value estimates, followed by recovery.

return (steps) 500 | . .. ......___.. ___......._____ | . . :. :: : : : : 400 | . :: :: : | .. :: 300 | .::: | .:: 200 | ..:: | ..: 100 | .....: |..: 0 |__________________________________________________________ 0 10k 20k 30k 40k 50k 60k env steps (per-episode return; note the variance vs. the tabular curve)
Dimension Tabular Q-learning DQN
EnvironmentFrozenLake (16 states, discrete)CartPole (4-D continuous state)
RepresentationQ-table, 16×4 = 64 numbersMLP, ≈ 17k parameters
GeneralisationNone — each (s,a) independentYes — shares structure across states
Sample efficiencyHigh per state, but cannot scale beyond tiny state spacesLower per step; replay reuses data to compensate
StabilityProvably converges (under standard step-size conditions)No convergence guarantee; needs replay + target net + clipping
Learning curveMonotone-ish, clean plateauNoisy, prone to transient collapses
Steps to "solve"≈ 2–4k env steps (det. map)≈ 30–50k env steps
Wall-clock / stepMicroseconds (array write)Milliseconds (forward + backward pass)
InterpretabilityFull — print the tableOpaque — values are network outputs

The headline. Going deep buys you scalability to continuous / high-dimensional states — the only reason to pay for it — at the cost of stability and sample efficiency. On a problem small enough for a table, a table wins on every axis. The art of practical RL is knowing which regime you are in.

6 · Pitfalls & hyperparameters

7 · Mapping to learning outcomes

This project deliberately exercises a connected slice of the syllabus rather than a single session:

Course sessionExercised in this project
S4 · Multi-Armed Banditsε-greedy exploration and the exploration–exploitation trade-off, reused in both agents.
S5–S6 · MDPsBoth tasks framed as ⟨𝒮,𝒜,P,R,γ⟩; states, actions, discounting.
S7 · Dynamic ProgrammingBellman optimality equation — the fixed point both methods approximate.
S9 · Temporal DifferencesThe Q-learning TD update and TD error δ are the algorithmic core.
S10 · A First PracticeEnd-to-end implementation + comparison of methods on a task — exactly this deliverable.
S11 · Learning & PlanningBoth agents are model-free (learn from samples, no transition model).
S13 · RL FrameworksGymnasium environment API (reset/step); PyTorch for the network.
S15 · Approximation MethodsThe leap from a table to an ANN value approximator Q(s,a;θ).
S21 · Deep RL PracticeReplay buffer + target network + the DQN loss, end to end.

Per the syllabus's assignment expectations (Session 10/21), the deliverable is a Python implementation plus an analysis of the results — the comparison table and curve discussion in §5 are that analysis.

8 · Extensions

9 · References

← Course Outline  ·  All Demos  ·  DQN demo →