Worked example · Python

From Tabular Q-Learning to DQN
on a Control Task

One project, two algorithms, one clean comparison. We solve a discrete gridworld (FrozenLake) with tabular Q-learning using nothing but NumPy, then scale the very same control idea up to continuous state with a Deep Q-Network on CartPole using PyTorch — and measure what we actually gain (and lose) by going deep.

StackPython · NumPy · PyTorch

EnvironmentsFrozenLake · CartPole

Sessions exercised4–11, 13–15, 21

DifficultyBasic → Intermediate

Est. effort6–8 hours

1 · Overview

The single most important transition in a first RL course is the jump from tabular methods — where every state–action value lives in an array you can print — to function approximation, where a neural network estimates those values for states you have never even visited. This project makes that transition concrete on a single control problem.

Goal. Learn a control policy that maximises expected discounted return, first in a tiny discrete world and then in a continuous one, and answer two practical questions a practitioner always asks:

Sample efficiency — how many environment steps until the policy is good?
Stability — does training converge smoothly, or oscillate / diverge?

We deliberately keep the algorithmic core identical across both parts — a temporal-difference Q-update with ε-greedy exploration — so that every difference we observe is attributable to the representation (a table vs. a network), not to a different learning rule.

How this maps to the syllabus. Part 3 (tabular Q-learning) is the core of Sessions 9–10 (Temporal Differences; A First Practice). Part 4 (DQN) realises Sessions 15 (Approximation Methods), and 21 (Deep RL Practice), and leans on the MDP/Bellman machinery of Sessions 5–7. See §7 — mapping to learning outcomes.

Related interactive demos for each idea below:

ε-greedy & exploration → MDPs & Gridworld → SARSA / Q-Learning → Function Approximation → Deep Q-Networks → Course program →

2 · The RL setup

Both tasks are finite-horizon episodic control problems modelled as a Markov Decision Process (MDP) — exactly the object introduced in Session 5.

MDP: ⟨ 𝒮, 𝒜, P, R, γ ⟩
P(s′ | s, a) = Pr{ S_t+1 = s′ | S_t = s, A_t = a }
R(s, a) = 𝔼[ R_t+1 | S_t = s, A_t = a ], γ ∈ [0, 1)

The agent seeks a policy π(a | s) maximising the expected discounted return from each state:

G_t = Σ_k=0^∞ γ^k R_t+k+1
v_π(s) = 𝔼_π[ G_t | S_t = s ] (state-value)
q_π(s, a) = 𝔼_π[ G_t | S_t = s, A_t = a ] (action-value)

The whole project is built around the action-value function q(s, a), because once we have a good estimate of it the greedy policy is trivial: π(s) = argmax_a q(s, a). The optimal q* satisfies the Bellman optimality equation (Session 9):

q*(s, a) = 𝔼[ R_t+1 + γ max_a′ q*(S_t+1, a′) | S_t = s, A_t = a ]

Exploration. A purely greedy agent that always picks argmax_a q(s,a) may lock onto a bad estimate and never discover a better action. We resolve this exploration–exploitation tension (Session 4, bandits) with an ε-greedy policy and an annealing schedule that explores early and exploits late:

π(a | s) = { 1 − ε + ε/|𝒜| if a = argmax_a′ q(s, a′); ε/|𝒜| otherwise }
ε_t = ε_end + (ε_start − ε_end) · e^{−t / τ} (exponential decay)

3 · Tabular Q-learning (FrozenLake)

FrozenLake-v1 is a 4×4 gridworld: the agent starts top-left, must reach the goal (bottom-right) without stepping into a hole. Reward is +1 on reaching the goal and 0 otherwise — a sparse-reward control task with 16 discrete states and 4 actions. We use the deterministic variant (is_slippery=False) so the learning signal is clean; flip it to True for a stochastic challenge.

Q-learning is an off-policy TD control method: it learns q* while behaving ε-greedily. The update bootstraps the current estimate toward the Bellman target (Session 9):

Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_t+1 + γ max_a Q(S_t+1, a) − Q(S_t, A_t) ]
δ_t = R_t+1 + γ max_a Q(S_t+1, a) − Q(S_t, A_t) (TD error)

Note the max over next-state actions: the target uses the best action regardless of what ε-greedy actually does next. That is precisely what makes it off-policy and what makes it learn q* rather than q_π.

Algorithm (pseudocode)

initialise Q(s, a) = 0 for all s, a
for each episode:
  s ← env.reset()
  repeat until terminal:
    a ← ε-greedy(Q, s)
    s′, r, done ← env.step(a)
    target ← r + γ · (0 if done else max_a′ Q(s′, a′))
    Q(s, a) ← Q(s, a) + α · (target − Q(s, a))
    s ← s′
  decay ε

q_learning.py — pure NumPy, no framework

import numpy as np
import gymnasium as gym


def epsilon_greedy(Q, state, eps, n_actions, rng):
    # With prob eps explore uniformly; otherwise act greedily.
    if rng.random() < eps:
        return rng.integers(n_actions)
    # argmax with random tie-breaking so ties don't bias one action
    q = Q[state]
    return rng.choice(np.flatnonzero(q == q.max()))


def train_q_learning(env, episodes=5000, alpha=0.1, gamma=0.99,
                     eps_start=1.0, eps_end=0.05, eps_decay=0.001, seed=0):
    rng = np.random.default_rng(seed)
    n_states  = env.observation_space.n
    n_actions = env.action_space.n
    Q = np.zeros((n_states, n_actions), dtype=np.float64)
    returns = []

    for ep in range(episodes):
        eps = eps_end + (eps_start - eps_end) * np.exp(-eps_decay * ep)
        state, _ = env.reset(seed=int(rng.integers(1 << 31)))
        done, G = False, 0.0

        while not done:
            action = epsilon_greedy(Q, state, eps, n_actions, rng)
            nxt, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Bellman target: no bootstrap past a terminal state.
            best_next = 0.0 if terminated else Q[nxt].max()
            td_target = reward + gamma * best_next
            td_error  = td_target - Q[state, action]
            Q[state, action] += alpha * td_error          # the Q-learning update

            state, G = nxt, G + reward
        returns.append(G)

    return Q, returns


if __name__ == "__main__":
    env = gym.make("FrozenLake-v1", is_slippery=False)
    Q, returns = train_q_learning(env)

    # Greedy policy and its success rate over 1000 fresh episodes.
    policy = Q.argmax(axis=1)
    wins = 0
    for _ in range(1000):
        s, _ = env.reset()
        done = False
        while not done:
            s, r, term, trunc, _ = env.step(int(policy[s]))
            done = term or trunc
        wins += int(r == 1.0)
    print(f"greedy success rate: {wins / 1000:.1%}")

The learning curve

Plotting a 100-episode moving average of return shows the classic tabular signature: a long, near-flat stretch while ε is high and the sparse reward is rarely hit, a sharp climb once the agent stumbles onto the goal and the value back-propagates along the path, then a clean plateau at the optimum. Because the table has one independent cell per (s,a), there is no interference — once a cell is correct it stays correct.

success rate 1.0 | ............________________ | ......./ 0.8 | .../ | ../ 0.6 | ../ | ./ 0.4 | ./ | ../ 0.2 | ../ | ......./ 0.0 |__________________________________________________________ 0 1k 2k 3k 4k 5k episodes (moving avg of episodic return, deterministic FrozenLake)

On the deterministic map the greedy policy reaches 100% success after roughly 1.5–2k episodes (≈ a few thousand environment steps). The full 4×16 Q-table can be printed and inspected by hand — the entire benefit, and the entire limitation, of tabular RL.

4 · Deep Q-Network (CartPole)

CartPole-v1 has a continuous 4-D state — cart position, cart velocity, pole angle, pole angular velocity — and 2 actions (push left / right). A table is impossible: there are infinitely many states. We replace Q(s, a) with a neural network Q(s, a; θ) that generalises across nearby states (Session 15), and train it to satisfy the same Bellman target.

Naively bootstrapping a network off itself diverges — the famous "deadly triad" of function approximation + bootstrapping + off-policy learning. DQN (Session 21) tames it with two ideas:

Trick	What it fixes
Replay buffer	Consecutive transitions are highly correlated; SGD assumes i.i.d. samples. Storing transitions and sampling random mini-batches decorrelates updates and reuses data (sample efficiency).
Target network	The TD target itself depends on θ. Bootstrapping off the same weights you are updating chases a moving target. A frozen copy θ⁻, synced every `C` steps, holds the target still.

The loss is the mean-squared TD error over a mini-batch B sampled from the replay buffer D:

y_i = r_i + γ (1 − done_i) max_a′ Q(s′_i, a′; θ⁻) (target, uses θ⁻)
L(θ) = (1/|B|) Σ_i∈B [ y_i − Q(s_i, a_i; θ) ]²
θ ← θ − α ∇_θ L(θ) ·· every C steps: θ⁻ ← θ

Algorithm (pseudocode)

initialise θ, θ⁻ ← θ, empty replay buffer D
for each step:
  a ← ε-greedy(Q(·; θ), s); s′, r, done ← env.step(a)
  store (s, a, r, s′, done) in D; s ← s′
  sample mini-batch B ∼ D
  y ← r + γ (1 − done) max_a′ Q(s′, a′; θ⁻)
  θ ← θ − α ∇_θ (y − Q(s, a; θ))²
  every C steps: θ⁻ ← θ;   decay ε

dqn.py — PyTorch + Gymnasium

import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn
import gymnasium as gym


class QNet(nn.Module):
    """Maps a state to one Q-value per action: s -> [Q(s,a0), Q(s,a1)]."""
    def __init__(self, n_obs, n_act, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_obs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_act),
        )

    def forward(self, x):
        return self.net(x)


class ReplayBuffer:
    def __init__(self, capacity=50_000):
        self.buf = deque(maxlen=capacity)

    def push(self, *transition):
        self.buf.append(transition)            # (s, a, r, s', done)

    def sample(self, batch_size):
        batch = random.sample(self.buf, batch_size)
        s, a, r, s2, d = zip(*batch)
        return (torch.as_tensor(np.array(s), dtype=torch.float32),
                torch.as_tensor(a, dtype=torch.int64).unsqueeze(1),
                torch.as_tensor(r, dtype=torch.float32).unsqueeze(1),
                torch.as_tensor(np.array(s2), dtype=torch.float32),
                torch.as_tensor(d, dtype=torch.float32).unsqueeze(1))

    def __len__(self):
        return len(self.buf)


def train_dqn(env_id="CartPole-v1", total_steps=60_000, gamma=0.99,
              lr=1e-3, batch_size=64, warmup=1_000, target_sync=500,
              eps_start=1.0, eps_end=0.05, eps_decay=10_000, seed=0):
    torch.manual_seed(seed); random.seed(seed); np.random.seed(seed)
    env = gym.make(env_id)
    n_obs = env.observation_space.shape[0]
    n_act = env.action_space.n

    online = QNet(n_obs, n_act)
    target = QNet(n_obs, n_act)
    target.load_state_dict(online.state_dict())   # theta_minus <- theta
    target.eval()

    optim = torch.optim.Adam(online.parameters(), lr=lr)
    buffer = ReplayBuffer()
    returns, ep_return = [], 0.0

    state, _ = env.reset(seed=seed)
    for step in range(1, total_steps + 1):
        eps = eps_end + (eps_start - eps_end) * np.exp(-step / eps_decay)

        # --- act (epsilon-greedy on the online network) ---
        if random.random() < eps:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                q = online(torch.as_tensor(state, dtype=torch.float32))
                action = int(q.argmax().item())

        nxt, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        buffer.push(state, action, reward, nxt, float(terminated))
        state, ep_return = nxt, ep_return + reward

        if done:
            returns.append(ep_return)
            state, _ = env.reset()
            ep_return = 0.0

        # --- learn (one gradient step once the buffer has warmed up) ---
        if len(buffer) >= warmup:
            s, a, r, s2, d = buffer.sample(batch_size)

            q_sa = online(s).gather(1, a)                      # Q(s,a; theta)
            with torch.no_grad():
                max_q_next = target(s2).max(dim=1, keepdim=True).values
                y = r + gamma * (1.0 - d) * max_q_next          # TD target via theta_minus

            loss = nn.functional.smooth_l1_loss(q_sa, y)        # Huber: robust to outliers
            optim.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(online.parameters(), 10.0)
            optim.step()

        # --- periodically sync the target network ---
        if step % target_sync == 0:
            target.load_state_dict(online.state_dict())

    return online, returns


if __name__ == "__main__":
    net, returns = train_dqn()
    last100 = np.mean(returns[-100:])
    print(f"mean return over last 100 episodes: {last100:.1f} (solved = 475)")

One subtlety worth its own callout. The target uses (1 − terminated), not (1 − done). CartPole's truncated flag fires at the 500-step time limit — but the pole has not fallen, so the value of the next state is genuinely non-zero. Zeroing it out there teaches the agent the episode "ends badly" at 500 steps, which silently caps performance. Bootstrap past truncation, stop only at true termination.

5 · Results & comparison

CartPole's return is the number of steps the pole stays up (max 500); the task is considered "solved" at a 100-episode average ≥ 475. The DQN reward curve is far noisier than the tabular one — long stretches of improvement punctuated by sudden collapses where a stale target or an unlucky batch destabilises the value estimates, followed by recovery.

return (steps) 500 | . .. ......___.. ___......._____ | . . :. :: : : : : 400 | . :: :: : | .. :: 300 | .::: | .:: 200 | ..:: | ..: 100 | .....: |..: 0 |__________________________________________________________ 0 10k 20k 30k 40k 50k 60k env steps (per-episode return; note the variance vs. the tabular curve)

Dimension	Tabular Q-learning	DQN
Environment	FrozenLake (16 states, discrete)	CartPole (4-D continuous state)
Representation	Q-table, 16×4 = 64 numbers	MLP, ≈ 17k parameters
Generalisation	None — each (s,a) independent	Yes — shares structure across states
Sample efficiency	High per state, but cannot scale beyond tiny state spaces	Lower per step; replay reuses data to compensate
Stability	Provably converges (under standard step-size conditions)	No convergence guarantee; needs replay + target net + clipping
Learning curve	Monotone-ish, clean plateau	Noisy, prone to transient collapses
Steps to "solve"	≈ 2–4k env steps (det. map)	≈ 30–50k env steps
Wall-clock / step	Microseconds (array write)	Milliseconds (forward + backward pass)
Interpretability	Full — print the table	Opaque — values are network outputs

The headline. Going deep buys you scalability to continuous / high-dimensional states — the only reason to pay for it — at the cost of stability and sample efficiency. On a problem small enough for a table, a table wins on every axis. The art of practical RL is knowing which regime you are in.

6 · Pitfalls & hyperparameters

ε decays too fast. If exploration collapses before the agent ever reaches the goal/keeps the pole up, it has nothing to learn from. Sparse-reward FrozenLake especially needs a long exploratory phase.
Learning rate too high (DQN). Large lr with a moving target diverges to NaN value estimates. Pair a modest lr=1e-3 with gradient clipping and a Huber loss.
Target sync too frequent. Small C reintroduces the moving-target instability; too large C makes targets stale and learning sluggish. C ≈ 500–1000 steps is a sane default for CartPole.
Replay buffer too small. A tiny buffer keeps samples correlated and recent, defeating its purpose; too large can over-weight obsolete early-policy transitions. 10k–100k is typical.
Forgetting to detach the target. Gradients must not flow through max_a' Q(s', a'; θ⁻). The torch.no_grad() block (or .detach()) is mandatory — omitting it is the single most common DQN bug.
Bootstrapping past true terminals. Multiply the next-state value by (1 − terminated) so a terminal state contributes only its immediate reward.
Over-estimation bias. The max operator systematically over-estimates Q-values. Double DQN (see §8) fixes this cheaply.
γ too low. A small discount makes the agent myopic and unable to value long-horizon goals like reaching a distant FrozenLake cell. Use γ ≈ 0.99.

7 · Mapping to learning outcomes

This project deliberately exercises a connected slice of the syllabus rather than a single session:

Course session	Exercised in this project
S4 · Multi-Armed Bandits	ε-greedy exploration and the exploration–exploitation trade-off, reused in both agents.
S5–S6 · MDPs	Both tasks framed as ⟨𝒮,𝒜,P,R,γ⟩; states, actions, discounting.
S7 · Dynamic Programming	Bellman optimality equation — the fixed point both methods approximate.
S9 · Temporal Differences	The Q-learning TD update and TD error δ are the algorithmic core.
S10 · A First Practice	End-to-end implementation + comparison of methods on a task — exactly this deliverable.
S11 · Learning & Planning	Both agents are model-free (learn from samples, no transition model).
S13 · RL Frameworks	Gymnasium environment API (`reset`/`step`); PyTorch for the network.
S15 · Approximation Methods	The leap from a table to an ANN value approximator `Q(s,a;θ)`.
S21 · Deep RL Practice	Replay buffer + target network + the DQN loss, end to end.

Per the syllabus's assignment expectations (Session 10/21), the deliverable is a Python implementation plus an analysis of the results — the comparison table and curve discussion in §5 are that analysis.

8 · Extensions

Double DQN. Decouple action selection from evaluation to cut over-estimation:
a* = argmax_a′ Q(s′, a′; θ); y = r + γ (1−done) Q(s′, a*; θ⁻)
A two-line change: pick the argmax with the online net, read its value from the target net.
Dueling DQN. Split the head into a state-value stream V(s) and an advantage stream A(s,a), then recombine as Q = V + (A − mean_a A). Learns which states are valuable independently of the action.
Prioritised replay. Sample transitions with probability proportional to |TD error| so surprising transitions are replayed more often.
Policy gradient (REINFORCE). Instead of learning values and acting greedily, parameterise π(a|s;θ) directly and ascend ∇_θ 𝔼[G]. See the REINFORCE demo and Session 16.
A2C / Actor-Critic. Combine a policy (actor) with a learned value baseline (critic) to slash gradient variance — the natural next step (Session 17). See the Actor-Critic demo.
Stochastic FrozenLake. Flip is_slippery=True and re-run the tabular agent: the optimal policy now hugs walls to avoid being blown into holes — a vivid lesson in planning under transition noise.

9 · References

Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction, 2nd ed. MIT Press. ISBN 9780262039246. — Ch. 6 (Temporal-Difference Learning, the Q-learning update), Ch. 3–4 (MDPs & the Bellman optimality equation), Ch. 9 (on-policy prediction with approximation). The course's compulsory text.
Mnih, V. et al. (2015). "Human-level control through deep reinforcement learning." Nature 518, 529–533. — The DQN paper: replay buffer + target network.
van Hasselt, H., Guez, A. & Silver, D. (2016). "Deep Reinforcement Learning with Double Q-learning." AAAI. — Double DQN.
Wang, Z. et al. (2016). "Dueling Network Architectures for Deep Reinforcement Learning." ICML.
Towers, M. et al. (2024). Gymnasium (the maintained successor to OpenAI Gym) — environment API used above.

← Course Outline · All Demos · DQN demo →

From Tabular Q-Learning to DQNon a Control Task

1 · Overview

2 · The RL setup

3 · Tabular Q-learning (FrozenLake)

Algorithm (pseudocode)

The learning curve

4 · Deep Q-Network (CartPole)

Algorithm (pseudocode)

5 · Results & comparison

6 · Pitfalls & hyperparameters

7 · Mapping to learning outcomes

8 · Extensions

9 · References

From Tabular Q-Learning to DQN
on a Control Task