From Tabular Q-Learning to DQN
on a Control Task
One project, two algorithms, one clean comparison. We solve a discrete gridworld (FrozenLake) with tabular Q-learning using nothing but NumPy, then scale the very same control idea up to continuous state with a Deep Q-Network on CartPole using PyTorch — and measure what we actually gain (and lose) by going deep.
1 · Overview
The single most important transition in a first RL course is the jump from tabular methods — where every state–action value lives in an array you can print — to function approximation, where a neural network estimates those values for states you have never even visited. This project makes that transition concrete on a single control problem.
Goal. Learn a control policy that maximises expected discounted return, first in a tiny discrete world and then in a continuous one, and answer two practical questions a practitioner always asks:
- Sample efficiency — how many environment steps until the policy is good?
- Stability — does training converge smoothly, or oscillate / diverge?
We deliberately keep the algorithmic core identical across both parts — a temporal-difference Q-update with ε-greedy exploration — so that every difference we observe is attributable to the representation (a table vs. a network), not to a different learning rule.
Related interactive demos for each idea below:
2 · The RL setup
Both tasks are finite-horizon episodic control problems modelled as a Markov Decision Process (MDP) — exactly the object introduced in Session 5.
P(s′ | s, a) = Pr{ St+1 = s′ | St = s, At = a }
R(s, a) = 𝔼[ Rt+1 | St = s, At = a ], γ ∈ [0, 1)
The agent seeks a policy π(a | s) maximising the expected discounted return from each state:
vπ(s) = 𝔼π[ Gt | St = s ] (state-value)
qπ(s, a) = 𝔼π[ Gt | St = s, At = a ] (action-value)
The whole project is built around the action-value function q(s, a), because once we have a good estimate of it the greedy policy is trivial: π(s) = argmax_a q(s, a). The optimal q* satisfies the Bellman optimality equation (Session 9):
Exploration. A purely greedy agent that always picks argmax_a q(s,a) may lock onto a bad estimate and never discover a better action. We resolve this exploration–exploitation tension (Session 4, bandits) with an ε-greedy policy and an annealing schedule that explores early and exploits late:
εt = εend + (εstart − εend) · e−t / τ (exponential decay)
3 · Tabular Q-learning (FrozenLake)
FrozenLake-v1 is a 4×4 gridworld: the agent starts top-left, must reach the goal (bottom-right) without stepping into a hole. Reward is +1 on reaching the goal and 0 otherwise — a sparse-reward control task with 16 discrete states and 4 actions. We use the deterministic variant (is_slippery=False) so the learning signal is clean; flip it to True for a stochastic challenge.
Q-learning is an off-policy TD control method: it learns q* while behaving ε-greedily. The update bootstraps the current estimate toward the Bellman target (Session 9):
δt = Rt+1 + γ maxa Q(St+1, a) − Q(St, At) (TD error)
Note the max over next-state actions: the target uses the best action regardless of what ε-greedy actually does next. That is precisely what makes it off-policy and what makes it learn q* rather than qπ.
Algorithm (pseudocode)
for each episode:
s ← env.reset()
repeat until terminal:
a ← ε-greedy(Q, s)
s′, r, done ← env.step(a)
target ← r + γ · (0 if done else maxa′ Q(s′, a′))
Q(s, a) ← Q(s, a) + α · (target − Q(s, a))
s ← s′
decay ε
import numpy as np
import gymnasium as gym
def epsilon_greedy(Q, state, eps, n_actions, rng):
# With prob eps explore uniformly; otherwise act greedily.
if rng.random() < eps:
return rng.integers(n_actions)
# argmax with random tie-breaking so ties don't bias one action
q = Q[state]
return rng.choice(np.flatnonzero(q == q.max()))
def train_q_learning(env, episodes=5000, alpha=0.1, gamma=0.99,
eps_start=1.0, eps_end=0.05, eps_decay=0.001, seed=0):
rng = np.random.default_rng(seed)
n_states = env.observation_space.n
n_actions = env.action_space.n
Q = np.zeros((n_states, n_actions), dtype=np.float64)
returns = []
for ep in range(episodes):
eps = eps_end + (eps_start - eps_end) * np.exp(-eps_decay * ep)
state, _ = env.reset(seed=int(rng.integers(1 << 31)))
done, G = False, 0.0
while not done:
action = epsilon_greedy(Q, state, eps, n_actions, rng)
nxt, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
# Bellman target: no bootstrap past a terminal state.
best_next = 0.0 if terminated else Q[nxt].max()
td_target = reward + gamma * best_next
td_error = td_target - Q[state, action]
Q[state, action] += alpha * td_error # the Q-learning update
state, G = nxt, G + reward
returns.append(G)
return Q, returns
if __name__ == "__main__":
env = gym.make("FrozenLake-v1", is_slippery=False)
Q, returns = train_q_learning(env)
# Greedy policy and its success rate over 1000 fresh episodes.
policy = Q.argmax(axis=1)
wins = 0
for _ in range(1000):
s, _ = env.reset()
done = False
while not done:
s, r, term, trunc, _ = env.step(int(policy[s]))
done = term or trunc
wins += int(r == 1.0)
print(f"greedy success rate: {wins / 1000:.1%}")
The learning curve
Plotting a 100-episode moving average of return shows the classic tabular signature: a long, near-flat stretch while ε is high and the sparse reward is rarely hit, a sharp climb once the agent stumbles onto the goal and the value back-propagates along the path, then a clean plateau at the optimum. Because the table has one independent cell per (s,a), there is no interference — once a cell is correct it stays correct.
On the deterministic map the greedy policy reaches 100% success after roughly 1.5–2k episodes (≈ a few thousand environment steps). The full 4×16 Q-table can be printed and inspected by hand — the entire benefit, and the entire limitation, of tabular RL.
4 · Deep Q-Network (CartPole)
CartPole-v1 has a continuous 4-D state — cart position, cart velocity, pole angle, pole angular velocity — and 2 actions (push left / right). A table is impossible: there are infinitely many states. We replace Q(s, a) with a neural network Q(s, a; θ) that generalises across nearby states (Session 15), and train it to satisfy the same Bellman target.
Naively bootstrapping a network off itself diverges — the famous "deadly triad" of function approximation + bootstrapping + off-policy learning. DQN (Session 21) tames it with two ideas:
| Trick | What it fixes |
|---|---|
| Replay buffer | Consecutive transitions are highly correlated; SGD assumes i.i.d. samples. Storing transitions and sampling random mini-batches decorrelates updates and reuses data (sample efficiency). |
| Target network | The TD target itself depends on θ. Bootstrapping off the same weights you are updating chases a moving target. A frozen copy θ⁻, synced every C steps, holds the target still. |
The loss is the mean-squared TD error over a mini-batch B sampled from the replay buffer D:
L(θ) = (1/|B|) Σi∈B [ yi − Q(si, ai; θ) ]2
θ ← θ − α ∇θ L(θ) ·· every C steps: θ⁻ ← θ
Algorithm (pseudocode)
for each step:
a ← ε-greedy(Q(·; θ), s); s′, r, done ← env.step(a)
store (s, a, r, s′, done) in D; s ← s′
sample mini-batch B ∼ D
y ← r + γ (1 − done) maxa′ Q(s′, a′; θ⁻)
θ ← θ − α ∇θ (y − Q(s, a; θ))2
every C steps: θ⁻ ← θ; decay ε
import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn
import gymnasium as gym
class QNet(nn.Module):
"""Maps a state to one Q-value per action: s -> [Q(s,a0), Q(s,a1)]."""
def __init__(self, n_obs, n_act, hidden=128):
super().__init__()
self.net = nn.Sequential(
nn.Linear(n_obs, hidden), nn.ReLU(),
nn.Linear(hidden, hidden), nn.ReLU(),
nn.Linear(hidden, n_act),
)
def forward(self, x):
return self.net(x)
class ReplayBuffer:
def __init__(self, capacity=50_000):
self.buf = deque(maxlen=capacity)
def push(self, *transition):
self.buf.append(transition) # (s, a, r, s', done)
def sample(self, batch_size):
batch = random.sample(self.buf, batch_size)
s, a, r, s2, d = zip(*batch)
return (torch.as_tensor(np.array(s), dtype=torch.float32),
torch.as_tensor(a, dtype=torch.int64).unsqueeze(1),
torch.as_tensor(r, dtype=torch.float32).unsqueeze(1),
torch.as_tensor(np.array(s2), dtype=torch.float32),
torch.as_tensor(d, dtype=torch.float32).unsqueeze(1))
def __len__(self):
return len(self.buf)
def train_dqn(env_id="CartPole-v1", total_steps=60_000, gamma=0.99,
lr=1e-3, batch_size=64, warmup=1_000, target_sync=500,
eps_start=1.0, eps_end=0.05, eps_decay=10_000, seed=0):
torch.manual_seed(seed); random.seed(seed); np.random.seed(seed)
env = gym.make(env_id)
n_obs = env.observation_space.shape[0]
n_act = env.action_space.n
online = QNet(n_obs, n_act)
target = QNet(n_obs, n_act)
target.load_state_dict(online.state_dict()) # theta_minus <- theta
target.eval()
optim = torch.optim.Adam(online.parameters(), lr=lr)
buffer = ReplayBuffer()
returns, ep_return = [], 0.0
state, _ = env.reset(seed=seed)
for step in range(1, total_steps + 1):
eps = eps_end + (eps_start - eps_end) * np.exp(-step / eps_decay)
# --- act (epsilon-greedy on the online network) ---
if random.random() < eps:
action = env.action_space.sample()
else:
with torch.no_grad():
q = online(torch.as_tensor(state, dtype=torch.float32))
action = int(q.argmax().item())
nxt, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated
buffer.push(state, action, reward, nxt, float(terminated))
state, ep_return = nxt, ep_return + reward
if done:
returns.append(ep_return)
state, _ = env.reset()
ep_return = 0.0
# --- learn (one gradient step once the buffer has warmed up) ---
if len(buffer) >= warmup:
s, a, r, s2, d = buffer.sample(batch_size)
q_sa = online(s).gather(1, a) # Q(s,a; theta)
with torch.no_grad():
max_q_next = target(s2).max(dim=1, keepdim=True).values
y = r + gamma * (1.0 - d) * max_q_next # TD target via theta_minus
loss = nn.functional.smooth_l1_loss(q_sa, y) # Huber: robust to outliers
optim.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(online.parameters(), 10.0)
optim.step()
# --- periodically sync the target network ---
if step % target_sync == 0:
target.load_state_dict(online.state_dict())
return online, returns
if __name__ == "__main__":
net, returns = train_dqn()
last100 = np.mean(returns[-100:])
print(f"mean return over last 100 episodes: {last100:.1f} (solved = 475)")
(1 − terminated), not (1 − done). CartPole's truncated flag fires at the 500-step time limit — but the pole has not fallen, so the value of the next state is genuinely non-zero. Zeroing it out there teaches the agent the episode "ends badly" at 500 steps, which silently caps performance. Bootstrap past truncation, stop only at true termination.5 · Results & comparison
CartPole's return is the number of steps the pole stays up (max 500); the task is considered "solved" at a 100-episode average ≥ 475. The DQN reward curve is far noisier than the tabular one — long stretches of improvement punctuated by sudden collapses where a stale target or an unlucky batch destabilises the value estimates, followed by recovery.
| Dimension | Tabular Q-learning | DQN |
|---|---|---|
| Environment | FrozenLake (16 states, discrete) | CartPole (4-D continuous state) |
| Representation | Q-table, 16×4 = 64 numbers | MLP, ≈ 17k parameters |
| Generalisation | None — each (s,a) independent | Yes — shares structure across states |
| Sample efficiency | High per state, but cannot scale beyond tiny state spaces | Lower per step; replay reuses data to compensate |
| Stability | Provably converges (under standard step-size conditions) | No convergence guarantee; needs replay + target net + clipping |
| Learning curve | Monotone-ish, clean plateau | Noisy, prone to transient collapses |
| Steps to "solve" | ≈ 2–4k env steps (det. map) | ≈ 30–50k env steps |
| Wall-clock / step | Microseconds (array write) | Milliseconds (forward + backward pass) |
| Interpretability | Full — print the table | Opaque — values are network outputs |
The headline. Going deep buys you scalability to continuous / high-dimensional states — the only reason to pay for it — at the cost of stability and sample efficiency. On a problem small enough for a table, a table wins on every axis. The art of practical RL is knowing which regime you are in.
6 · Pitfalls & hyperparameters
- ε decays too fast. If exploration collapses before the agent ever reaches the goal/keeps the pole up, it has nothing to learn from. Sparse-reward FrozenLake especially needs a long exploratory phase.
- Learning rate too high (DQN). Large
lrwith a moving target diverges to NaN value estimates. Pair a modestlr=1e-3with gradient clipping and a Huber loss. - Target sync too frequent. Small
Creintroduces the moving-target instability; too largeCmakes targets stale and learning sluggish.C ≈ 500–1000steps is a sane default for CartPole. - Replay buffer too small. A tiny buffer keeps samples correlated and recent, defeating its purpose; too large can over-weight obsolete early-policy transitions. 10k–100k is typical.
- Forgetting to detach the target. Gradients must not flow through
max_a' Q(s', a'; θ⁻). Thetorch.no_grad()block (or.detach()) is mandatory — omitting it is the single most common DQN bug. - Bootstrapping past true terminals. Multiply the next-state value by
(1 − terminated)so a terminal state contributes only its immediate reward. - Over-estimation bias. The
maxoperator systematically over-estimates Q-values. Double DQN (see §8) fixes this cheaply. - γ too low. A small discount makes the agent myopic and unable to value long-horizon goals like reaching a distant FrozenLake cell. Use
γ ≈ 0.99.
7 · Mapping to learning outcomes
This project deliberately exercises a connected slice of the syllabus rather than a single session:
| Course session | Exercised in this project |
|---|---|
| S4 · Multi-Armed Bandits | ε-greedy exploration and the exploration–exploitation trade-off, reused in both agents. |
| S5–S6 · MDPs | Both tasks framed as ⟨𝒮,𝒜,P,R,γ⟩; states, actions, discounting. |
| S7 · Dynamic Programming | Bellman optimality equation — the fixed point both methods approximate. |
| S9 · Temporal Differences | The Q-learning TD update and TD error δ are the algorithmic core. |
| S10 · A First Practice | End-to-end implementation + comparison of methods on a task — exactly this deliverable. |
| S11 · Learning & Planning | Both agents are model-free (learn from samples, no transition model). |
| S13 · RL Frameworks | Gymnasium environment API (reset/step); PyTorch for the network. |
| S15 · Approximation Methods | The leap from a table to an ANN value approximator Q(s,a;θ). |
| S21 · Deep RL Practice | Replay buffer + target network + the DQN loss, end to end. |
Per the syllabus's assignment expectations (Session 10/21), the deliverable is a Python implementation plus an analysis of the results — the comparison table and curve discussion in §5 are that analysis.
8 · Extensions
- Double DQN. Decouple action selection from evaluation to cut over-estimation:
a* = argmaxa′ Q(s′, a′; θ); y = r + γ (1−done) Q(s′, a*; θ⁻)A two-line change: pick the argmax with the online net, read its value from the target net.
- Dueling DQN. Split the head into a state-value stream V(s) and an advantage stream A(s,a), then recombine as
Q = V + (A − mean_a A). Learns which states are valuable independently of the action. - Prioritised replay. Sample transitions with probability proportional to |TD error| so surprising transitions are replayed more often.
- Policy gradient (REINFORCE). Instead of learning values and acting greedily, parameterise π(a|s;θ) directly and ascend ∇θ 𝔼[G]. See the REINFORCE demo and Session 16.
- A2C / Actor-Critic. Combine a policy (actor) with a learned value baseline (critic) to slash gradient variance — the natural next step (Session 17). See the Actor-Critic demo.
- Stochastic FrozenLake. Flip
is_slippery=Trueand re-run the tabular agent: the optimal policy now hugs walls to avoid being blown into holes — a vivid lesson in planning under transition noise.
9 · References
- Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction, 2nd ed. MIT Press. ISBN 9780262039246. — Ch. 6 (Temporal-Difference Learning, the Q-learning update), Ch. 3–4 (MDPs & the Bellman optimality equation), Ch. 9 (on-policy prediction with approximation). The course's compulsory text.
- Mnih, V. et al. (2015). "Human-level control through deep reinforcement learning." Nature 518, 529–533. — The DQN paper: replay buffer + target network.
- van Hasselt, H., Guez, A. & Silver, D. (2016). "Deep Reinforcement Learning with Double Q-learning." AAAI. — Double DQN.
- Wang, Z. et al. (2016). "Dueling Network Architectures for Deep Reinforcement Learning." ICML.
- Towers, M. et al. (2024). Gymnasium (the maintained successor to OpenAI Gym) — environment API used above.