08 · Deep Q-Networks
Q-Learning + neural net = DQN. Replace the Q-table with a network Q(s, a; θ). To prevent divergence: store transitions in a replay buffer, sample mini-batches uniformly, and bootstrap from a slow-moving target network.
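The replay-buffer part of that recipe can be sketched in a few lines. This is a minimal illustration, not the demo's actual code: the names `ReplayBuffer` and `Transition`, the `done` flag, and the fixed capacity are all assumptions made for the example.

```python
import random
from collections import deque
from typing import NamedTuple, Tuple


class Transition(NamedTuple):
    # Field names are illustrative; `done` marks terminal transitions.
    state: Tuple[float, ...]
    action: int
    reward: float
    next_state: Tuple[float, ...]
    done: bool


class ReplayBuffer:
    """Fixed-capacity FIFO store with uniform mini-batch sampling."""

    def __init__(self, capacity: int, seed: int = 0):
        # deque(maxlen=...) silently drops the oldest item when full.
        self._data = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done) -> None:
        self._data.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform sampling without replacement within one mini-batch.
        return self._rng.sample(list(self._data), batch_size)

    def __len__(self) -> int:
        return len(self._data)
```

A training loop would typically call `push` after every environment step and start calling `sample` only once the buffer holds at least one batch worth of transitions.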
Loss: L(θ) = 𝔼_{(s,a,r,s′)∼D} [ ( r + γ max_{a′} Q(s′,a′; θ⁻) − Q(s,a; θ) )² ]
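As a rough sketch, the loss can be evaluated on one mini-batch with NumPy. The function name `dqn_loss` and the array layout are assumptions for illustration. The `dones` mask, which switches off bootstrapping on terminal transitions, is common practice but does not appear in the formula above.

```python
import numpy as np


def dqn_loss(q_online, q_target_next, actions, rewards, dones, gamma=0.99):
    """Mean squared TD error over a mini-batch.

    q_online:      (B, A) array of Q(s, .; theta) from the online network
    q_target_next: (B, A) array of Q(s', .; theta_minus) from the target network
    actions:       (B,) integer actions taken
    rewards:       (B,) rewards
    dones:         (B,) 1.0 where s' is terminal, else 0.0 (assumed extra input)
    """
    batch = np.arange(len(actions))
    # y = r + gamma * max_a' Q(s', a'; theta_minus), no bootstrap at terminals
    targets = rewards + gamma * (1.0 - dones) * q_target_next.max(axis=1)
    td_error = targets - q_online[batch, actions]
    return float(np.mean(td_error ** 2)), targets
```

The targets are returned alongside the loss because a gradient step treats them as constants: only Q(s, a; θ) is differentiated.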
Update: θ ← θ − α · ∇_θ L; every C steps: θ⁻ ← θ
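The update rule and the periodic target sync can be sketched with a linear Q-function standing in for the neural network, so the gradient can be written by hand. Everything here is an assumption made for illustration: the linear form Q(s, a; θ) = θ[a] · s, the names `sgd_step` and `train`, the `done` mask, and the default hyperparameters.

```python
import numpy as np


def sgd_step(theta, theta_target, batch, alpha=0.01, gamma=0.99):
    """One gradient step on the squared TD error for a linear Q-function.

    theta, theta_target: (A, D) weight matrices, Q(s, a) = theta[a] @ s
    batch: (s, a, r, s2, done) with shapes (B, D), (B,), (B,), (B, D), (B,)
    """
    s, a, r, s2, done = batch
    q_next = s2 @ theta_target.T                     # (B, A), uses frozen weights
    y = r + gamma * (1.0 - done) * q_next.max(axis=1)
    q_sa = np.einsum("bd,bd->b", s, theta[a])        # Q(s, a; theta)
    td = y - q_sa
    grad = np.zeros_like(theta)
    # Gradient of mean (y - Q)^2 with y held constant; add.at accumulates
    # correctly when the same action appears several times in the batch.
    np.add.at(grad, a, -2.0 * td[:, None] * s / len(a))
    return theta - alpha * grad


def train(theta, batches, sync_every=100, alpha=0.01, gamma=0.99):
    theta_target = theta.copy()
    for step, batch in enumerate(batches, start=1):
        theta = sgd_step(theta, theta_target, batch, alpha, gamma)
        if step % sync_every == 0:
            theta_target = theta.copy()              # every C steps: copy weights
    return theta, theta_target
```

Copying with `theta.copy()` matters: assigning `theta_target = theta` would alias the two arrays if the update were done in place, which would defeat the purpose of the target network.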
▶ live demo — DQN on Mountain Car (2-D state, 3 actions)
An under-powered car sits in a valley. State = (position, velocity). Three actions: push left / coast / push right. Goal = reach the flag on the right. The engine is too weak to climb the hill directly, so the car has to swing back and forth to build momentum. The agent has to learn that moving away from the flag first is what eventually gets it there.
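To make the momentum problem concrete, here is a sketch of the classic Mountain Car dynamics. The constants (force 0.001, gravity 0.0025, position range [-1.2, 0.6], flag at 0.5, speed limit 0.07) are the textbook values and are assumed here; the demo may use different ones. The helper names `step` and `rollout`, the start state, and the step cap are also illustrative.

```python
import math

MIN_POS, MAX_POS = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POS = 0.5


def step(position, velocity, action):
    """Advance one time step. action: 0 = push left, 1 = coast, 2 = push right."""
    velocity += (action - 1) * 0.001 - 0.0025 * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POS, min(MAX_POS, position))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the left wall absorbs the car's momentum
    return position, velocity, position >= GOAL_POS


def rollout(policy, position=-0.5, velocity=0.0, max_steps=200):
    """Return the number of steps needed to reach the flag, or None on timeout."""
    for t in range(1, max_steps + 1):
        position, velocity, reached = step(position, velocity, policy(position, velocity))
        if reached:
            return t
    return None


def always_right(position, velocity):
    return 2


def pump_energy(position, velocity):
    # Push in the direction the car is already moving.
    return 2 if velocity >= 0 else 0
```

Under these assumed constants, `rollout(always_right)` times out because gravity on the slope is stronger than the engine, while `rollout(pump_energy)` reaches the flag. The second policy is the back-and-forth behaviour the Q-network has to discover on its own.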
Live stats: total steps, episodes, best return, average return G (20 episodes), current ε (starts at 1.00), replay size, loss.
Panel: Environment.
Panel: Q-values heatmap (max over a). x = position, y = velocity. Brighter = higher value. Arrows show greedy action.
Panel: Return per episode & loss. Series: return G, loss (log scale).
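The heatmap panel's data can be computed by evaluating the network on a grid of states and taking the max and argmax over actions. The sketch below assumes a hypothetical `q_fn` that maps a batch of (position, velocity) states to one Q-value per action. The grid size and state ranges are illustrative, and `toy_q` is a stand-in for a trained network.

```python
import numpy as np


def value_grid(q_fn, n_pos=40, n_vel=40,
               pos_range=(-1.2, 0.6), vel_range=(-0.07, 0.07)):
    """Return (values, greedy), each of shape (n_vel, n_pos).

    values[i, j] = max_a Q(s, a) and greedy[i, j] = argmax_a Q(s, a)
    for s = (positions[j], velocities[i]).
    """
    positions = np.linspace(pos_range[0], pos_range[1], n_pos)
    velocities = np.linspace(vel_range[0], vel_range[1], n_vel)
    pp, vv = np.meshgrid(positions, velocities)          # both (n_vel, n_pos)
    states = np.stack([pp.ravel(), vv.ravel()], axis=1)  # (n_vel * n_pos, 2)
    q = q_fn(states)                                     # (n_vel * n_pos, A)
    values = q.max(axis=1).reshape(n_vel, n_pos)
    greedy = q.argmax(axis=1).reshape(n_vel, n_pos)
    return values, greedy


def toy_q(states):
    # Stand-in Q-function: prefers pushing in the direction of motion.
    v = states[:, 1]
    return np.stack([-v, np.zeros_like(v), v], axis=1)
```

`values` is what the brightness encodes and `greedy` is what the arrows encode (0 = left, 1 = coast, 2 = right in this sketch).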
Why these tricks?
| Trick | Why it helps |
| --- | --- |
| Replay buffer | Breaks the correlation between consecutive samples, which otherwise causes wild gradient updates. Also lets us reuse data, which improves sample efficiency. |
| Target network | The TD target is moving: if you bootstrap off the same net you're updating, you chase your own tail. Freezing θ⁻ for C steps stabilises learning. |
| ε decay | Explore early, exploit late. Linear or exponential schedule. |
| Double DQN | Decouple action selection from evaluation: a* = argmax_a Q(s′, a; θ), target = r + γ Q(s′, a*; θ⁻). Reduces over-estimation. |
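Two of these tricks are short enough to sketch directly: the ε schedules and the Double DQN target next to the standard one. Function names, default hyperparameters (`end=0.05`, `decay_steps`, `rate`), and the `done` mask are assumptions for illustration, not values taken from the demo.

```python
import numpy as np


def linear_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Interpolate from start to end over decay_steps, then stay at end."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)


def exponential_epsilon(step, start=1.0, end=0.05, rate=0.999):
    """Decay geometrically towards end."""
    return end + (start - end) * rate ** step


def dqn_target(r, done, q_next_target, gamma=0.99):
    # Standard DQN: the target net both selects and evaluates the next action.
    return r + gamma * (1.0 - done) * q_next_target.max(axis=1)


def double_dqn_target(r, done, q_next_online, q_next_target, gamma=0.99):
    # Double DQN: the online net selects a*, the target net evaluates it.
    a_star = q_next_online.argmax(axis=1)
    batch = np.arange(len(a_star))
    return r + gamma * (1.0 - done) * q_next_target[batch, a_star]
```

Because Q(s′, a*; θ⁻) can never exceed max over a′ of Q(s′, a′; θ⁻), the Double DQN target is never larger than the standard one for the same inputs, which is where the reduction in over-estimation comes from.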
Be patient. Mountain Car is hard for DQN — reward is sparse (only at the flag). A random policy almost never reaches the flag, so the Q-net has nothing to learn from initially. Click Eval rollout periodically to see if the policy has "got it".