08 · Deep Q-Networks

Q-Learning + neural net = DQN. Replace the Q-table with a network Q(s, a; θ). To prevent divergence: store transitions in a replay buffer, sample mini-batches uniformly, and bootstrap from a slow-moving target network.
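
The replay buffer described above can be sketched in a few lines. This is a minimal illustration, not the demo's own code: the class name `ReplayBuffer`, the `(s, a, r, s', done)` tuple layout, and the `done` flag are assumptions made for the example.

```python
import random
from collections import deque
from typing import Deque, List, Optional, Tuple

State = Tuple[float, float]  # (position, velocity)
Transition = Tuple[State, int, float, State, bool]  # (s, a, r, s', done); layout is assumed


class ReplayBuffer:
    """Fixed-capacity FIFO store with uniform random sampling (hypothetical helper)."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        # deque(maxlen=...) drops the oldest transition once the buffer is full
        self._storage: Deque[Transition] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def push(self, transition: Transition) -> None:
        self._storage.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform sampling, without replacement inside one mini-batch
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self) -> int:
        return len(self._storage)
```

Training would typically wait until `len(buffer)` is at least the batch size before calling `sample`, since sampling more items than are stored raises a `ValueError`.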

Loss:  L(θ) = 𝔼_{(s,a,r,s′)∼D} [ ( r + γ max_{a′} Q(s′,a′; θ⁻) − Q(s,a; θ) )² ]
Update:  θ ← θ − α · ∇_θ L(θ);  every C steps:  θ⁻ ← θ
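
To make the loss and update concrete, here is a rough sketch that uses a plain table in place of the network, so that θ is just `theta[s][a]` and the gradient can be written by hand. The function names, the integer state encoding, and the `done` flag (which switches off bootstrapping at terminal states) are assumptions for illustration. A real DQN would compute the same quantities with an autodiff library.

```python
import copy
from typing import List, Sequence, Tuple

Transition = Tuple[int, int, float, int, bool]  # (s, a, r, s_next, done); layout is assumed
QTable = List[List[float]]  # stand-in for the network: theta[s][a] = Q(s, a; theta)


def td_target(theta_minus: QTable, r: float, s_next: int, done: bool, gamma: float) -> float:
    """r + gamma * max_a' Q(s', a'; theta_minus); no bootstrap at terminal states."""
    return r if done else r + gamma * max(theta_minus[s_next])


def dqn_loss(theta: QTable, theta_minus: QTable, batch: Sequence[Transition], gamma: float) -> float:
    """Mean squared TD error over the mini-batch."""
    errors = [td_target(theta_minus, r, s2, d, gamma) - theta[s][a] for s, a, r, s2, d in batch]
    return sum(e * e for e in errors) / len(batch)


def sgd_step(theta: QTable, theta_minus: QTable, batch: Sequence[Transition],
             gamma: float, alpha: float) -> None:
    """theta <- theta - alpha * grad L, treating the target as a constant."""
    n = len(batch)
    # Compute every gradient before touching theta, so all use the pre-update values
    grads = [(s, a, -2.0 * (td_target(theta_minus, r, s2, d, gamma) - theta[s][a]) / n)
             for s, a, r, s2, d in batch]
    for s, a, g in grads:
        theta[s][a] -= alpha * g


def train(theta: QTable, batches: Sequence[Sequence[Transition]],
          gamma: float = 0.99, alpha: float = 0.1, sync_every: int = 100) -> QTable:
    """Run SGD steps and copy theta into theta_minus every `sync_every` steps."""
    theta_minus = copy.deepcopy(theta)
    for step, batch in enumerate(batches, start=1):
        sgd_step(theta, theta_minus, batch, gamma, alpha)
        if step % sync_every == 0:
            theta_minus = copy.deepcopy(theta)  # theta_minus <- theta
    return theta_minus
```

Note that `sgd_step` never writes to `theta_minus`. Only the periodic copy in `train` changes it, and that is what keeps the target fixed between syncs.
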
Live demo: DQN on Mountain Car (2-D state, 3 actions)

An under-powered car sits in a valley. State = (position, velocity). Three actions: push left / coast / push right. Goal = reach the flag on the right. The engine is too weak to climb the hill directly, so the car has to swing back and forth to build momentum, and the agent has to discover that from experience.
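
A small simulation shows why momentum matters. The sketch below assumes the constants and update rule of the classic Mountain Car formulation; the page does not state the demo's physics, so these numbers and the helper names (`step`, `rollout`, the two policies) are assumptions. It compares always pushing right with pushing in the direction of the current velocity.

```python
import math
from typing import Callable, Optional, Tuple

# Classic Mountain Car constants (assumed; the demo may use different values)
MIN_POS, MAX_POS, GOAL_POS = -1.2, 0.6, 0.5
MAX_SPEED = 0.07
FORCE, GRAVITY = 0.001, 0.0025


def step(position: float, velocity: float, action: int) -> Tuple[float, float, bool]:
    """Advance one tick. action: 0 = push left, 1 = coast, 2 = push right."""
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position = max(MIN_POS, min(MAX_POS, position + velocity))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the left wall absorbs the car's momentum
    return position, velocity, position >= GOAL_POS


def rollout(policy: Callable[[float, float], int], max_steps: int = 1000,
            start: Tuple[float, float] = (-0.5, 0.0)) -> Optional[int]:
    """Return the number of steps needed to reach the flag, or None if it is never reached."""
    position, velocity = start
    for t in range(1, max_steps + 1):
        position, velocity, done = step(position, velocity, policy(position, velocity))
        if done:
            return t
    return None


def always_right(position: float, velocity: float) -> int:
    return 2


def follow_momentum(position: float, velocity: float) -> int:
    # Push in the direction the car is already moving
    return 2 if velocity >= 0 else 0
```

Under these assumed dynamics, `rollout(always_right)` returns `None` because the car stalls partway up the slope, while `rollout(follow_momentum)` reaches the flag.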


[Demo panels: Environment · Q-values heatmap (max over a) · Return per episode & loss]

Q-values heatmap: x = position, y = velocity. Brighter = higher value. Arrows show greedy action.

Return per episode & loss: return G, loss on a log scale.

Why these tricks?

Replay buffer: Breaks the correlation between consecutive samples, which otherwise causes wild gradient updates. It also lets us reuse each transition many times, which improves sample efficiency.
Target network: The TD target moves as θ changes. If you bootstrap off the same net you're updating, you chase your own tail. Freezing θ⁻ for C steps stabilises learning.
ε decay: Explore early, exploit late. Use a linear or exponential schedule.
Double DQN: Decouple action selection from evaluation: a* = argmax_{a′} Q(s′, a′; θ), target = r + γ Q(s′, a*; θ⁻). Reduces over-estimation.
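
The last two tricks are short enough to sketch directly. The schedule parameters (`eps_start`, `eps_end`, `decay_steps`, `decay_rate`) and the function names are illustrative assumptions, not the demo's settings. The Q-value arguments are plain lists holding Q(s′, ·) from the online and target networks.

```python
from typing import Sequence


def linear_epsilon(step: int, eps_start: float = 1.0, eps_end: float = 0.05,
                   decay_steps: int = 10_000) -> float:
    """Interpolate from eps_start to eps_end over decay_steps, then hold at eps_end."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)


def exponential_epsilon(step: int, eps_start: float = 1.0, eps_end: float = 0.05,
                        decay_rate: float = 0.999) -> float:
    """Decay geometrically towards eps_end."""
    return eps_end + (eps_start - eps_end) * decay_rate ** step


def dqn_target(q_target_next: Sequence[float], r: float, done: bool, gamma: float) -> float:
    """Vanilla DQN: the target network both selects and evaluates the action."""
    return r if done else r + gamma * max(q_target_next)


def double_dqn_target(q_online_next: Sequence[float], q_target_next: Sequence[float],
                      r: float, done: bool, gamma: float) -> float:
    """Double DQN: the online network selects a*, the target network evaluates it."""
    if done:
        return r
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[a_star]
```

When both targets use the same target-network values, the Double DQN target can never exceed the vanilla one, because Q(s′, a*; θ⁻) is at most max over a′ of Q(s′, a′; θ⁻). That is the mechanism behind the reduced over-estimation.
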
Be patient. Mountain Car is hard for DQN — reward is sparse (only at the flag). A random policy almost never reaches the flag, so the Q-net has nothing to learn from initially. Click Eval rollout periodically to see if the policy has "got it".
