08 · Deep Q-Networks

Q-Learning + neural net = DQN. Replace the Q-table with a network Q(s, a; θ). Two additions keep training from diverging: (1) store transitions in a replay buffer and sample mini-batches from it uniformly; (2) bootstrap from a slow-moving target network with parameters θ⁻.
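A minimal sketch of the replay buffer idea, assuming transitions are stored as (s, a, r, s′, done) tuples. The class name, the capacity of 1000, and the `done` flag are illustrative assumptions, not details taken from this demo.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions (hypothetical helper)."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once it is full.
        self.storage = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement: the batch mixes old and new
        # transitions instead of a run of consecutive, correlated steps.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


# Fill past capacity with dummy transitions, then draw one mini-batch.
buffer = ReplayBuffer(capacity=1000)
for t in range(1500):
    buffer.push((t * 0.001, 0.0), t % 3, 0.0, ((t + 1) * 0.001, 0.0), False)

batch = buffer.sample(32, rng=random.Random(0))
```

Because the buffer is bounded, the oldest 500 dummy transitions have been discarded by the time the batch is drawn.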

Loss: L(θ) = 𝔼_{(s,a,r,s′)∼D} [ ( r + γ max_a′ Q(s′,a′; θ⁻) − Q(s,a; θ) )² ]
Update: θ ← θ − α · ∇_θ L(θ); every C steps: θ⁻ ← θ
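The target and loss can be sketched for one mini-batch in plain Python. This is an illustration under assumptions: the function names are hypothetical, Q-values are passed in as plain lists rather than computed by a network, and the `dones` flag (no bootstrapping on terminal transitions) is a common convention that the formula above does not spell out.

```python
import copy


def td_targets(rewards, next_q_target, dones, gamma=0.99):
    """y = r + gamma * max_a' Q(s', a'; theta_minus) for each transition."""
    targets = []
    for r, q_next, done in zip(rewards, next_q_target, dones):
        # Assumed convention: a terminal transition has no future value.
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(r + bootstrap)
    return targets


def dqn_loss(q_taken, targets):
    """Mean squared TD error between Q(s, a; theta) and the targets."""
    return sum((y - q) ** 2 for q, y in zip(q_taken, targets)) / len(targets)


def maybe_sync(step, online_params, target_params, sync_every=100):
    """Hard update: return a copy of theta every `sync_every` steps, else keep theta_minus."""
    if step % sync_every == 0:
        return copy.deepcopy(online_params)
    return target_params


# Two transitions, three actions each; gamma = 0.5 keeps the arithmetic exact.
rewards = [0.0, 1.0]
next_q_target = [[1.0, 2.0, 0.5], [3.0, 3.0, 3.0]]
dones = [False, True]

targets = td_targets(rewards, next_q_target, dones, gamma=0.5)  # [1.0, 1.0]
loss = dqn_loss([0.0, 1.0], targets)                            # 0.5
```

The gradient step itself is left out; in practice an autodiff library would differentiate the loss with respect to θ only, treating the targets as constants.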

Live demo: DQN on Mountain Car (2-D state, 3 actions)

An under-powered car sits in a valley. State = (position, velocity). Three actions: push left / coast / push right. Goal = reach the flag on the right. The engine is too weak to drive straight up the slope, so the car has to swing back and forth to build momentum, and Q-Learning has to discover that moving away from the flag first pays off later.
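To make the momentum problem concrete, here is a sketch of the dynamics. The constants follow the classic Mountain Car formulation and are an assumption; this demo's own physics may differ. The reward of 1.0 at the flag and 0.0 elsewhere is also an assumed stand-in for the sparse reward described on this page.

```python
import math

# Constants from the classic formulation (assumed, not read from this demo).
MIN_POS, MAX_POS = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POS = 0.5
FORCE, GRAVITY = 0.001, 0.0025


def step(position, velocity, action):
    """Advance one time step. action: 0 = push left, 1 = coast, 2 = push right."""
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POS, min(MAX_POS, position))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the left wall absorbs the car's speed
    reached = position >= GOAL_POS
    reward = 1.0 if reached else 0.0  # assumed sparse reward
    return position, velocity, reward, reached


def rollout(policy, max_steps=200):
    """Return the step at which the flag is reached, or None if it never is."""
    position, velocity = -0.5, 0.0
    for t in range(1, max_steps + 1):
        action = policy(position, velocity)
        position, velocity, reward, reached = step(position, velocity, action)
        if reached:
            return t
    return None


# Pushing right all the time stalls on the slope.
always_right = rollout(lambda p, v: 2)
# Pushing in the direction of travel pumps energy into the swing.
with_momentum = rollout(lambda p, v: 2 if v >= 0 else 0)
```

Under these constants the always-right policy never gets out of the valley, while the hand-written momentum policy reaches the flag within the 200-step limit. That hand-written rule is what the Q-network has to find on its own.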

Settings: Hidden units 32 · α 0.003 · γ 0.99 · ε start→end 1.0 → 0.05 · Batch 32 · Target sync 100 · Steps/frame 5
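The ε start→end setting could be turned into a schedule along these lines. The start and end values come from the settings above; the decay horizon (`decay_steps`, `time_constant`) and all function names are hypothetical, since the page does not say how fast ε falls.

```python
import math
import random


def linear_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Straight line from eps_start to eps_end, then flat."""
    frac = min(step / decay_steps, 1.0)
    return eps_end + (1.0 - frac) * (eps_start - eps_end)


def exponential_epsilon(step, eps_start=1.0, eps_end=0.05, time_constant=3_000):
    """Exponential approach to eps_end; never quite reaches it."""
    return eps_end + (eps_start - eps_end) * math.exp(-step / time_constant)


def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]  # made-up Q-values for the three actions
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)  # always action 1
```

A linear schedule makes the end of exploration easy to predict; an exponential one spends less time at high ε but keeps a long tail of occasional random actions.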

Live readouts: total steps, episodes, best return, average return G over the last 20 episodes, current ε, replay size, loss.

Environment

Q-values heatmap (max over a)

x = position, y = velocity. Brighter = higher value. Arrows show greedy action.
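A heatmap like this can be produced by evaluating the network on a grid of states and keeping, per cell, the largest Q-value and the action that attains it. The sketch below assumes the classic position and velocity ranges, and `toy_q` is a made-up stand-in for the trained network, not the demo's model.

```python
def value_grid(q_fn, n_pos=5, n_vel=5, pos_range=(-1.2, 0.6), vel_range=(-0.07, 0.07)):
    """Return (values, greedy): max_a Q and argmax_a Q for each grid cell.

    Rows are indexed by velocity, columns by position.
    """
    values, greedy = [], []
    for j in range(n_vel):
        vel = vel_range[0] + j * (vel_range[1] - vel_range[0]) / (n_vel - 1)
        row_values, row_actions = [], []
        for i in range(n_pos):
            pos = pos_range[0] + i * (pos_range[1] - pos_range[0]) / (n_pos - 1)
            q = q_fn(pos, vel)
            best = max(range(len(q)), key=q.__getitem__)
            row_values.append(q[best])   # brightness of the cell
            row_actions.append(best)     # direction of the arrow
        values.append(row_values)
        greedy.append(row_actions)
    return values, greedy


def toy_q(pos, vel):
    # Hypothetical Q-function that favours pushing in the direction of travel.
    return [-vel, 0.0, vel]  # [push left, coast, push right]


values, greedy = value_grid(toy_q)
```

With this stand-in, the rows for negative velocity pick "push left" and the rows for positive velocity pick "push right", which is the pattern to look for in the arrows once the real policy has learned to swing.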

Return per episode & loss

Series: return G and loss (log scale).

Why these tricks?

Replay buffer	Breaks correlation between consecutive samples (which causes wild gradient updates). Lets us reuse data — sample efficiency.
Target network	The TD target is moving — if you bootstrap off the same net you're updating, you chase your own tail. Freezing θ⁻ for C steps stabilises learning.
ε decay	Explore early, exploit late. Linear or exponential schedule.
Double DQN	Decouple action selection from evaluation: a* = argmax_a′ Q(s′, a′; θ), target = r + γ Q(s′, a*; θ⁻). Reduces over-estimation.
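The difference between the standard target and the Double DQN target in the last row is small in code. In this sketch the function names and the Q-values are made up for illustration, and the Q-values are passed in as lists rather than computed by networks.

```python
def vanilla_target(r, q_next_target, gamma=0.99):
    """Standard DQN: the target net both picks and scores the next action."""
    return r + gamma * max(q_next_target)


def double_dqn_target(r, q_next_online, q_next_target, gamma=0.99):
    """Double DQN: the online net picks a*, the target net scores it."""
    a_star = max(range(len(q_next_online)), key=q_next_online.__getitem__)
    return r + gamma * q_next_target[a_star]


# Made-up Q-values for s' under the two networks (three actions).
q_online = [1.0, 3.0, 2.0]  # Q(s', . ; theta): prefers action 1
q_target = [4.0, 1.5, 2.0]  # Q(s', . ; theta_minus): over-rates action 0

y_vanilla = vanilla_target(0.0, q_target, gamma=0.5)             # 0.5 * 4.0 = 2.0
y_double = double_dqn_target(0.0, q_online, q_target, gamma=0.5) # 0.5 * 1.5 = 0.75
```

The Double DQN target can never exceed the standard one, because the target net's score for any single action is at most its own maximum. An action that only one of the two networks over-rates no longer inflates the target.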

Be patient. Mountain Car is hard for DQN — reward is sparse (only at the flag). A random policy almost never reaches the flag, so the Q-net has nothing to learn from initially. Click Eval rollout periodically to see if the policy has "got it".
