04 · Monte Carlo & Temporal Difference
No model? Learn from interaction. Monte Carlo waits for a full episode, then updates with the actual return G. TD updates after one step using a bootstrapped guess of future return. SARSA is on-policy, Q-Learning is off-policy.
Monte Carlo: Q(s,a) ← Q(s,a) + α [ G − Q(s,a) ]
SARSA (on-policy): Q(s,a) ← Q(s,a) + α [ r + γ Q(s′,a′) − Q(s,a) ]
Q-Learning (off-policy): Q(s,a) ← Q(s,a) + α [ r + γ maxa′ Q(s′,a′) − Q(s,a) ]
Expected SARSA: Q(s,a) ← Q(s,a) + α [ r + γ Σa′ π(a′|s′) Q(s′,a′) − Q(s,a) ]
SARSA (on-policy): Q(s,a) ← Q(s,a) + α [ r + γ Q(s′,a′) − Q(s,a) ]
Q-Learning (off-policy): Q(s,a) ← Q(s,a) + α [ r + γ maxa′ Q(s′,a′) − Q(s,a) ]
Expected SARSA: Q(s,a) ← Q(s,a) + α [ r + γ Σa′ π(a′|s′) Q(s′,a′) − Q(s,a) ]
▶ live demo — cliff walking
Classic environment: bottom row is a cliff (-100). Walking near it is optimal but risky. Q-Learning learns the optimal (cliff-edge) path; SARSA learns a safer detour. Watch the difference.
Episode0
Last return—
Avg return (100)—
Steps—
Q values (max over a) + greedy policy
Last episode trajectory
Return per episode
MC vs TD: the trade-off
| Monte Carlo | Unbiased estimate of return. High variance. Needs complete episodes. Slow to credit-assign. |
| TD(0) | Biased (bootstraps off learned values). Low variance. Updates online, every step. Faster in practice. |
| SARSA vs Q-Learning | SARSA evaluates the policy it's actually executing (ε-greedy → cares about exploration risk → safer path). Q-Learning evaluates the greedy policy regardless of behavior (→ aggressive optimal path). |
| n-step / TD(λ) | Spectrum between TD(0) and MC. Trade bias for variance. |
The cliff teaches the SARSA/Q-Learning difference. With ε=0.1, occasionally the agent picks a random action. Near the cliff that random "down" is fatal. Q-Learning ignores that risk in its target (it uses max), SARSA includes it (it uses the actual next action). Set ε high and the difference becomes dramatic.