04 · Monte Carlo & Temporal Difference

No model? Learn from interaction. Monte Carlo waits for a full episode, then updates with the actual return G. TD updates after one step using a bootstrapped guess of future return. SARSA is on-policy, Q-Learning is off-policy.

Monte Carlo:  Q(s,a) ← Q(s,a) + α [ G − Q(s,a) ]
SARSA (on-policy):  Q(s,a) ← Q(s,a) + α [ r + γ Q(s′,a′) − Q(s,a) ]
Q-Learning (off-policy):  Q(s,a) ← Q(s,a) + α [ r + γ maxa′ Q(s′,a′) − Q(s,a) ]
Expected SARSA:  Q(s,a) ← Q(s,a) + α [ r + γ Σa′ π(a′|s′) Q(s′,a′) − Q(s,a) ]
▶ live demo — cliff walking

Classic environment: bottom row is a cliff (-100). Walking near it is optimal but risky. Q-Learning learns the optimal (cliff-edge) path; SARSA learns a safer detour. Watch the difference.

Episode0
Last return
Avg return (100)
Steps

Q values (max over a) + greedy policy

Last episode trajectory

Return per episode

MC vs TD: the trade-off

Monte CarloUnbiased estimate of return. High variance. Needs complete episodes. Slow to credit-assign.
TD(0)Biased (bootstraps off learned values). Low variance. Updates online, every step. Faster in practice.
SARSA vs Q-LearningSARSA evaluates the policy it's actually executing (ε-greedy → cares about exploration risk → safer path). Q-Learning evaluates the greedy policy regardless of behavior (→ aggressive optimal path).
n-step / TD(λ)Spectrum between TD(0) and MC. Trade bias for variance.
The cliff teaches the SARSA/Q-Learning difference. With ε=0.1, occasionally the agent picks a random action. Near the cliff that random "down" is fatal. Q-Learning ignores that risk in its target (it uses max), SARSA includes it (it uses the actual next action). Set ε high and the difference becomes dramatic.

← Dynamic Programming  ·  Next: Function Approximation →