04 · Monte Carlo & Temporal Difference

No model? Learn from interaction. Monte Carlo waits for a full episode, then updates with the actual return G. TD updates after one step using a bootstrapped guess of future return. SARSA is on-policy, Q-Learning is off-policy.

Monte Carlo: Q(s,a) ← Q(s,a) + α [ G − Q(s,a) ]
SARSA (on-policy): Q(s,a) ← Q(s,a) + α [ r + γ Q(s′,a′) − Q(s,a) ]
Q-Learning (off-policy): Q(s,a) ← Q(s,a) + α [ r + γ max_a′ Q(s′,a′) − Q(s,a) ]
Expected SARSA: Q(s,a) ← Q(s,a) + α [ r + γ Σ_a′ π(a′|s′) Q(s′,a′) − Q(s,a) ]

▶ live demo — cliff walking

Classic environment: bottom row is a cliff (-100). Walking near it is optimal but risky. Q-Learning learns the optimal (cliff-edge) path; SARSA learns a safer detour. Watch the difference.

Algorithm Environment α 0.10 ε 0.10 γ 1.00 Episodes / frame 1

Episode0

Last return—

Avg return (100)—

Steps—

Q values (max over a) + greedy policy

Last episode trajectory

Return per episode

MC vs TD: the trade-off

Monte Carlo	Unbiased estimate of return. High variance. Needs complete episodes. Slow to credit-assign.
TD(0)	Biased (bootstraps off learned values). Low variance. Updates online, every step. Faster in practice.
SARSA vs Q-Learning	SARSA evaluates the policy it's actually executing (ε-greedy → cares about exploration risk → safer path). Q-Learning evaluates the greedy policy regardless of behavior (→ aggressive optimal path).
n-step / TD(λ)	Spectrum between TD(0) and MC. Trade bias for variance.

The cliff teaches the SARSA/Q-Learning difference. With ε=0.1, occasionally the agent picks a random action. Near the cliff that random "down" is fatal. Q-Learning ignores that risk in its target (it uses max), SARSA includes it (it uses the actual next action). Set ε high and the difference becomes dramatic.

← Dynamic Programming · Next: Function Approximation →