03 · Dynamic Programming

When you know the MDP (transitions P and rewards R), you can solve it directly. The Bellman equations become an iterative update — sweep through every state, replace its value with the expected value of the best one-step move, repeat until nothing changes.

Value Iteration: v_k+1(s) = max_a Σ_s′,r p(s′,r|s,a) [ r + γ v_k(s′) ]
Policy Iteration: alternate { Policy Eval → v_π } then { Policy Improve → π′(s) = argmax_a q(s,a) }

▶ live demo — sweep the Bellman backup across the grid

Algorithm γ 0.95 Slip 0.10 Step reward -0.04 Edit

Iteration0

Max Δ—

Converged?no

v(start)—

Value function v(s)

Greedy policy π(s)

Δ convergence

Value Iteration vs Policy Iteration

Value Iter	One Bellman optimality sweep per iteration. Combines eval & improve. Simple.
Policy Iter	Full policy evaluation, then policy improvement, then repeat. Each policy improvement is monotonic. Often converges in few outer iterations.
Asynchronous DP	Backup states in any order, not full sweeps. Good when state space is huge.

The cost of knowing the model: DP requires complete knowledge of P and R. Real problems usually don't give you that. Monte Carlo and TD (next pages) drop this requirement — they learn from samples alone.

← MDP · Next: MC / TD →