03 · Dynamic Programming
When you know the MDP (transitions P and rewards R), you can solve it directly. The Bellman equations become an iterative update — sweep through every state, replace its value with the expected value of the best one-step move, repeat until nothing changes.
Value Iteration: vk+1(s) = maxa Σs′,r p(s′,r|s,a) [ r + γ vk(s′) ]
Policy Iteration: alternate { Policy Eval → vπ } then { Policy Improve → π′(s) = argmaxa q(s,a) }
Policy Iteration: alternate { Policy Eval → vπ } then { Policy Improve → π′(s) = argmaxa q(s,a) }
▶ live demo — sweep the Bellman backup across the grid
Iteration0
Max Δ—
Converged?no
v(start)—
Value function v(s)
Greedy policy π(s)
Δ convergence
Value Iteration vs Policy Iteration
| Value Iter | One Bellman optimality sweep per iteration. Combines eval & improve. Simple. |
| Policy Iter | Full policy evaluation, then policy improvement, then repeat. Each policy improvement is monotonic. Often converges in few outer iterations. |
| Asynchronous DP | Backup states in any order, not full sweeps. Good when state space is huge. |
The cost of knowing the model: DP requires complete knowledge of P and R. Real problems usually don't give you that. Monte Carlo and TD (next pages) drop this requirement — they learn from samples alone.