03 · Dynamic Programming

When you know the MDP (transitions P and rewards R), you can solve it directly. The Bellman equations become an iterative update — sweep through every state, replace its value with the expected value of the best one-step move, repeat until nothing changes.

Value Iteration:  vk+1(s) = maxa Σs′,r p(s′,r|s,a) [ r + γ vk(s′) ]
Policy Iteration:  alternate { Policy Eval → vπ } then { Policy Improve → π′(s) = argmaxa q(s,a) }
▶ live demo — sweep the Bellman backup across the grid
Iteration0
Max Δ
Converged?no
v(start)

Value function v(s)

Greedy policy π(s)

Δ convergence

Value Iteration vs Policy Iteration

Value IterOne Bellman optimality sweep per iteration. Combines eval & improve. Simple.
Policy IterFull policy evaluation, then policy improvement, then repeat. Each policy improvement is monotonic. Often converges in few outer iterations.
Asynchronous DPBackup states in any order, not full sweeps. Good when state space is huge.
The cost of knowing the model: DP requires complete knowledge of P and R. Real problems usually don't give you that. Monte Carlo and TD (next pages) drop this requirement — they learn from samples alone.

← MDP  ·  Next: MC / TD →