02 · Markov Decision Processes
A bandit only has one state. An MDP has many. The agent is now in state s, picks action a, transitions to s′ with probability P(s′|s,a), and gets reward r. Repeat. The "Markov" bit: where you go next depends only on where you are now, not how you got there.
MDP = ⟨ S, A, P, R, γ ⟩
Return Gt = rt+1 + γ·rt+2 + γ²·rt+3 + … = Σk γᵏ rt+k+1
State value vπ(s) = 𝔼π[ Gt | st=s ]
Bellman vπ(s) = Σa π(a|s) Σs′,r p(s′,r|s,a) [ r + γ vπ(s′) ]
Return Gt = rt+1 + γ·rt+2 + γ²·rt+3 + … = Σk γᵏ rt+k+1
State value vπ(s) = 𝔼π[ Gt | st=s ]
Bellman vπ(s) = Σa π(a|s) Σs′,r p(s′,r|s,a) [ r + γ vπ(s′) ]
▶ live demo — build a gridworld, watch transitions
Click cells to edit. Use the random policy below — agent picks a uniformly at random and you'll see how the slip probability bends its trajectory.
Episode0
Step t0
Return G0.00
Avg return0.00
Transition for current state
Showing the action distribution from state (s) for each of the 4 actions and the resulting next-state probabilities.
Return chart
Key vocabulary
| state s | What the agent observes. Here: which cell it's in. |
| action a | What the agent does. Here: ↑ → ↓ ←. |
| reward r | Scalar feedback after each action. Step penalty + terminal bonus. |
| policy π | Function from state to action distribution. π(a|s). |
| return G | Sum of discounted future rewards. |
| γ discount | How much you care about the far future (0 = myopic, 1 = patient). |
| v(s) | Expected return starting from s. |
| q(s,a) | Expected return after taking a in s. |
Why "Markov"? The next state distribution depends only on (s, a), not the entire history. This makes the problem tractable: we only need to learn v(s) or q(s,a), not v(trajectory).