02 · Markov Decision Processes

A bandit only has one state. An MDP has many. The agent is now in state s, picks action a, transitions to s′ with probability P(s′|s,a), and gets reward r. Repeat. The "Markov" bit: where you go next depends only on where you are now, not how you got there.

MDP  =  ⟨ S, A, P, R, γ ⟩
Return  Gt = rt+1 + γ·rt+2 + γ²·rt+3 + … = Σk γᵏ rt+k+1
State value   vπ(s) = 𝔼π[ Gt | st=s ]
Bellman   vπ(s) = Σa π(a|s) Σs′,r p(s′,r|s,a) [ r + γ vπ(s′) ]
▶ live demo — build a gridworld, watch transitions

Click cells to edit. Use the random policy below — agent picks a uniformly at random and you'll see how the slip probability bends its trajectory.

Episode0
Step t0
Return G0.00
Avg return0.00

Transition for current state

Showing the action distribution from state (s) for each of the 4 actions and the resulting next-state probabilities.

Return chart

Key vocabulary

state sWhat the agent observes. Here: which cell it's in.
action aWhat the agent does. Here: ↑ → ↓ ←.
reward rScalar feedback after each action. Step penalty + terminal bonus.
policy πFunction from state to action distribution. π(a|s).
return GSum of discounted future rewards.
γ discountHow much you care about the far future (0 = myopic, 1 = patient).
v(s)Expected return starting from s.
q(s,a)Expected return after taking a in s.
Why "Markov"? The next state distribution depends only on (s, a), not the entire history. This makes the problem tractable: we only need to learn v(s) or q(s,a), not v(trajectory).

← Bandits  ·  Next: Dynamic Programming →