02 · Markov Decision Processes

A bandit only has one state. An MDP has many. The agent is now in state s, picks action a, transitions to s′ with probability P(s′|s,a), and gets reward r. Repeat. The "Markov" bit: where you go next depends only on where you are now, not how you got there.

MDP = ⟨ S, A, P, R, γ ⟩
Return G_t = r_t+1 + γ·r_t+2 + γ²·r_t+3 + … = Σ_k γᵏ r_t+k+1
State value vπ(s) = 𝔼_π[ G_t | s_t=s ]
Bellman vπ(s) = Σ_a π(a|s) Σ_s′,r p(s′,r|s,a) [ r + γ vπ(s′) ]

▶ live demo — build a gridworld, watch transitions

Edit mode Rows 5 Cols 7 Slip prob. 0.10 γ 0.95 Step reward -0.04

Click cells to edit. Use the random policy below — agent picks a uniformly at random and you'll see how the slip probability bends its trajectory.

Speed

Episode0

Step t0

Return G0.00

Avg return0.00

Transition for current state

Showing the action distribution from state (s) for each of the 4 actions and the resulting next-state probabilities.

Return chart

Key vocabulary

state s	What the agent observes. Here: which cell it's in.
action a	What the agent does. Here: ↑ → ↓ ←.
reward r	Scalar feedback after each action. Step penalty + terminal bonus.
policy π	Function from state to action distribution. `π(a\|s)`.
return G	Sum of discounted future rewards.
γ discount	How much you care about the far future (0 = myopic, 1 = patient).
v(s)	Expected return starting from s.
q(s,a)	Expected return after taking a in s.

Why "Markov"? The next state distribution depends only on (s, a), not the entire history. This makes the problem tractable: we only need to learn v(s) or q(s,a), not v(trajectory).

← Bandits · Next: Dynamic Programming →