06 · REINFORCE — Policy Gradients
Instead of learning a value function and acting greedily with respect to it, learn the policy directly. Parameterise π(a|s; θ), usually as a softmax over per-action logits, and use the policy gradient theorem to nudge θ in the direction of higher expected return.
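Objective: J(θ) = 𝔼_{τ∼π_θ} [ G(τ) ]
Policy gradient theorem: ∇_θ J = 𝔼_τ [ Σ_t G_t ∇_θ log π_θ(a_t | s_t) ]
REINFORCE update (per episode): θ ← θ + α · Σ_t γᵗ · (G_t − b) · ∇_θ log π_θ(a_t | s_t)
Here G_t is the return from step t onward and b is a baseline that does not depend on the action. The γᵗ factor applies when the objective is the discounted return from the start state.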
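The two ingredients of the update, the returns G_t and the score ∇θ log πθ(a|s), can be sketched for a tabular softmax policy as follows. This is a minimal illustration, not the demo's actual code: the function names are hypothetical, and it assumes the policy at each state is a softmax over that state's own logits, in which case the gradient of log π(a|s) with respect to those logits is one_hot(a) − π(·|s).

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    g = -softmax(logits)
    g[action] += 1.0
    return g

def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode.

    Assumption: rewards[t] is the reward received after the action at step t.
    """
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G
```

The components of the score always sum to zero, so raising one action's logit necessarily lowers the probability of the others.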
▶ live demo — REINFORCE on a 5-state corridor
Five states in a row. Reward +1 on reaching the rightmost state. Each state has two actions (left/right). The policy is a softmax over two logits per state. Watch θ drift towards "always right". The baseline (running mean return) drastically cuts variance.
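For readers who want to reproduce the demo offline, here is a minimal sketch of REINFORCE on such a corridor. Several details are assumptions, since the page does not state them: episodes start in the leftmost state, moving left at the left wall leaves the state unchanged, an episode ends on reaching the rightmost state or after `max_steps` steps, γ = 0.9, and the baseline is the running mean of the episode return G_0 over all past episodes. All names are hypothetical.

```python
import numpy as np

N_STATES, LEFT, RIGHT = 5, 0, 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def run_episode(theta, rng, max_steps=50):
    """Roll out one episode from state 0 (assumed start state)."""
    s, states, actions, rewards = 0, [], [], []
    for _ in range(max_steps):
        a = int(rng.choice(2, p=softmax(theta[s])))
        s_next = min(s + 1, N_STATES - 1) if a == RIGHT else max(s - 1, 0)
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
        if r > 0:  # reached the rightmost state: episode over
            break
    return states, actions, rewards

def train(episodes=1000, alpha=0.1, gamma=0.9, use_baseline=True, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, 2))  # two logits per state
    baseline = 0.0                   # running mean of G_0 (assumed form)
    for ep in range(episodes):
        states, actions, rewards = run_episode(theta, rng)
        # Discounted returns G_t, computed backwards.
        G, running = np.zeros(len(rewards)), 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            G[t] = running
        b = baseline if use_baseline else 0.0
        # Accumulate sum_t gamma^t (G_t - b) grad log pi(a_t | s_t).
        grad = np.zeros_like(theta)
        for t, (s, a) in enumerate(zip(states, actions)):
            score = -softmax(theta[s])
            score[a] += 1.0          # one_hot(a) - pi(.|s)
            grad[s] += gamma**t * (G[t] - b) * score
        theta += alpha * grad
        # Update the baseline after the step, so b never depends on the
        # episode it is applied to.
        baseline += (G[0] - baseline) / (ep + 1)
    return theta
```

Calling `train(use_baseline=False)` gives the no-baseline variant for comparison. The last row of `theta` is never updated because the rightmost state is terminal.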
Readouts: Episode · Last return · Avg G (50) · Var(G, 50) · Baseline b̄
Policy π(a|s) — softmax of θ
Each column is a state. The blue bar shows π(right|s), purple shows π(left|s).
Last episode trajectory
Animated random rollouts under the current policy.
Return per episode (raw vs moving avg)
Why policy gradients?
| Continuous actions | Q-Learning needs max_a Q(s, a). With continuous a, that max is hard to compute. Policy gradients just sample from π. |
| Stochastic optimal policies | Some problems require randomised actions (e.g., rock-paper-scissors). Q-Learning converges to deterministic policies. |
| Smooth optimisation | Small change in θ → small change in π. With Q-Learning, a tiny change in Q can flip the argmax. |
| High variance | The downside: G_t, estimated from sampled returns, is noisy. Hence baselines, advantage, A2C, PPO … |
Try toggling the baseline. Without it, the gradient estimate is unbiased but high-variance, so learning is jittery. Subtracting any baseline that depends only on the state (not on the action) keeps the estimate unbiased, and a well-chosen one, such as the running mean return used here, cuts variance. A2C (next page) takes this further by learning the baseline as a critic.
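Why a state-only baseline leaves the gradient unbiased can be seen from a short identity, sketched here for a discrete action set (for continuous actions the sum becomes an integral). The expected contribution of the baseline term at any state s is zero because the action probabilities always sum to 1:

```latex
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}
  \big[\, b(s)\, \nabla_\theta \log \pi_\theta(a \mid s) \,\big]
= b(s) \sum_a \pi_\theta(a \mid s)\,
  \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}
= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
= b(s)\, \nabla_\theta 1
= 0
```

The step that pulls b(s) out of the sum is exactly where action-independence is needed; a baseline that depended on a could not be factored out and would generally bias the estimate.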