06 · REINFORCE — Policy Gradients

Instead of learning a value function and acting greedily with respect to it, learn the policy directly. Parameterise π(a|s; θ), usually as a softmax over per-action logits, and use the policy gradient theorem to nudge θ in the direction of higher expected return.
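A minimal sketch of a softmax policy over a list of logits, using only the Python standard library. The helper names (`softmax`, `sample_action`, `grad_log_softmax`) are illustrative and not taken from the demo's code. The last helper relies on the standard identity that the gradient of log π(a) with respect to the logits is one_hot(a) − π.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng):
    """Draw an action index from the softmax distribution."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - probs."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]
```

With equal logits the policy is uniform, and the gradient of the log-probability raises the chosen logit while lowering the others by the same total amount.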

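Objective: J(θ) = 𝔼_{τ∼π_θ} [ G(τ) ], where G(τ) = G_0 is the discounted return of the whole episode
Policy gradient theorem: ∇_θ J = 𝔼_τ [ Σ_t γᵗ G_t ∇_θ log π_θ(a_t | s_t) ], where G_t = Σ_{k≥t} γ^{k−t} r_{k+1} is the discounted return from step t
REINFORCE update (per episode): θ ← θ + α · Σ_t γᵗ · (G_t − b) · ∇_θ log π_θ(a_t|s_t), where b is a baseline that does not depend on a_t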
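The per-episode update above can be sketched for tabular logits `theta[s][a]` as follows. The function names, and the convention that `rewards[t]` is the reward received after taking `actions[t]`, are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discounted_returns(rewards, gamma):
    """Compute G_t = rewards[t] + gamma * G_{t+1}, working backwards."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def reinforce_update(theta, states, actions, rewards, alpha, gamma, baseline=0.0):
    """Apply one per-episode REINFORCE step in place and return theta."""
    returns = discounted_returns(rewards, gamma)
    # Accumulate the whole gradient first so every term uses the pre-update policy.
    grad = [[0.0] * len(row) for row in theta]
    for t, (s, a, g) in enumerate(zip(states, actions, returns)):
        probs = softmax(theta[s])
        weight = (gamma ** t) * (g - baseline)
        for i, p in enumerate(probs):
            # d/d theta[s][i] of log pi(a|s) is one_hot(a)[i] - pi(i|s).
            grad[s][i] += weight * ((1.0 if i == a else 0.0) - p)
    for s, row in enumerate(grad):
        for i, g_si in enumerate(row):
            theta[s][i] += alpha * g_si
    return theta
```

For a one-step episode with reward 1 and uniform initial policy, the chosen action's logit rises by α/2 and the other falls by α/2.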

▶ live demo — REINFORCE on a 5-state corridor

Five states in a row, with reward +1 at the rightmost state. Each state has two actions (left/right), and the policy is a softmax over 2 logits per state. Watch θ drift towards "always right". The baseline (running mean return) typically cuts the variance of the gradient estimate substantially.
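A rough, self-contained sketch of the corridor experiment described above. The text does not say where episodes start, what happens at the left wall, or how long an episode may run, so the leftmost start, the stay-in-place wall, and the 50-step cap are all assumptions. The baseline is taken to be the running mean of episode returns G_0. This is not the demo's actual source.

```python
import math
import random

N_STATES = 5        # corridor length, from the text
LEFT, RIGHT = 0, 1
MAX_STEPS = 50      # assumption: cap on episode length


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_episode(theta, rng):
    """Roll out one episode from the (assumed) leftmost start state."""
    s, states, actions, rewards = 0, [], [], []
    for _ in range(MAX_STEPS):
        probs = softmax(theta[s])
        a = RIGHT if rng.random() < probs[RIGHT] else LEFT
        # Assumption: moving left at the left wall leaves the agent in place.
        s_next = min(s + 1, N_STATES - 1) if a == RIGHT else max(s - 1, 0)
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
        if r > 0.0:   # reaching the rightmost state ends the episode
            break
    return states, actions, rewards


def train(episodes=500, alpha=0.10, gamma=0.99, use_baseline=True, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]
    baseline, returns_log = 0.0, []
    for ep in range(1, episodes + 1):
        states, actions, rewards = run_episode(theta, rng)
        # Discounted returns G_t, computed backwards.
        g, returns = 0.0, [0.0] * len(rewards)
        for t in reversed(range(len(rewards))):
            g = rewards[t] + gamma * g
            returns[t] = g
        b = baseline if use_baseline else 0.0
        grad = [[0.0, 0.0] for _ in range(N_STATES)]
        for t, (s, a) in enumerate(zip(states, actions)):
            probs = softmax(theta[s])
            w = (gamma ** t) * (returns[t] - b)
            for i in (LEFT, RIGHT):
                grad[s][i] += w * ((1.0 if i == a else 0.0) - probs[i])
        for s in range(N_STATES):
            for i in (LEFT, RIGHT):
                theta[s][i] += alpha * grad[s][i]
        # Running mean of episode returns, updated after it has been used.
        baseline += (returns[0] - baseline) / ep
        returns_log.append(returns[0])
    return theta, returns_log
```

Calling `train(use_baseline=False)` and `train(use_baseline=True)` with the same seed and comparing the two `returns_log` lists is one way to mirror the toggle in the demo.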

Controls: Environment · α (LR) 0.10 · γ 0.99 · use baseline (avg G) · Episodes/frame 1

Readouts: Episode · Last return · Avg G (50) · Var(G, 50) · Baseline b̄

Policy π(a|s) — softmax of θ

Each column is a state. The blue bar shows π(right|s), purple shows π(left|s).

Last episode trajectory

Animated random rollouts under the current policy.

Return per episode (raw vs moving avg)

Why policy gradients?

Continuous actions	Q-Learning needs `max_a Q`. With continuous a, that max is itself an optimisation problem. Policy gradients just sample from π.
Stochastic optimal policies	Some problems require randomised actions (e.g., rock-paper-scissors). The greedy policy that Q-Learning produces is deterministic.
Smooth optimisation	Small change in θ → small change in π. With Q-Learning, a tiny change in Q can flip the argmax.
High variance	The downside: G_t estimated from sampled returns is noisy. Hence baselines, advantage estimates, A2C, PPO …
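To make the "smooth optimisation" row concrete, here is a small sketch with made-up numbers: shifting two action values by 0.001 flips the greedy action, while a softmax over the same numbers (treated as logits) moves by well under 0.001. The values and helper names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(q_values):
    """Index of the largest action value (first one wins on ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical values before and after a 0.001 perturbation.
before = [0.500, 0.499]
after = [0.499, 0.500]

greedy_before = greedy(before)   # action 0
greedy_after = greedy(after)     # action 1: the greedy policy flips entirely

# Change in the probability of action 0 under a softmax policy.
prob_shift = abs(softmax(before)[0] - softmax(after)[0])
```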

Try toggling the baseline. Without it, the gradient estimate is unbiased but high-variance, so learning is jittery. Subtracting any baseline that does not depend on the action keeps the estimate unbiased, and a well-chosen one (such as the average return) cuts its variance. A2C (next page) takes this further by learning the baseline as a critic.
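Why a baseline that ignores the action leaves the gradient unbiased: a short derivation for a discrete action set, using only the fact that the action probabilities sum to 1. The notation b(s) for a state-dependent baseline is introduced here for generality; a running-mean b, as in the demo, is the special case that ignores the state.

```latex
\begin{align*}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}
  \big[\, b(s)\, \nabla_\theta \log \pi_\theta(a \mid s) \,\big]
  &= b(s) \sum_a \pi_\theta(a \mid s)\,
     \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} \\
  &= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) \\
  &= b(s)\, \nabla_\theta 1 = 0 .
\end{align*}
```

The subtracted term therefore has zero mean, so the baseline changes only the variance of the estimate, not its expectation.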
