06 · REINFORCE — Policy Gradients

Instead of learning a value function and acting greedily with respect to it, learn the policy directly. Parameterise π(a|s; θ), usually as a softmax over action logits, and use the policy gradient theorem to nudge θ in the direction of higher expected return.

Objective:  J(θ) = 𝔼τ∼πθ [ G(τ) ]
Policy gradient theorem:  ∇θ J(θ) = 𝔼τ∼πθ [ Σt Gt · ∇θ log πθ(at | st) ]
REINFORCE update (per episode):  θ ← θ + α · Σt γᵗ · (Gt − b) · ∇θ log πθ(at|st)
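The per-episode update can be sketched in a few lines for a tabular softmax policy. This is a minimal illustration rather than the demo's actual code: the function names (softmax, discounted_returns, reinforce_update), the theta[s][a] layout, and the default α and γ are assumptions made for the example. It relies on the standard identity that, for a softmax over logits, ∂ log π(a|s) / ∂θ[s][k] = 1[k = a] − π(k|s).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def reinforce_update(theta, episode, alpha=0.1, gamma=0.99, baseline=0.0):
    """Apply one per-episode REINFORCE step to theta in place.

    theta:   list of per-state logit lists, theta[s][a] (assumed layout)
    episode: list of (state, action, reward) tuples from one rollout
    """
    returns = discounted_returns([r for _, _, r in episode], gamma)
    # Accumulate the gradient over the whole episode, then apply it once,
    # so every step is evaluated under the policy that generated the data.
    grad = [[0.0] * len(row) for row in theta]
    for t, ((s, a, _), g_t) in enumerate(zip(episode, returns)):
        probs = softmax(theta[s])
        weight = (gamma ** t) * (g_t - baseline)
        for k in range(len(probs)):
            # Gradient of log softmax: 1[k == a] - pi(k|s)
            grad[s][k] += weight * ((1.0 if k == a else 0.0) - probs[k])
    for s, row in enumerate(grad):
        for k, g_k in enumerate(row):
            theta[s][k] += alpha * g_k
    return theta
```

For example, with a single state, logits [0, 0], and one step that takes action 1 for reward 1 (α = 0.5, γ = 1), the logits move to [−0.25, 0.25]: the chosen action is pushed up and the other is pushed down by the same amount.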
Live demo: REINFORCE on a 5-state corridor

Five states in a row, with reward +1 at the rightmost one. Each state has two actions (left/right). The policy is a softmax over 2 logits per state. Watch θ drift towards "always right". The baseline (running mean return) drastically cuts variance.
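A text-only version of the corridor experiment might look like the sketch below. Several details are not specified above and are assumptions here: episodes start in the leftmost state, end on reaching the rightmost state or after 50 steps, moving left in the leftmost state leaves the agent in place, γ = 0.9, α = 0.3, and the baseline is the running mean of the episode return G0 used as a constant for every step. The names run_episode and train are hypothetical.

```python
import math
import random

N_STATES, LEFT, RIGHT = 5, 0, 1
GOAL = N_STATES - 1  # rightmost state; reaching it pays +1 and ends the episode

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def run_episode(theta, rng, max_steps=50):
    """Roll out the current policy from state 0 (assumed start state)."""
    s, episode = 0, []
    for _ in range(max_steps):
        p_right = softmax(theta[s])[RIGHT]
        a = RIGHT if rng.random() < p_right else LEFT
        s_next = min(s + 1, GOAL) if a == RIGHT else max(s - 1, 0)
        r = 1.0 if s_next == GOAL else 0.0
        episode.append((s, a, r))
        s = s_next
        if s == GOAL:
            break
    return episode

def train(episodes=2000, alpha=0.3, gamma=0.9, use_baseline=True, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]  # 2 logits per state
    baseline, history = 0.0, []
    for n in range(1, episodes + 1):
        episode = run_episode(theta, rng)
        # Discounted return G_t for every step, computed backwards.
        g, returns = 0.0, []
        for _, _, r in reversed(episode):
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        b = baseline if use_baseline else 0.0
        grad = [[0.0, 0.0] for _ in range(N_STATES)]
        for t, ((s, a, _), g_t) in enumerate(zip(episode, returns)):
            probs = softmax(theta[s])
            w = (gamma ** t) * (g_t - b)
            for k in (LEFT, RIGHT):
                grad[s][k] += w * ((1.0 if k == a else 0.0) - probs[k])
        for s in range(N_STATES):
            for k in (LEFT, RIGHT):
                theta[s][k] += alpha * grad[s][k]
        history.append(returns[0])
        # Running mean of the episode return, updated after the step.
        baseline += (returns[0] - baseline) / n
    return theta, history

if __name__ == "__main__":
    theta, history = train()
    for s in range(GOAL):
        print(f"pi(right | s={s}) = {softmax(theta[s])[RIGHT]:.3f}")
    print("mean return, first 100:", sum(history[:100]) / 100)
    print("mean return, last 100: ", sum(history[-100:]) / 100)
```

No action is ever taken in the goal state, so its logits stay at zero. Calling train(use_baseline=False) gives the no-baseline run for comparison.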

Demo readouts: Episode, Last return, Avg G (50), Var(G, 50), Baseline b̄.

Policy π(a|s) — softmax of θ

Each column is a state. The blue bar shows π(right|s), purple shows π(left|s).

Last episode trajectory

Animated random rollouts under the current policy.

Return per episode (raw vs moving avg)

Why policy gradients?

Continuous actions: Q-Learning needs maxₐ Q(s, a). With continuous a, that max is hard to compute. Policy gradients just sample an action from π.
Stochastic optimal policies: Some problems require randomised actions (e.g., rock-paper-scissors). Q-Learning's greedy policy is deterministic.
Smooth optimisation: A small change in θ produces a small change in π. With Q-Learning, a tiny change in Q can flip the argmax.
High variance: The downside is that Gt, estimated from sampled returns, is noisy. Hence baselines, advantage estimates, A2C, PPO …
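The smoothness point can be illustrated numerically. In this small sketch the two value vectors are made-up numbers chosen for the example: a nudge of 0.01 reverses the greedy action completely, while the softmax probability of the same action moves by about half a percentage point.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def greedy(values):
    """Index of the largest value, as a greedy (argmax) policy would pick."""
    return max(range(len(values)), key=lambda a: values[a])

before = [1.00, 1.01]  # action 1 is marginally preferred (illustrative values)
after = [1.01, 1.00]   # a 0.01 nudge swaps the order

greedy_before, greedy_after = greedy(before), greedy(after)
p_before, p_after = softmax(before)[1], softmax(after)[1]
shift = abs(p_after - p_before)

print("greedy action:", greedy_before, "->", greedy_after)
print(f"softmax P(action 1): {p_before:.4f} -> {p_after:.4f} (shift {shift:.4f})")
```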
Try toggling the baseline. Without it, the gradient estimate is unbiased but high-variance, so learning is jittery. Subtracting any baseline that depends only on the state (not on the action) keeps the estimate unbiased, and one close to the expected return cuts variance substantially. A2C (next page) takes this further by learning the baseline as a critic.
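The effect of the baseline can be checked exactly on a toy problem. The sketch below assumes a single state with two actions, a uniform softmax policy, and made-up returns of 1 and 2; grad_stats is a hypothetical helper that enumerates both actions to get the exact mean and variance of one component of the gradient estimate (G − b) · ∇θ log π.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_stats(logits, returns, baseline, k=1):
    """Exact mean and variance of the k-th gradient component over a ~ pi."""
    probs = softmax(logits)
    samples = [
        (returns[a] - baseline) * ((1.0 if a == k else 0.0) - probs[k])
        for a in range(len(probs))
    ]
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var

logits = [0.0, 0.0]   # uniform policy over two actions
returns = [1.0, 2.0]  # illustrative return for each action

no_baseline = grad_stats(logits, returns, baseline=0.0)
mean_baseline = grad_stats(logits, returns, baseline=1.5)   # the expected return
poor_baseline = grad_stats(logits, returns, baseline=10.0)  # far from any return

print("b = 0.0 :", no_baseline)    # (0.25, 0.5625)
print("b = 1.5 :", mean_baseline)  # (0.25, 0.0)
print("b = 10.0:", poor_baseline)  # (0.25, 18.0625)
```

The mean is 0.25 in all three cases, which is the unbiasedness claim. The variance, however, depends on the choice: the expected return removes it entirely in this toy case, while a baseline far from the actual returns makes it worse than using none.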
