07 · Actor-Critic (A2C)
REINFORCE uses the full Monte-Carlo return G, which is unbiased but high-variance. Actor-Critic learns a value function v̂(s; w) alongside the policy and uses it as a baseline, or even as a one-step bootstrap target. This substantially reduces variance and allows online learning.
Advantage: A_t = r_{t+1} + γ v̂(s_{t+1}; w) − v̂(s_t; w) // the TD error
Actor: θ ← θ + α_θ · A_t · ∇_θ log π_θ(a_t | s_t)
Critic: w ← w + α_w · A_t · ∇_w v̂(s_t; w)
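The three update rules above can be sketched as a single online step. This is a minimal illustration rather than the demo's actual code: it assumes a tabular softmax policy (one row of logits per state) and a tabular critic, and the names `softmax`, `actor_critic_step`, and the `done` flag for terminal transitions are hypothetical choices made for the example.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def actor_critic_step(theta, w, s, a, r, s_next, done,
                      gamma=0.99, alpha_theta=0.1, alpha_w=0.1):
    """Apply one actor-critic update in place and return the advantage.

    theta: (n_states, n_actions) policy logits (assumed tabular softmax policy)
    w:     (n_states,) state values (assumed tabular critic)
    """
    # Bootstrapped target. Assumption: a terminal state has value zero.
    v_next = 0.0 if done else w[s_next]
    advantage = r + gamma * v_next - w[s]  # A_t, the TD error

    # For a softmax policy, the gradient of log pi(a|s) with respect to the
    # logits of state s is one_hot(a) - pi(.|s).
    grad_log_pi = -softmax(theta[s])
    grad_log_pi[a] += 1.0

    theta[s] += alpha_theta * advantage * grad_log_pi  # actor update
    w[s] += alpha_w * advantage  # critic update (gradient of a tabular v is 1 at s)
    return advantage

# Usage: one transition from state 0, taking action 1, with reward 1.
theta = np.zeros((2, 2))
w = np.zeros(2)
adv = actor_critic_step(theta, w, s=0, a=1, r=1.0, s_next=1, done=False)
```

Because the advantage here is positive, the logit of the chosen action rises and the critic's estimate for state 0 moves up toward the target. With a neural-network policy or critic, the two tabular gradient lines would be replaced by backpropagated gradients.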
[Live demo: REINFORCE vs A2C on the same environment, side by side. Panels: "REINFORCE (Monte Carlo baseline)", showing π(right|s), and "A2C (advantage actor-critic)", showing π(right|s) and v̂(s). Each panel reports the episode count, Avg G (50), and Var(G, 50). A shared plot shows the return per episode for both methods.]
Spotting the difference
| Variance | A2C's bootstrapped advantage has lower variance than a full return — learning curves are smoother. |
| Bias | A2C's bootstrapped target is biased early in training because v̂ is still inaccurate. As v̂ improves, the bias shrinks. |
| Online vs episodic | A2C can update at every step. REINFORCE waits for episode end. |
| Extensions | GAE (generalised advantage estimation), TRPO, PPO. All build on this idea. |
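The variance row can be checked numerically on a toy problem. The sketch below is an assumption-laden illustration, not the demo's environment: every step pays a reward drawn from N(1, 1) over a fixed horizon, and the critic is taken to be exact, so the one-step target carries no bias and only the variance differs.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, horizon, n_episodes = 0.9, 20, 5000

# Hypothetical toy problem: each step's reward is sampled from N(1, 1).
rewards = rng.normal(loc=1.0, scale=1.0, size=(n_episodes, horizon))
discounts = gamma ** np.arange(horizon)

# Monte Carlo return from the first state: discounted sum of all rewards.
mc_returns = rewards @ discounts

# One-step bootstrapped target: first reward plus the discounted value of
# the next state. The critic is assumed exact here, so v_next is the
# expected discounted sum of the remaining horizon - 1 rewards.
v_next = discounts[: horizon - 1].sum()
td_targets = rewards[:, 0] + gamma * v_next

var_mc = mc_returns.var()
var_td = td_targets.var()
print(f"Var(MC return) = {var_mc:.2f}, Var(TD target) = {var_td:.2f}")
```

Both estimators have the same expectation in this setup, but the Monte Carlo return accumulates noise from every step while the bootstrapped target only sees the noise of a single reward. With a learned, imperfect critic the bootstrapped target would also be biased, which is the trade-off described in the Bias row.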
The intuition: instead of asking "was this whole trajectory good?", A2C asks "did this action turn out better or worse than the critic expected?" That focused signal is what makes deep RL tractable on hard environments.
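To make the "better or worse than the critic expected" reading concrete, here is a small worked example of the TD error. The function name `td_advantage` and the numbers are made up for illustration.

```python
def td_advantage(r, v_s, v_next, gamma):
    """One-step advantage estimate: r + gamma * v(s') - v(s)."""
    return r + gamma * v_next - v_s

# The critic expected a value of 2 from the current state.
# Case 1: reward 1 and a next state valued at 4 -> 1 + 0.5 * 4 - 2
better = td_advantage(r=1.0, v_s=2.0, v_next=4.0, gamma=0.5)

# Case 2: no reward and a next state valued at 2 -> 0 + 0.5 * 2 - 2
worse = td_advantage(r=0.0, v_s=2.0, v_next=2.0, gamma=0.5)

print(better, worse)
```

A positive value means the action turned out better than the critic expected, so its probability is pushed up; a negative value pushes it down.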