07 · Actor-Critic (A2C)

REINFORCE uses the full Monte Carlo return G, which is unbiased but has high variance. Actor-critic methods learn a value function v̂(s; w) alongside the policy and use it as a baseline, or go a step further and use it as a one-step bootstrap target. This reduces variance substantially and allows online learning, because an update no longer has to wait for the end of the episode.
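Why a baseline is allowed at all can be shown in one line. The sketch below assumes a discrete action space and a baseline b(s) that depends only on the state (here b(s) = v̂(s; w)); the symbol b is introduced for illustration and does not appear elsewhere in this lesson.

```latex
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ b(s)\, \nabla_\theta \log \pi_\theta(a \mid s) \right]
  = b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)
  = b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)
  = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
  = b(s)\, \nabla_\theta 1
  = 0
```

Subtracting v̂(s_t; w) from the return therefore leaves the expected policy gradient unchanged and only affects its variance. Bias enters only once v̂ is also used inside the target, as in the bootstrapped case.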

Advantage: A_t = r_{t+1} + γ v̂(s_{t+1}; w) − v̂(s_t; w) // the TD error
Actor: θ ← θ + α_θ · A_t · ∇_θ log π_θ(a_t|s_t)
Critic: w ← w + α_w · A_t · ∇_w v̂(s_t; w)
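The three update rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind the demo: the policy is a tabular softmax over `theta[s, a]`, the critic is tabular with v̂(s; w) = `w[s]`, the value of a terminal state is taken to be 0, and the corridor environment and the names `a2c_step` and `run_corridor` are hypothetical. The default step sizes and γ are the ones shown in the demo settings.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return z / z.sum()

def a2c_step(theta, w, s, a, r, s_next, done,
             alpha_theta=0.10, alpha_w=0.20, gamma=0.95):
    """Apply one actor-critic update in place and return the advantage A_t."""
    # Advantage (TD error): r_{t+1} + gamma * v(s_{t+1}) - v(s_t).
    # Assumption: a terminal state has value 0.
    v_next = 0.0 if done else w[s_next]
    advantage = r + gamma * v_next - w[s]

    # Actor: for a tabular softmax, grad of log pi(a|s) w.r.t. theta[s, :]
    # is onehot(a) - pi(.|s).
    grad_log_pi = -softmax(theta[s])
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * advantage * grad_log_pi

    # Critic: for a tabular v(s; w) = w[s], grad of v w.r.t. w[s] is 1.
    w[s] += alpha_w * advantage
    return advantage

def run_corridor(n_states=5, episodes=300, max_steps=100, seed=0):
    """Hypothetical corridor: action 0 = left, 1 = right, reward +1 at the right end."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_states, 2))
    w = np.zeros(n_states)
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            a = rng.choice(2, p=softmax(theta[s]))
            s_next = max(s - 1, 0) if a == 0 else s + 1
            done = s_next == n_states - 1
            r = 1.0 if done else 0.0
            a2c_step(theta, w, s, a, r, s_next, done)  # update at every step
            if done:
                break
            s = s_next
    return theta, w
```

Calling `theta, w = run_corridor()` and then inspecting `softmax(theta[s])[1]` and `w[s]` for each state gives the same two quantities the A2C panel plots, π(right|s) and v̂(s). Note that the actor and the critic share the same A_t, computed before either parameter set is changed.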

[Interactive demo: REINFORCE vs A2C on the same environment, side by side. Default settings: actor α = 0.10, critic α = 0.20, γ = 0.95, 2 episodes per frame. Each panel shows the episode count, the average and variance of the return G over a 50-episode window, and π(right|s); the A2C panel also shows v̂(s). A shared plot tracks the return per episode for both methods.]

Spotting the difference

Variance	A2C's bootstrapped advantage has lower variance than a full Monte Carlo return, so its learning curves are smoother.
Bias	A2C's bootstrapped target is biased whenever v̂ is inaccurate, which is most pronounced early in training. As v̂ improves, the bias shrinks.
Online vs episodic	A2C can update at every step. REINFORCE waits for episode end.
Extensions	GAE (generalised advantage estimation), TRPO, PPO. All build on this idea.
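The variance and bias rows can be checked numerically. The sketch below is an illustration under assumptions that are not part of this lesson: a small random-walk reward process with a reward of -1 per step (so the only randomness is in the transitions), γ = 0.95, and hypothetical helper names throughout. It samples the full return G and the one-step target r + γ v̂(s') from the same start state, once with the exact value function and once with a deliberately wrong v̂ = 0.

```python
import numpy as np

GAMMA, N = 0.95, 5  # states 0..N-1 are non-terminal, state N is terminal

def step(s, rng):
    """Move left (clamped at 0) or right with equal probability; reward is -1."""
    s_next = max(s - 1, 0) if rng.random() < 0.5 else s + 1
    return -1.0, s_next, s_next == N

def true_values():
    """Solve the Bellman equation v = r + gamma * P v for the non-terminal states."""
    P = np.zeros((N, N))
    for s in range(N):
        P[s, max(s - 1, 0)] += 0.5
        if s + 1 < N:
            P[s, s + 1] += 0.5  # the move into the terminal state adds value 0
    return np.linalg.solve(np.eye(N) - GAMMA * P, -np.ones(N))

def mc_return(rng, s=0):
    """Full discounted return G from state s (the REINFORCE-style target)."""
    g, discount, done = 0.0, 1.0, False
    while not done:
        r, s, done = step(s, rng)
        g += discount * r
        discount *= GAMMA
    return g

def td_target(v_hat, rng, s=0):
    """One-step bootstrapped target r + gamma * v_hat(s') (the A2C-style target)."""
    r, s_next, done = step(s, rng)
    return r + (0.0 if done else GAMMA * v_hat[s_next])

def compare(n_samples=2000, seed=0):
    rng = np.random.default_rng(seed)
    v = true_values()
    mc = np.array([mc_return(rng) for _ in range(n_samples)])
    td = np.array([td_target(v, rng) for _ in range(n_samples)])
    td_zero = np.array([td_target(np.zeros(N), rng) for _ in range(n_samples)])
    return {
        "v0": v[0],                                  # exact value of the start state
        "mc": (mc.mean(), mc.var()),                 # unbiased, high variance
        "td_true_v": (td.mean(), td.var()),          # unbiased, low variance
        "td_zero_v": (td_zero.mean(), td_zero.var()),  # low variance, biased
    }
```

Printing `compare()` gives a (mean, variance) pair for each target. With the exact v̂, the bootstrapped target should centre on the same value as the Monte Carlo return with much less spread; with v̂ = 0, every sample of the target is exactly -1, so the variance is zero but the mean is far from the true value of the start state.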

The intuition: instead of asking "was this whole trajectory good?", A2C asks "did this action turn out better or worse than the critic expected?" That focused, per-step signal is a large part of what makes deep RL tractable on hard environments.
