09 · Evolutionary Methods

RL without gradients. Treat policy parameters as a genome. Evaluate fitness by running episodes. Select the best. Recombine and mutate. Repeat. Surprisingly competitive — OpenAI showed ES rivals TRPO on Atari (2017).

popg+1 = mutate( crossover( select( popg, fit ) ) )
select:  tournament / elitism  ·  mutate:  w ← w + σ · 𝒩(0, I)
▶ live demo — evolve a CartPole controller

Each genome is a linear policy: π(s) = sign(w · s). Four-dimensional state (cart pos, cart vel, pole angle, pole vel), two actions (push left/right). Fitness = steps balanced (max 500).

Generation0
Best fitness
Avg fitness
Mean ‖w‖

Best-of-generation rollout

Population distribution (fitness)

Each dot = one genome. y-axis = fitness; x-axis = projection of weights.

Best, mean, worst fitness per generation

best mean worst

Genetic vs gradient

ProsNo gradient needed. Parallelisable. Works through non-differentiable rewards. Good for hyperparameter search / NAS.
ConsSample-inefficient — full episode per evaluation. Scales poorly to huge param vectors without ES tricks (OpenAI ES, NEAT, CMA-ES).
OpenAI ESPop = parameter noise. Fitness-weighted update: θ ← θ + α · Σ Fi · εi. Equivalent to a stochastic gradient via finite differences.
Try σ. Mutation rate too low = stuck in local optima. Too high = random drift. The sweet spot depends on problem scale — for linear CartPole, σ ≈ 0.3 usually solves in 10-20 generations.

← DQN  ·  Next: Multi-Agent →