09 · Evolutionary Methods
RL without gradients. Treat policy parameters as a genome. Evaluate fitness by running episodes. Select the best. Recombine and mutate. Repeat. Surprisingly competitive — OpenAI showed ES rivals TRPO on Atari (2017).
popg+1 = mutate( crossover( select( popg, fit ) ) )
select: tournament / elitism · mutate: w ← w + σ · 𝒩(0, I)
select: tournament / elitism · mutate: w ← w + σ · 𝒩(0, I)
▶ live demo — evolve a CartPole controller
Each genome is a linear policy: π(s) = sign(w · s). Four-dimensional state (cart pos, cart vel, pole angle, pole vel), two actions (push left/right). Fitness = steps balanced (max 500).
Generation0
Best fitness—
Avg fitness—
Mean ‖w‖—
Best-of-generation rollout
Population distribution (fitness)
Each dot = one genome. y-axis = fitness; x-axis = projection of weights.
Best, mean, worst fitness per generation
best
mean
worst
Genetic vs gradient
| Pros | No gradient needed. Parallelisable. Works through non-differentiable rewards. Good for hyperparameter search / NAS. |
| Cons | Sample-inefficient — full episode per evaluation. Scales poorly to huge param vectors without ES tricks (OpenAI ES, NEAT, CMA-ES). |
| OpenAI ES | Pop = parameter noise. Fitness-weighted update: θ ← θ + α · Σ Fi · εi. Equivalent to a stochastic gradient via finite differences. |
Try σ. Mutation rate too low = stuck in local optima. Too high = random drift. The sweet spot depends on problem scale — for linear CartPole, σ ≈ 0.3 usually solves in 10-20 generations.