09 · Evolutionary Methods

RL without gradients. Treat policy parameters as a genome. Evaluate fitness by running episodes. Select the best. Recombine and mutate. Repeat. Surprisingly competitive — OpenAI showed ES rivals TRPO on Atari (2017).

pop_g+1 = mutate( crossover( select( pop_g, fit ) ) )
select: tournament / elitism · mutate: w ← w + σ · 𝒩(0, I)

▶ live demo — evolve a CartPole controller

Each genome is a linear policy: π(s) = sign(w · s). Four-dimensional state (cart pos, cart vel, pole angle, pole vel), two actions (push left/right). Fitness = steps balanced (max 500).

Pop size 40 Elite frac 0.20 Mutation σ 0.30 Crossover p 0.7 Eval reps 3

Generation0

Best fitness—

Avg fitness—

Mean ‖w‖—

Best-of-generation rollout

Population distribution (fitness)

Each dot = one genome. y-axis = fitness; x-axis = projection of weights.

Best, mean, worst fitness per generation

best mean worst

Genetic vs gradient

Pros	No gradient needed. Parallelisable. Works through non-differentiable rewards. Good for hyperparameter search / NAS.
Cons	Sample-inefficient — full episode per evaluation. Scales poorly to huge param vectors without ES tricks (OpenAI ES, NEAT, CMA-ES).
OpenAI ES	Pop = parameter noise. Fitness-weighted update: θ ← θ + α · Σ F_i · ε_i. Equivalent to a stochastic gradient via finite differences.

Try σ. Mutation rate too low = stuck in local optima. Too high = random drift. The sweet spot depends on problem scale — for linear CartPole, σ ≈ 0.3 usually solves in 10-20 generations.

← DQN · Next: Multi-Agent →