01 · Multi-Armed Bandits

A row of slot machines. Each has a hidden mean reward. Pull one per timestep. Maximise total reward — but you have to figure out which arm is best by trying them.

The exploration–exploitation trade-off

Pull only the arm you currently think is best (pure exploit) and you may never discover the truly best one. Pull randomly forever (pure explore) and you waste reward. The strategies below all balance these in different ways.

Qt(a) ← Qt-1(a) + α · ( r − Qt-1(a) )   // incremental sample mean
Regret = t · μ* − Σ r  // cost of not always pulling the best arm
▶ live demo — 10-armed gaussian testbed
Step0
Total Reward0.00
Avg Reward0.00
% Optimal0.0%
Regret0.00
True μ (hidden) Estimated Q(a) Last pull Optimal arm

Average reward over time

Cumulative regret

The strategies, side-by-side

GreedyAlways pull argmax Q(a). Fast at exploiting; can lock onto a bad arm forever.
ε-GreedyWith probability ε explore uniformly, else exploit. Simple, effective baseline.
Optimistic InitSet Q₀(a) high. Every untried arm "looks great" → forces early exploration.
UCB1a = argmax Q(a) + c·√(ln t / N(a)). Bonus for under-tried arms. Logarithmic regret.
Thompson SamplingBayesian. Sample θ̂(a) from posterior over μ, pick argmax. Often the best in practice.
Gradient BanditMaintain preferences H(a), softmax → π(a). H ← H + α(r − r̄)(𝟙 − π).
Try this: set Strategy = ε-Greedy with ε = 0 — pure greedy. Run a few times and watch how often it gets stuck on a sub-optimal arm (% Optimal stays low). Now bump ε to 0.1. Regret grows slower.

Next: Markov Decision Processes →