01 · Multi-Armed Bandits
A row of slot machines. Each has a hidden mean reward. Pull one per timestep. Maximise total reward — but you have to figure out which arm is best by trying them.
The exploration–exploitation trade-off
Pull only the arm you currently think is best (pure exploit) and you may never discover the truly best one. Pull randomly forever (pure explore) and you waste reward. The strategies below all balance these in different ways.
Qt(a) ← Qt-1(a) + α · ( r − Qt-1(a) ) // incremental sample mean
Regret = t · μ* − Σ r // cost of not always pulling the best arm
Regret = t · μ* − Σ r // cost of not always pulling the best arm
▶ live demo — 10-armed gaussian testbed
Step0
Total Reward0.00
Avg Reward0.00
% Optimal0.0%
Regret0.00
True μ (hidden)
Estimated Q(a)
Last pull
Optimal arm
Average reward over time
Cumulative regret
The strategies, side-by-side
| Greedy | Always pull argmax Q(a). Fast at exploiting; can lock onto a bad arm forever. |
| ε-Greedy | With probability ε explore uniformly, else exploit. Simple, effective baseline. |
| Optimistic Init | Set Q₀(a) high. Every untried arm "looks great" → forces early exploration. |
| UCB1 | a = argmax Q(a) + c·√(ln t / N(a)). Bonus for under-tried arms. Logarithmic regret. |
| Thompson Sampling | Bayesian. Sample θ̂(a) from posterior over μ, pick argmax. Often the best in practice. |
| Gradient Bandit | Maintain preferences H(a), softmax → π(a). H ← H + α(r − r̄)(𝟙 − π). |
Try this: set
Strategy = ε-Greedy with ε = 0 — pure greedy. Run a few times and watch how often it gets stuck on a sub-optimal arm (% Optimal stays low). Now bump ε to 0.1. Regret grows slower.