01 · Multi-Armed Bandits

A row of slot machines. Each has a hidden mean reward. Pull one per timestep. Maximise total reward — but you have to figure out which arm is best by trying them.

The exploration–exploitation trade-off

Pull only the arm you currently think is best (pure exploit) and you may never discover the truly best one. Pull randomly forever (pure explore) and you waste reward. The strategies below all balance these in different ways.

Q_t(a) ← Q_t-1(a) + α · ( r − Q_t-1(a) ) // incremental sample mean
Regret = t · μ* − Σ r // cost of not always pulling the best arm

▶ live demo — 10-armed gaussian testbed

Strategy Arms 10 ε 0.10 UCB c 2.0 Step size α 0.10 Speed 20

Step0

Total Reward0.00

Avg Reward0.00

% Optimal0.0%

Regret0.00

True μ (hidden) Estimated Q(a) Last pull Optimal arm

Average reward over time

Cumulative regret

The strategies, side-by-side

Greedy	Always pull argmax Q(a). Fast at exploiting; can lock onto a bad arm forever.
ε-Greedy	With probability ε explore uniformly, else exploit. Simple, effective baseline.
Optimistic Init	Set Q₀(a) high. Every untried arm "looks great" → forces early exploration.
UCB1	a = argmax Q(a) + c·√(ln t / N(a)). Bonus for under-tried arms. Logarithmic regret.
Thompson Sampling	Bayesian. Sample θ̂(a) from posterior over μ, pick argmax. Often the best in practice.
Gradient Bandit	Maintain preferences H(a), softmax → π(a). H ← H + α(r − r̄)(𝟙 − π).

Try this: set Strategy = ε-Greedy with ε = 0 — pure greedy. Run a few times and watch how often it gets stuck on a sub-optimal arm (% Optimal stays low). Now bump ε to 0.1. Regret grows slower.

Next: Markov Decision Processes →