10 · Multi-Agent RL — The Axelrod Tournament
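The environment now contains other learning agents. From any single agent's point of view the environment becomes non-stationary, because the other agents change their policies as they learn, so the Markov assumption behind single-agent RL no longer holds. The classic study is Robert Axelrod's iterated prisoner's dilemma tournament (1980), famously won by Tit-for-Tat.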
Payoff matrix (R, S, T, P), written as (row move, column move) → (row payoff, column payoff):
(C,C) → (3,3), (C,D) → (0,5), (D,C) → (5,0), (D,D) → (1,1)
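T > R > P > S, with T = 5 (temptation), R = 3 (reward), P = 1 (punishment), S = 0 (sucker's payoff). These values also satisfy 2R > T + S, so sustained mutual cooperation pays more than taking turns exploiting each other.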
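As a minimal sketch, the payoff table above can be written as a lookup keyed by the two moves. The names `PAYOFFS`, `is_prisoners_dilemma` and `defection_dominates` are illustrative choices, not part of the demo.

```python
# Payoffs keyed by (my_move, opponent_move); values are (my_score, opponent_score).
PAYOFFS = {
    ("C", "C"): (3, 3),  # both receive the reward R
    ("C", "D"): (0, 5),  # I receive the sucker's payoff S, the opponent receives T
    ("D", "C"): (5, 0),  # I receive the temptation T, the opponent receives S
    ("D", "D"): (1, 1),  # both receive the punishment P
}

R = PAYOFFS[("C", "C")][0]
S = PAYOFFS[("C", "D")][0]
T = PAYOFFS[("D", "C")][0]
P = PAYOFFS[("D", "D")][0]


def is_prisoners_dilemma(t, r, p, s):
    """Check the ordering T > R > P > S that defines the dilemma."""
    return t > r > p > s


def defection_dominates():
    """In a single round, D pays strictly more than C against either opponent move."""
    return all(
        PAYOFFS[("D", opp)][0] > PAYOFFS[("C", opp)][0] for opp in ("C", "D")
    )


print(is_prisoners_dilemma(T, R, P, S))  # True
print(defection_dominates())             # True
print(2 * R > T + S)                     # True
```

The last check is the extra condition usually imposed on the iterated game: alternating exploitation must not beat steady mutual cooperation.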
▶ live demo — round-robin tournament
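Each strategy plays every other strategy (and itself) for N rounds, and the strategy with the highest total score wins. No strategy is a best response in the abstract: a best response is always relative to the opponents in the field, so a strategy's total depends on who else is entered. Some strategies gain by exploiting weak opponents, others by sustaining mutual cooperation.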
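The round-robin described above can be sketched in a few lines. This is an illustration under stated assumptions, not the demo's actual code: the strategy signature `(my_history, opponent_history) -> move`, the names `play_match` and `round_robin`, and the convention that a self-play match adds only one side's score to the total are all choices made for this sketch.

```python
from itertools import combinations_with_replacement

# (my_move, opponent_move) -> (my_score, opponent_score)
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def always_cooperate(my_hist, opp_hist):
    return "C"


def always_defect(my_hist, opp_hist):
    return "D"


def tit_for_tat(my_hist, opp_hist):
    # Cooperate first, then copy the opponent's previous move.
    return opp_hist[-1] if opp_hist else "C"


def play_match(strat_a, strat_b, rounds):
    """Play one iterated match and return both total scores."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strat_a(hist_a, hist_b)
        move_b = strat_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


def round_robin(strategies, rounds=200):
    """Every strategy meets every other strategy and itself once."""
    totals = {name: 0 for name in strategies}
    for name_a, name_b in combinations_with_replacement(strategies, 2):
        score_a, score_b = play_match(strategies[name_a], strategies[name_b], rounds)
        totals[name_a] += score_a
        if name_a != name_b:  # assumption: self-play counts one side only
            totals[name_b] += score_b
    return totals


STRATEGIES = {
    "Always Cooperate": always_cooperate,
    "Always Defect": always_defect,
    "Tit-for-Tat": tit_for_tat,
}

print(round_robin(STRATEGIES, rounds=10))
# {'Always Cooperate': 60, 'Always Defect': 74, 'Tit-for-Tat': 69}
```

In this tiny three-strategy field Always Defect comes first because it has an unconditional cooperator to exploit, which illustrates how strongly the ranking depends on the field of entrants.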
Leaderboard
| # | Strategy | Total | Avg/match |
|---|---|---|---|
Pairwise heatmap (average score for row)
Single match replay
Pick two strategies and click "Watch one match".
Population dynamics (evolutionary mode)
After each tournament, each strategy's share of the population grows or shrinks in proportion to its score. Watch which behaviours come to dominate.
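One common way to model "reproduces in proportion to its score" is a discrete replicator update, sketched below. Whether the demo uses exactly this rule is an assumption. The `SCORES` table is illustrative: each entry is the row strategy's total over a 10-round match against the column strategy, worked out by hand from the payoff matrix for three of the listed strategies.

```python
# SCORES[a][b]: total score of strategy a over a 10-round match against b
# (hand-derived from the payoff matrix; the 10-round length is an assumption).
SCORES = {
    "AllC": {"AllC": 30, "AllD": 0, "TfT": 30},
    "AllD": {"AllC": 50, "AllD": 10, "TfT": 14},
    "TfT": {"AllC": 30, "AllD": 9, "TfT": 30},
}


def replicator_step(shares, scores):
    """Return new population shares after one generation.

    A strategy's fitness is its expected score against a randomly drawn
    member of the current population; its new share is proportional to
    old share times fitness.
    """
    names = list(shares)
    fitness = {
        a: sum(shares[b] * scores[a][b] for b in names) for a in names
    }
    mean_fitness = sum(shares[a] * fitness[a] for a in names)
    return {a: shares[a] * fitness[a] / mean_fitness for a in names}


shares = {name: 1 / 3 for name in SCORES}
for generation in range(1, 6):
    shares = replicator_step(shares, SCORES)
    rounded = {name: round(share, 3) for name, share in shares.items()}
    print(generation, rounded)
```

Dividing by the mean fitness keeps the shares summing to one, so only relative performance matters. In the first generation the unconditional cooperators lose share and the defectors gain it.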
The strategies
| Strategy | Behaviour |
|---|---|
| Always Cooperate | Always plays C. Maximally trusting. Exploited by defectors. |
| Always Defect | Always plays D. Strictly dominant in the one-shot game, but locks into mutual defection against retaliating opponents in the iterated game. |
| Tit-for-Tat | Start with C, then copy opponent's last move. Nice, retaliatory, forgiving, simple. |
| Tit-for-Two-Tats | Only retaliate after two defections in a row. More forgiving. |
| Grim Trigger | Cooperate until opponent defects once, then defect forever. |
| Pavlov (Win-Stay/Lose-Shift) | Repeat last action if the last payoff was ≥ 3 (R or T), switch otherwise. Self-correcting. |
| Random | Plays C or D with equal probability. Adds noise. |
| Joss | Tit-for-Tat, but defects 10% of the time when it would otherwise cooperate. Tries to exploit pure TfT. |
| Suspicious TfT | Like TfT, but starts with D. |
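Several of the rules in the table are short enough to write directly as functions. The sketch below assumes a strategy is a function of the two move histories, that Pavlov opens with C (the table does not say), and that Joss takes its randomness from a caller-supplied `random.Random`. The helper names `make_joss` and `play` are hypothetical.

```python
import random

# My own payoff for (my_move, opponent_move).
MY_PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def tit_for_tat(mine, theirs):
    return theirs[-1] if theirs else "C"


def suspicious_tit_for_tat(mine, theirs):
    return theirs[-1] if theirs else "D"


def tit_for_two_tats(mine, theirs):
    # Retaliate only when the opponent's last two moves were both D.
    return "D" if theirs[-2:] == ["D", "D"] else "C"


def grim_trigger(mine, theirs):
    # A single defection anywhere in the history is never forgiven.
    return "D" if "D" in theirs else "C"


def pavlov(mine, theirs):
    if not mine:
        return "C"  # assumption: open with cooperation
    if MY_PAYOFF[(mine[-1], theirs[-1])] >= 3:
        return mine[-1]  # win: stay
    return "D" if mine[-1] == "C" else "C"  # lose: shift


def make_joss(rng, defect_prob=0.1):
    """Tit-for-Tat that turns an intended C into D with probability defect_prob."""
    def joss(mine, theirs):
        move = tit_for_tat(mine, theirs)
        if move == "C" and rng.random() < defect_prob:
            return "D"
        return move
    return joss


def play(strat_a, strat_b, rounds):
    """Return both move sequences as strings, for example 'CDCD'."""
    hist_a, hist_b = [], []
    for _ in range(rounds):
        move_a = strat_a(hist_a, hist_b)
        move_b = strat_b(hist_b, hist_a)
        hist_a.append(move_a)
        hist_b.append(move_b)
    return "".join(hist_a), "".join(hist_b)


print(play(tit_for_tat, suspicious_tit_for_tat, 6))       # ('CDCDCD', 'DCDCDC')
print(play(tit_for_two_tats, suspicious_tit_for_tat, 6))  # ('CCCCCC', 'DCCCCC')
print(play(grim_trigger, suspicious_tit_for_tat, 5))      # ('CDDDD', 'DCDDD')
joss = make_joss(random.Random(0))
print(play(joss, tit_for_tat, 20))  # depends on the seed, so no output is shown
```

The first two lines of output show why forgiveness matters: plain Tit-for-Tat gets trapped in an alternating echo of defections against Suspicious TfT, while Tit-for-Two-Tats absorbs the opening defection and settles into mutual cooperation.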
Axelrod's four lessons: Be nice (don't be first to defect). Be retaliatory (punish defection). Be forgiving (don't hold grudges). Be clear (predictable behaviour invites cooperation).