10 · Multi-Agent RL — The Axelrod Tournament

The environment now contains other learning agents. From any single agent's point of view the Markov property breaks: the other agents keep changing their policies as they learn, so the environment is non-stationary. The classic study is Robert Axelrod's iterated prisoner's dilemma tournament (1980), which famously crowned Tit-for-Tat.

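Payoff matrix, written as (row move, column move) → (row payoff, column payoff): (C,C) → (3,3), (C,D) → (0,5), (D,C) → (5,0), (D,D) → (1,1)
The row player's four payoffs are R = 3, S = 0, T = 5, P = 1, ordered T > R > P > S (temptation, reward, punishment, sucker's payoff).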
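The payoff table can be written down directly as a lookup, which makes the ordering easy to check. This is a minimal sketch; the names `PAYOFF`, `is_prisoners_dilemma` and `rewards_mutual_cooperation` are illustrative. The second check, 2R > T + S, is not stated above: it is the standard extra condition for the iterated game, ensuring that two players who take turns exploiting each other do no better than two who simply cooperate.

```python
# Payoff values from the matrix above.
T, R, P, S = 5, 3, 1, 0

# (row move, column move) -> (row payoff, column payoff)
PAYOFF = {
    ("C", "C"): (R, R),
    ("C", "D"): (S, T),
    ("D", "C"): (T, S),
    ("D", "D"): (P, P),
}

def is_prisoners_dilemma(t, r, p, s):
    """True if the payoffs satisfy the ordering T > R > P > S."""
    return t > r > p > s

def rewards_mutual_cooperation(t, r, s):
    """True if 2R > T + S (extra condition, assumed here, not stated in the text)."""
    return 2 * r > t + s

print(is_prisoners_dilemma(T, R, P, S))       # ordering check
print(rewards_mutual_cooperation(T, R, S))    # alternating exploitation check
```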

▶ live demo — round-robin tournament

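Each strategy plays every other strategy (and a copy of itself) for N rounds. The strategy with the highest total score wins. There is no best strategy in the abstract: a strategy's total depends on which opponents are in the field, so a strategy that thrives by exploiting one opponent's weakness can be punished by another.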
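A round-robin of this kind can be sketched in a few lines. This is an illustration, not the demo's actual code: the function names, the `(mine, theirs)` strategy signature, and the choice to count only one side's score in a self-play match are all assumptions.

```python
from itertools import combinations_with_replacement

# (move_a, move_b) -> (payoff_a, payoff_b), values from the payoff matrix above
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

# A strategy maps (own history, opponent history) to "C" or "D".
def always_cooperate(mine, theirs):
    return "C"

def always_defect(mine, theirs):
    return "D"

def tit_for_tat(mine, theirs):
    return theirs[-1] if theirs else "C"

def play_match(strat_a, strat_b, rounds=200):
    """Play one iterated match and return both total scores."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strat_a(hist_a, hist_b)
        move_b = strat_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

def round_robin(strategies, rounds=200):
    """Every strategy meets every other strategy and itself once."""
    totals = {name: 0 for name in strategies}
    for name_a, name_b in combinations_with_replacement(strategies, 2):
        score_a, score_b = play_match(strategies[name_a], strategies[name_b], rounds)
        totals[name_a] += score_a
        if name_a != name_b:  # self-play: count one side only (assumption)
            totals[name_b] += score_b
    return totals

field = {"AC": always_cooperate, "AD": always_defect, "TfT": tit_for_tat}
print(round_robin(field))
```

With only these three strategies, Always Defect finishes narrowly ahead (1404 against 1399 for Tit-for-Tat) because it can farm Always Cooperate. Drop Always Cooperate from the field and Tit-for-Tat wins 799 to 404, which illustrates how much the result depends on who else is playing.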

Controls (default values): rounds per match 200, noise 0.00, evolutionary rounds 0.
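The page does not say how the noise setting is applied. One common reading, assumed here, is that each intended move is flipped with probability equal to the noise value. The helper name `apply_noise` and the explicit `rng` argument are illustrative choices.

```python
import random

def apply_noise(move, noise, rng):
    """Flip the intended move with probability `noise`.

    Assumption: this is one plausible meaning of the Noise control,
    not a description of the demo's actual behaviour.
    """
    if rng.random() < noise:
        return "D" if move == "C" else "C"
    return move

rng = random.Random(0)  # seeded so runs are repeatable
intended = ["C"] * 10
played = [apply_noise(m, 0.05, rng) for m in intended]
print("".join(played))
```

Under this reading, a noise of 0.00 leaves every move untouched, and even a small value matters: a single accidental defection between two Tit-for-Tat players starts a chain of alternating retaliation.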

Leaderboard (columns: rank, strategy, total score, average score per match)

Pairwise heatmap (each cell: average score of the row strategy against the column strategy)

Population dynamics (evolutionary mode)

After each tournament, each strategy reproduces in proportion to its score, so high scorers gain population share and low scorers lose it. Watch which behaviours dominate.
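One way to sketch score-proportional reproduction is a discrete update in which a strategy's fitness is its expected match score against the current population, and its share is rescaled by that fitness. This is an assumption about the mechanism, not the demo's code; `evolve` and `MATCH_SCORES` are illustrative names. The scores in the table are the 200-round match totals for three of the strategies under the payoff matrix above.

```python
# MATCH_SCORES[a][b]: total score of strategy a in one 200-round match against b
MATCH_SCORES = {
    "AC":  {"AC": 600,  "AD": 0,   "TfT": 600},
    "AD":  {"AC": 1000, "AD": 200, "TfT": 204},
    "TfT": {"AC": 600,  "AD": 199, "TfT": 600},
}

def evolve(shares, scores, generations=1):
    """Update population shares in proportion to expected score.

    shares: {strategy: fraction of the population}, summing to 1.
    """
    for _ in range(generations):
        # Expected score of each strategy against a randomly drawn opponent.
        fitness = {a: sum(shares[b] * scores[a][b] for b in shares) for a in shares}
        mean_fitness = sum(shares[a] * fitness[a] for a in shares)
        # Each share grows or shrinks by its fitness relative to the mean.
        shares = {a: shares[a] * fitness[a] / mean_fitness for a in shares}
    return shares

start = {"AC": 1 / 3, "AD": 1 / 3, "TfT": 1 / 3}
print(evolve(start, MATCH_SCORES, generations=1))
print(evolve(start, MATCH_SCORES, generations=50))
```

Starting from equal shares, Always Defect gains the most in the first generation, but after 50 generations Tit-for-Tat holds the largest share: once the cooperators it feeds on become scarcer, defection stops paying.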

The strategies

Always Cooperate	Always plays C. Maximally trusting. Exploited by defectors.
Always Defect	Always plays D. Defection is the strictly dominant action in the one-shot game. In the iterated game it locks into mutual defection against any retaliating opponent.
Tit-for-Tat	Start with C, then copy opponent's last move. Nice, retaliatory, forgiving, simple.
Tit-for-Two-Tats	Only retaliate after two defections in a row. More forgiving.
Grim Trigger	Cooperate until opponent defects once, then defect forever.
Pavlov (Win-Stay/Lose-Shift)	Repeat last action if reward ≥ 3, switch otherwise. Self-correcting.
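Random	Plays C or D with equal probability, ignoring history. Adds noise to the field.
Joss	Tit-for-Tat, but defects with 10% probability when it would otherwise cooperate. Tries to exploit pure TfT.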
Suspicious TfT	Like TfT but starts with D.
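The table above translates almost line for line into small functions. The sketch below assumes each strategy receives its own move history, the opponent's move history, and a random number generator; that signature and the function names are illustrative. Pavlov's opening move is not specified in the table, so cooperating first is an assumption.

```python
import random

# Row player's payoff for (my move, their move), from the payoff matrix.
ROW_PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def always_cooperate(mine, theirs, rng):
    return "C"

def always_defect(mine, theirs, rng):
    return "D"

def tit_for_tat(mine, theirs, rng):
    return theirs[-1] if theirs else "C"

def tit_for_two_tats(mine, theirs, rng):
    # Retaliate only after two consecutive defections.
    return "D" if theirs[-2:] == ["D", "D"] else "C"

def grim_trigger(mine, theirs, rng):
    # One defection by the opponent is never forgiven.
    return "D" if "D" in theirs else "C"

def pavlov(mine, theirs, rng):
    if not mine:
        return "C"  # opening move is an assumption
    if ROW_PAYOFF[(mine[-1], theirs[-1])] >= 3:
        return mine[-1]                       # win: stay
    return "D" if mine[-1] == "C" else "C"    # lose: shift

def random_strategy(mine, theirs, rng):
    return rng.choice("CD")

def joss(mine, theirs, rng):
    intended = tit_for_tat(mine, theirs, rng)
    # Sneak in a defection 10% of the time it would have cooperated.
    if intended == "C" and rng.random() < 0.1:
        return "D"
    return intended

def suspicious_tit_for_tat(mine, theirs, rng):
    return theirs[-1] if theirs else "D"

rng = random.Random(0)
print(pavlov(["C"], ["D"], rng))  # sucker's payoff last round, so it shifts
```

Keeping every strategy a pure function of the two histories makes it easy to replay a match or to add a new entrant without touching the tournament loop.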

Axelrod's four lessons: Be nice (don't be first to defect). Be retaliatory (punish defection). Be forgiving (don't hold grudges). Be clear (predictable behaviour invites cooperation).
