10 · Multi-Agent RL — The Axelrod Tournament

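The environment now contains other learning agents. From any single agent's point of view the world is no longer stationary: the other agents change their behaviour as they learn, so the Markov assumption behind single-agent RL no longer holds. The classic study is Robert Axelrod's iterated prisoner's dilemma tournament (1980), which was famously won by Tit-for-Tat.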

Payoff matrix, written as (row move, column move) → (row payoff, column payoff):  (C,C) → (3,3),  (C,D) → (0,5),  (D,C) → (5,0),  (D,D) → (1,1)
T > R > P > S    (temptation T = 5, reward R = 3, punishment P = 1, sucker S = 0)
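The payoff table can be written down as a small lookup, which also makes the ordering easy to check. This is a minimal sketch: the names `PAYOFF`, `is_prisoners_dilemma` and `rewards_sustained_cooperation` are illustrative and not taken from the demo. The second check, 2R > T + S, is the standard extra condition for the iterated game (it is not stated above); it means two players do better by cooperating every round than by taking turns exploiting each other.

```python
# Payoff values from the matrix above.
T, R, P, S = 5, 3, 1, 0

# Keyed by (row move, column move); value is (row payoff, column payoff).
PAYOFF = {
    ("C", "C"): (R, R),
    ("C", "D"): (S, T),
    ("D", "C"): (T, S),
    ("D", "D"): (P, P),
}


def is_prisoners_dilemma(t, r, p, s):
    """True if the payoffs satisfy the ordering T > R > P > S."""
    return t > r > p > s


def rewards_sustained_cooperation(t, r, s):
    """Standard iterated-game condition (an addition here): 2R > T + S."""
    return 2 * r > t + s


print(is_prisoners_dilemma(T, R, P, S))
print(rewards_sustained_cooperation(T, R, S))
```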
Round-robin tournament

Each strategy plays every other strategy (and itself) for N rounds. The strategy with the highest total score wins. There is no best strategy in the abstract: how well a strategy scores depends on the mix of opponents it meets, so the winner can change when the field changes.
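A round-robin of this kind can be sketched in a few lines. The interface is an assumption made for illustration: a strategy is a function of (its own history, the opponent's history) that returns "C" or "D", and `play_match` and `round_robin` are hypothetical helper names. In a self-match only one copy's score is added to the total, which is also an assumption about how the demo counts self-play.

```python
from itertools import combinations_with_replacement

# (my move, their move) -> (my payoff, their payoff)
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}


def always_cooperate(mine, theirs):
    return "C"


def always_defect(mine, theirs):
    return "D"


def tit_for_tat(mine, theirs):
    # Cooperate first, then copy the opponent's previous move.
    return theirs[-1] if theirs else "C"


def play_match(strat_a, strat_b, n_rounds):
    """Play n_rounds and return the two total scores."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(n_rounds):
        move_a = strat_a(hist_a, hist_b)
        move_b = strat_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


def round_robin(strategies, n_rounds=200):
    """Every strategy meets every other strategy and itself once."""
    totals = {name: 0 for name in strategies}
    for name_a, name_b in combinations_with_replacement(strategies, 2):
        score_a, score_b = play_match(
            strategies[name_a], strategies[name_b], n_rounds)
        totals[name_a] += score_a
        if name_a != name_b:  # self-play: count one copy only (assumption)
            totals[name_b] += score_b
    return totals


STRATEGIES = {
    "Always Cooperate": always_cooperate,
    "Always Defect": always_defect,
    "Tit-for-Tat": tit_for_tat,
}
print(round_robin(STRATEGIES, n_rounds=10))
```

With only these three entrants and 10 rounds, Always Defect finishes first (74 points, against 69 for Tit-for-Tat and 60 for Always Cooperate), because a third of the field can be exploited for free. This is a small illustration of how the result depends on who else is in the tournament.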

Leaderboard (columns: rank, strategy, total score, average score per match)

Pairwise heatmap (average score earned by the row strategy against each column strategy)

Population dynamics (evolutionary mode)

After each tournament, each strategy reproduces in proportion to its score, so high scorers make up a larger share of the next generation. Watch which behaviours come to dominate.
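One common way to model score-proportional reproduction is a discrete replicator-style update, sketched below. This is an assumption about the mechanism, not a description of the demo's code: each strategy's fitness is its average score against the current population mix, and its new share is its old share times that fitness, renormalised. The `AVG_SCORE` numbers are per-round averages for 10-round matches between three of the strategies, worked out from the payoff matrix; `reproduce` is a hypothetical helper name.

```python
# AVG_SCORE[i][j]: average per-round score of strategy i against strategy j
# over a 10-round match (illustrative values derived from the payoff matrix).
AVG_SCORE = {
    "Always Cooperate": {"Always Cooperate": 3.0, "Always Defect": 0.0, "Tit-for-Tat": 3.0},
    "Always Defect":    {"Always Cooperate": 5.0, "Always Defect": 1.0, "Tit-for-Tat": 1.4},
    "Tit-for-Tat":      {"Always Cooperate": 3.0, "Always Defect": 0.9, "Tit-for-Tat": 3.0},
}


def reproduce(shares, avg_score):
    """One generation: new share is proportional to old share times fitness."""
    fitness = {
        i: sum(shares[j] * avg_score[i][j] for j in shares)
        for i in shares
    }
    mean_fitness = sum(shares[i] * fitness[i] for i in shares)
    return {i: shares[i] * fitness[i] / mean_fitness for i in shares}


# Start from equal shares and iterate for 50 generations.
shares = {name: 1 / 3 for name in AVG_SCORE}
for _ in range(50):
    shares = reproduce(shares, AVG_SCORE)

for name, share in shares.items():
    print(f"{name}: {share:.3f}")
```

In this three-strategy example Always Defect gains share in the first generations by exploiting Always Cooperate, and Tit-for-Tat tends to take over once the unconditional cooperators have thinned out.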

The strategies

Always Cooperate: Always plays C. Maximally trusting. Exploited by defectors.
Always Defect: Always plays D. Defection is the strictly dominant action in the single-shot game; in the iterated game it ends in mutual defection against any opponent that retaliates.
Tit-for-Tat: Start with C, then copy the opponent's last move. Nice, retaliatory, forgiving, simple.
Tit-for-Two-Tats: Only retaliate after two defections in a row. More forgiving than Tit-for-Tat.
Grim Trigger: Cooperate until the opponent defects once, then defect forever.
Pavlov (Win-Stay/Lose-Shift): Repeat the last action if the last reward was ≥ 3, switch otherwise. Self-correcting.
Random: Plays C or D with probability 50/50. Adds noise.
Joss: Tit-for-Tat with 10% sneaky defections. Tries to exploit pure TfT.
Suspicious TfT: Like TfT but starts with D.
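Several of the rules above are short enough to write out directly. The sketch below assumes a hypothetical interface in which a strategy receives its own move history and the opponent's move history (lists of "C"/"D") and returns its next move. Two details are assumptions not stated in the list: Pavlov opens with C, and Joss's 10% defections replace moves where Tit-for-Tat would have cooperated.

```python
import random

# My payoff for (my move, their move), from the payoff matrix.
MY_PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def tit_for_two_tats(mine, theirs):
    # Retaliate only after two consecutive defections.
    return "D" if theirs[-2:] == ["D", "D"] else "C"


def grim_trigger(mine, theirs):
    # A single defection anywhere in the past triggers permanent defection.
    return "D" if "D" in theirs else "C"


def pavlov(mine, theirs):
    if not mine:
        return "C"  # assumption: open with cooperation
    last_reward = MY_PAYOFF[(mine[-1], theirs[-1])]
    if last_reward >= 3:  # win: stay with the same action
        return mine[-1]
    return "D" if mine[-1] == "C" else "C"  # lose: shift


def suspicious_tit_for_tat(mine, theirs):
    # Same as Tit-for-Tat except for the opening move.
    return theirs[-1] if theirs else "D"


def make_joss(p_defect=0.1, seed=None):
    """Build a Joss player with its own random number generator."""
    rng = random.Random(seed)

    def joss(mine, theirs):
        tft_move = theirs[-1] if theirs else "C"
        # Assumption: sneaky defections only replace would-be cooperation.
        if tft_move == "C" and rng.random() < p_defect:
            return "D"
        return tft_move

    return joss


print(pavlov(["C"], ["D"]))            # suckered last round, so it shifts
print(grim_trigger(["C", "C"], ["C", "C"]))
```

Keeping the random generator inside `make_joss` means a match can be replayed exactly by fixing `seed`, which is convenient when comparing runs.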
Axelrod's four lessons: Be nice (don't be first to defect). Be retaliatory (punish defection). Be forgiving (don't hold grudges). Be clear (predictable behaviour invites cooperation).
