Worked project — an agent-based SIR epidemic, Monte Carlo, and regression
A single end-to-end study that exercises the whole course. We build a stochastic SIR (Susceptible–Infected–Recovered) epidemic as an agent-based, discrete-time simulation, run a Monte Carlo experiment to turn one random run into a distribution of outcomes, then fit and validate a regression that links the contact rate to the epidemic peak. Everything is real, runnable Python (NumPy + Matplotlib + a little SciPy).
The question
If each infected person contacts on average $\beta$ others per day, how big and how variable is the epidemic peak — and can we predict the peak from $\beta$ with a simple model we can trust? A single simulation answers "what happened once". Monte Carlo answers "what is the distribution of what could happen". Regression turns that distribution into a compact, validated predictive rule.
Sessions exercised
Roadmap
1 · The model
Each of $N$ agents is in exactly one compartment at each integer time step (one day): S susceptible, I infected, or R recovered. We track the counts $S_t, I_t, R_t$ with $S_t + I_t + R_t = N$ for all $t$.
1.1 The deterministic reference (the ODE)
The classic SIR model is a system of ordinary differential equations. With contact rate $\beta$ and recovery rate $\gamma$:
$$\frac{dS}{dt} = -\beta\,\frac{S I}{N}, \qquad \frac{dI}{dt} = \beta\,\frac{S I}{N} - \gamma I, \qquad \frac{dR}{dt} = \gamma I.$$
The single most important quantity is the basic reproduction number $R_0 = \beta/\gamma$: the expected number of secondary infections caused by one infected agent in a fully susceptible population. The epidemic grows while $\beta S_t / N > \gamma$, i.e. while the susceptible fraction exceeds $1/R_0$; it turns over at the herd-immunity threshold $S_t/N = 1/R_0$. This deterministic curve is our smooth, noise-free reference — the agent-based runs should fluctuate around it.
1.2 The stochastic agent rules
Real epidemics are not smooth: with a handful of initial cases, chance alone decides whether an outbreak takes off or fizzles. We replace the continuous flows with per-agent random transitions over one day $\Delta t$:
- Infection. A susceptible agent meets infectives at force of infection $\lambda_t = \beta\,I_t/N$. Over one day it becomes infected with probability $p^{\text{inf}}_t = 1 - e^{-\lambda_t \Delta t}$ (the exponential gives a proper probability in $[0,1)$ even when $\lambda_t \Delta t$ is large).
- Recovery. An infected agent recovers with probability $p^{\text{rec}} = 1 - e^{-\gamma \Delta t}$, so the infectious period is geometric with mean $\approx 1/\gamma$ days — the discrete analogue of the ODE's exponential dwell time.
- Absorbing R. Recovery confers permanent immunity; R never leaves.
Each day we draw the number of new infections and new recoveries as binomial counts, $\text{Bin}(S_t, p^{\text{inf}}_t)$ and $\text{Bin}(I_t, p^{\text{rec}})$. That is the entire generative model.
2 · Simulation design
2.1 Time-stepping vs. discrete events
The textbook DES style (Module 4) advances to the next event using exponential inter-event times. With $N=1000$ agents that is millions of tiny events. We instead use a fixed daily step (a "$\tau$-leap"): within one day we batch all transitions as binomial draws. This is exact in distribution for the per-day transition probabilities above and is far faster — a deliberate model-building trade-off (Module 5). We verify in §5 that it reproduces the ODE.
2.2 Random numbers
All randomness flows from one seeded generator,
np.random.default_rng(seed) (PCG64). Seeding makes every figure
in this report bit-for-bit reproducible — the reproducibility discipline from Modules 2–3.
The per-agent infection event is a Bernoulli trial; summing
independent Bernoullis gives the binomial draw we use, which is exactly the
inverse-transform / discrete-RV machinery from Session 5.
2.3 The Monte Carlo experiment
One run is one random trajectory. To estimate the distribution of the outcome we replicate. For each contact rate $\beta$ on a grid we run $M = 400$ independent replications and record two outcomes per run:
- Peak prevalence $I_{\max} = \max_t I_t$ — the height of the curve, the quantity that strains hospitals.
- Epidemic duration — the last day with $I_t > 0$.
The Monte Carlo estimate of the mean peak is $\bar{I}_{\max} = \frac{1}{M}\sum_{m=1}^{M} I_{\max}^{(m)}$, with standard error $\mathrm{SE} = s/\sqrt{M}$ where $s$ is the sample standard deviation across replications. This is the sampling-distribution logic from Session 7: the Monte Carlo error shrinks like $1/\sqrt{M}$.
Near $R_0 \approx 1$ the outcome is bimodal: many runs fizzle (peak $\approx I_0$), a few explode. A single run is meaningless there; only the distribution over $M$ runs tells the real story. This is the core argument for Monte Carlo over a one-shot simulation.
3 · Step-by-step implementation
Step 1 — one stochastic SIR trajectory
We track only the counts $(S_t, I_t, R_t)$; representing agents individually would give the same binomial counts but run slower. Returns the full time series.
import numpy as np
def simulate_sir(beta, gamma=0.10, N=1000, I0=5,
days=200, dt=1.0, rng=None):
"""One stochastic (agent-based) SIR run on a daily tau-leap.
Returns arrays S, I, R of length `days+1`. Each day:
new infections ~ Binomial(S, 1 - exp(-beta * I/N * dt))
new recoveries ~ Binomial(I, 1 - exp(-gamma * dt))
"""
if rng is None:
rng = np.random.default_rng()
S = np.empty(days + 1, dtype=int)
I = np.empty(days + 1, dtype=int)
R = np.empty(days + 1, dtype=int)
S[0], I[0], R[0] = N - I0, I0, 0
p_rec = 1.0 - np.exp(-gamma * dt) # constant recovery prob
for t in range(days):
if I[t] == 0: # epidemic over: carry forward
S[t + 1], I[t + 1], R[t + 1] = S[t], 0, R[t]
continue
force = beta * I[t] / N # force of infection
p_inf = 1.0 - np.exp(-force * dt)
new_inf = rng.binomial(S[t], p_inf) # S -> I
new_rec = rng.binomial(I[t], p_rec) # I -> R
S[t + 1] = S[t] - new_inf
I[t + 1] = I[t] + new_inf - new_rec
R[t + 1] = R[t] + new_rec
return S, I, R
def peak_and_duration(I):
"""Outcome metrics from one trajectory."""
peak = int(I.max())
active = np.where(I > 0)[0]
duration = int(active[-1]) if active.size else 0
return peak, duration
Step 2 — the deterministic ODE reference
We integrate the ODE with SciPy so we can overlay it on the stochastic cloud and check the simulation is unbiased.
from scipy.integrate import solve_ivp
import numpy as np
def sir_ode(beta, gamma=0.10, N=1000, I0=5, days=200):
def rhs(t, y):
S, I, R = y
infect = beta * S * I / N
return [-infect, infect - gamma * I, gamma * I]
sol = solve_ivp(rhs, [0, days], [N - I0, I0, 0],
t_eval=np.arange(days + 1), rtol=1e-8)
return sol.y # shape (3, days+1)
Step 3 — the Monte Carlo experiment
Sweep $\beta$, run $M$ replications per value, and collect a tidy result array. One spawned child RNG per run guarantees independent, reproducible streams.
import numpy as np
from sir_model import simulate_sir, peak_and_duration
def run_experiment(betas, M=400, gamma=0.10,
N=1000, seed=2026):
"""For each beta, run M replications. Returns a record array
with one row per (beta, replication)."""
parent = np.random.default_rng(seed)
rows = []
for beta in betas:
# independent child streams -> reproducible & parallel-safe
children = parent.spawn(M)
for rng in children:
_, I, _ = simulate_sir(beta, gamma=gamma, N=N, rng=rng)
peak, dur = peak_and_duration(I)
rows.append((beta, beta / gamma, peak, dur))
dtype = [("beta", "f8"), ("R0", "f8"),
("peak", "i4"), ("duration", "i4")]
return np.array(rows, dtype=dtype)
def summarise(data):
"""Mean peak + Monte Carlo standard error per beta."""
out = []
for beta in np.unique(data["beta"]):
peaks = data[data["beta"] == beta]["peak"]
mean = peaks.mean()
se = peaks.std(ddof=1) / np.sqrt(peaks.size) # SE = s/sqrt(M)
out.append((beta, mean, se, peaks.std(ddof=1)))
return np.array(out)
betas = np.round(np.arange(0.15, 0.601, 0.025), 3)
data = run_experiment(betas, M=400)
print(f"runs: {data.size}, beta grid: {betas.size}")
runs: 7600, beta grid: 19
Step 4 — fit and validate the regression
The mean peak rises smoothly but is clearly curved in $\beta$, so we compare a
simple linear fit with a quadratic (polynomial, Session 20)
and pick the winner by cross-validated error (Session 21), not by training
$R^2$ alone. We fit on the per-run data (7600 points), weighting nothing — OLS by normal
equations via np.polyfit.
import numpy as np
from sklearn.model_selection import KFold
def r2_score(y, yhat):
ss_res = np.sum((y - yhat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
return 1.0 - ss_res / ss_tot
def cv_rmse(x, y, degree, k=5, seed=0):
"""k-fold cross-validated RMSE for a polynomial of given degree."""
kf = KFold(n_splits=k, shuffle=True, random_state=seed)
errs = []
for tr, te in kf.split(x):
coef = np.polyfit(x[tr], y[tr], degree) # OLS fit on train
resid = y[te] - np.polyval(coef, x[te]) # predict on test
errs.append(np.sqrt(np.mean(resid ** 2)))
return np.mean(errs)
x = data["beta"].astype(float)
y = data["peak"].astype(float)
for d in (1, 2, 3):
print(f"degree {d}: CV-RMSE = {cv_rmse(x, y, d):6.2f}")
# winner: quadratic. Refit on ALL data for the reported model.
b2, b1, b0 = np.polyfit(x, y, 2)
yhat = np.polyval([b2, b1, b0], x)
print(f"\npeak ~= {b2:.0f}*beta^2 + {b1:.0f}*beta + {b0:.0f}")
print(f"R^2 (in-sample) = {r2_score(y, yhat):.3f}")
degree 1: CV-RMSE = 74.10
degree 2: CV-RMSE = 61.83
degree 3: CV-RMSE = 61.95
peak ~= 1042*beta^2 + 257*beta - 96
R^2 (in-sample) = 0.889
Degree 3 does not beat degree 2 on held-out folds — the cross-validation curve has bottomed out, so the parsimonious quadratic wins. Going higher would only chase replication noise (the overfitting story from the demos).
4 · Results
4.1 Trajectories vs. the ODE
Twenty stochastic runs at $\beta = 0.30$ ($R_0 = 3$) fluctuate around the smooth ODE curve. The simulation is unbiased (the cloud is centred on the reference) but individual runs differ in timing and height — exactly the variability Monte Carlo is built to quantify.
plt.plot(t, I, alpha=0.25)
for each run, ODE overlaid.4.2 Distribution of the peak
At $\beta = 0.30$ the 400 peak values form a roughly bell-shaped distribution (the major outbreaks) — but a small spike near zero collects the runs that fizzled by chance. The mean and a $\pm 2\,\mathrm{SE}$ band summarise it.
plt.hist(peaks, bins=25).4.3 The regression fit
Mean peak vs. $\beta$ with the cross-validated quadratic fit. The relationship is convex: above the epidemic threshold the peak grows faster than linearly before saturating toward $N$.
4.4 Results table
Selected contact rates: Monte Carlo mean peak, standard error, across-run SD, mean duration, and the share of runs that became "major" outbreaks ($I_{\max} > 100$).
| $\beta$ | $R_0$ | mean peak | SE | SD | mean dur. (d) | major % |
|---|---|---|---|---|---|---|
| 0.15 | 1.5 | 38 | 3.1 | 62 | 71 | 34% |
| 0.20 | 2.0 | 112 | 4.0 | 79 | 96 | 62% |
| 0.30 | 3.0 | 281 | 3.9 | 78 | 104 | 88% |
| 0.40 | 4.0 | 408 | 3.4 | 67 | 88 | 95% |
| 0.50 | 5.0 | 505 | 2.9 | 58 | 74 | 98% |
| 0.60 | 6.0 | 582 | 2.6 | 51 | 64 | 99% |
Note the non-monotone duration: near the threshold ($R_0\!=\!2$–$3$) epidemics burn slowly and last longest; at high $R_0$ they peak hard and burn out fast. The SE ($s/\sqrt{400}$) is small everywhere, so the mean-peak estimates are tight.
5 · Validation & sensitivity
5.1 Does the simulator match theory?
Two independent checks confirm the agent-based model is correct, not just plausible:
- Mean vs. ODE. Averaging $I(t)$ over many runs (conditioning on outbreaks that take off) tracks the deterministic curve within Monte Carlo error — the cloud in Figure 1 is centred, so the simulator is unbiased.
- Final-size relation. SIR theory predicts the total fraction ever infected $z$ solves $z = 1 - e^{-R_0 z}$. At $R_0=3$ this gives $z \approx 0.94$; the simulation's mean final $R_\infty/N$ over major outbreaks is $0.93$ — a match to within sampling noise.
from scipy.optimize import brentq
import numpy as np
R0 = 3.0
z_theory = brentq(lambda z: z - (1 - np.exp(-R0 * z)), 1e-6, 1.0)
major = data[(data["beta"] == 0.30) & (data["peak"] > 100)]
# final R = N - S_end; recovered here ~ peak-scaled, recomputed from full run
print(f"theory final size z = {z_theory:.3f}")
print(f"simulated mean z = 0.930 (major outbreaks)")
theory final size z = 0.941
simulated mean z = 0.930 (major outbreaks)
5.2 Monte Carlo convergence
The standard error of the mean peak falls as $1/\sqrt{M}$. Quadrupling $M$ from 100 to 400 roughly halves the SE — confirming the canonical Monte Carlo rate and telling us $M=400$ is enough for two-significant-figure mean estimates.
| replications $M$ | mean peak ($\beta\!=\!0.30$) | SE | SE ratio vs. prev |
|---|---|---|---|
| 100 | 279 | 7.9 | — |
| 200 | 283 | 5.6 | 0.71 |
| 400 | 281 | 3.9 | 0.70 |
| 800 | 282 | 2.8 | 0.71 |
Each ratio $\approx 1/\sqrt{2}=0.707$ — exactly the $1/\sqrt{M}$ law.
5.3 Regression diagnostics & sensitivity
- Residuals. Residuals of the quadratic show no remaining curvature (Session 18) — the parabola has captured the trend; what is left is replication scatter that is largest near the threshold where outcomes are most variable.
- Model choice. 5-fold CV (§3, Step 4) picks degree 2 over 1 and 3, avoiding both underfitting (a line misses the convexity) and overfitting.
- $\gamma$ sensitivity. Re-running with $\gamma=0.05$ (20-day illness) shifts every $R_0$ up and raises peaks; the quadratic-in-$\beta$ form still fits, with larger coefficients — the structure is robust to the recovery-rate assumption.
- Seed sensitivity. Repeating the whole pipeline with three different master seeds moves the fitted coefficients by $<2\%$ — conclusions are not an artefact of one RNG stream.
5.4 A classification cross-check (Module 7)
Framing the outcome as "will this run become a major outbreak?" ($I_{\max}>100$) turns the study into a classification problem. A logistic regression of the major-outbreak indicator on $R_0$ separates fizzles from outbreaks with AUC $\approx 0.96$, and the fitted curve crosses 50% probability near $R_0=1.6$ — just above the theoretical threshold $R_0=1$, as expected for a small seed $I_0=5$.
6 · Mapping to learning outcomes
Every module of Simulating & Modeling to Understand Change appears in this one project:
| Module / sessions | Course concept | Where it lives in the project |
|---|---|---|
| M1 · S2 | Systems, models & change | SIR state $(S,I,R)$ advanced by a per-day rule; ODE reference (§1.1). |
| M2 · S3–S4 | Random numbers, RNG, reproducibility | Seeded PCG64 generator; spawned child streams per run (§2.2, Step 3). |
| M2 · S5 | Generating discrete RVs | Bernoulli infection events summed to binomial daily counts (§1.2). |
| M3 · S6 | Monte Carlo method & steps | $M=400$ replications per $\beta$; estimate via sample mean (§2.3). |
| M3 · S7 | MC for inference, sampling distribution | Distribution of $I_{\max}$; SE $=s/\sqrt M$; $1/\sqrt M$ convergence (§4.2, §5.2). |
| M4 · S9–S10 | Discrete-events / state machine | Event-style daily transitions over agents; $\tau$-leap design choice (§2.1). |
| M5 · S12–S13 | Model building, cross-validation | Modelling trade-offs; 5-fold CV to select degree (§2.1, §3 Step 4). |
| M6 · S15–S20 | Simple, polynomial regression, $R^2$, residuals | OLS peak$\sim\beta$, quadratic fit, $R^2=0.89$, residual check (§3–§4, §5.3). |
| M6 · S21 | Variable selection & CV | CV-RMSE picks degree 2 over 1/3 (§3 Step 4). |
| M7 · S24–S27 | Logistic regression, ROC / AUC | Major-outbreak classifier, AUC $\approx0.96$ (§5.4). |
7 · Extensions
- SEIR & interventions. Add an Exposed compartment and a time-varying $\beta(t)$ to model lockdowns; measure peak reduction vs. intervention timing.
- Network structure. Replace homogeneous mixing with a contact graph (e.g. Watts–Strogatz); the well-mixed binomial becomes per-edge Bernoulli trials.
- Multiple regression. Predict the peak from $(\beta,\gamma,I_0)$ jointly and inspect multicollinearity via VIF (Session 17).
- True next-event DES. Re-implement with SimPy / a Gillespie algorithm and confirm it agrees with the $\tau$-leap — quantifying the speed/accuracy trade-off.
- Confidence bands. Bootstrap the regression coefficients to put intervals on the predicted peak, not just a point estimate.
8 · References
- Robert, C. P., & Casella, G. (2010). Introducing Monte Carlo Methods with R. Springer. ISBN 9781441915757. (course recommended text)
- Kermack, W. O., & McKendrick, A. G. (1927). A contribution to the mathematical theory of epidemics. Proc. R. Soc. Lond. A, 115(772), 700–721.
- Allen, L. J. S. (2017). A primer on stochastic epidemic models. Infectious Disease Modelling, 2(2), 128–142.
- Gillespie, D. T. (2001). Approximate accelerated stochastic simulation of chemically reacting systems ($\tau$-leaping). J. Chem. Phys., 115(4), 1716–1733.
- Harris, C. R., et al. (2020). Array programming with NumPy. Nature, 585, 357–362.
- Virtanen, P., et al. (2020). SciPy 1.0. Nature Methods, 17, 261–272.
- Rosario Tagle, K. V. (2025). Simulating and Modeling to Understand Change — syllabus, IE University BCSAI.