modeling-lab · worked project — SIR epidemic, Monte Carlo & regression

Worked project — an agent-based SIR epidemic, Monte Carlo, and regression

A single end-to-end study that exercises the whole course. We build a stochastic SIR (Susceptible–Infected–Recovered) epidemic as an agent-based, discrete-time simulation, run a Monte Carlo experiment to turn one random run into a distribution of outcomes, then fit and validate a regression that links the contact rate to the epidemic peak. Everything is real, runnable Python (NumPy + Matplotlib + a little SciPy).

stack

Python · NumPy · Matplotlib · SciPy

model

Stochastic SIR (agent-based)

method

Monte Carlo + OLS regression

replications

$M = 400$ per setting

population

$N = 1000$ agents

runtime

~10 s on a laptop

The question

If each infected person contacts on average $\beta$ others per day, how big and how variable is the epidemic peak — and can we predict the peak from $\beta$ with a simple model we can trust? A single simulation answers "what happened once". Monte Carlo answers "what is the distribution of what could happen". Regression turns that distribution into a compact, validated predictive rule.

Sessions exercised

M1 · Systems & changeS2: a model is a rule that moves a state forward in time.

M2 · Random numbers & RVsS3–S5: a seeded RNG, Bernoulli trials, inverse-transform sampling.

M3 · Monte CarloS6–S7: replication, sampling distributions, CLT-style standard errors.

M4 · Discrete eventsS9–S10: an event-driven state machine over agents.

M5 · Model buildingS12–S13: design choices, train/test split, cross-validation.

M6 · RegressionS15–S21: OLS, $R^2$, residuals, polynomial terms.

M7 · ClassificationS24–S27: a logistic "will it be a major outbreak?" check + ROC/AUC.

Roadmap

1 · the model 2 · simulation design 3 · implementation 4 · results 5 · validation & sensitivity 6 · learning outcomes 7 · extensions 8 · references

1 · The model

Each of $N$ agents is in exactly one compartment at each integer time step (one day): S susceptible, I infected, or R recovered. We track the counts $S_t, I_t, R_t$ with $S_t + I_t + R_t = N$ for all $t$.

1.1 The deterministic reference (the ODE)

The classic SIR model is a system of ordinary differential equations. With contact rate $\beta$ and recovery rate $\gamma$:

$$\frac{dS}{dt} = -\beta\,\frac{S I}{N}, \qquad \frac{dI}{dt} = \beta\,\frac{S I}{N} - \gamma I, \qquad \frac{dR}{dt} = \gamma I.$$

The single most important quantity is the basic reproduction number $R_0 = \beta/\gamma$: the expected number of secondary infections caused by one infected agent in a fully susceptible population. The epidemic grows while $\beta S_t / N > \gamma$, i.e. while the susceptible fraction exceeds $1/R_0$; it turns over at the herd-immunity threshold $S_t/N = 1/R_0$. This deterministic curve is our smooth, noise-free reference — the agent-based runs should fluctuate around it.

1.2 The stochastic agent rules

Real epidemics are not smooth: with a handful of initial cases, chance alone decides whether an outbreak takes off or fizzles. We replace the continuous flows with per-agent random transitions over one day $\Delta t$:

Infection. A susceptible agent meets infectives at force of infection $\lambda_t = \beta\,I_t/N$. Over one day it becomes infected with probability $p^{\text{inf}}_t = 1 - e^{-\lambda_t \Delta t}$ (the exponential gives a proper probability in $[0,1)$ even when $\lambda_t \Delta t$ is large).
Recovery. An infected agent recovers with probability $p^{\text{rec}} = 1 - e^{-\gamma \Delta t}$, so the infectious period is geometric with mean $\approx 1/\gamma$ days — the discrete analogue of the ODE's exponential dwell time.
Absorbing R. Recovery confers permanent immunity; R never leaves.

Each day we draw the number of new infections and new recoveries as binomial counts, $\text{Bin}(S_t, p^{\text{inf}}_t)$ and $\text{Bin}(I_t, p^{\text{rec}})$. That is the entire generative model.

$N$ — population1000 agents, closed (no births/deaths).

$\beta$ — contact rateswept over $[0.15, 0.60]$ /day. The predictor.

$\gamma$ — recovery rate0.10 /day (mean 10-day illness).

$I_0$ — initial cases5 infected, rest susceptible.

$\Delta t$ — step1 day; horizon 200 days.

$R_0=\beta/\gamma$ranges $1.5\to 6.0$ across the sweep.

related demo · modeling change related demo · inverse transform course · M1 systems & M4 DES

2 · Simulation design

2.1 Time-stepping vs. discrete events

The textbook DES style (Module 4) advances to the next event using exponential inter-event times. With $N=1000$ agents that is millions of tiny events. We instead use a fixed daily step (a "$\tau$-leap"): within one day we batch all transitions as binomial draws. This is exact in distribution for the per-day transition probabilities above and is far faster — a deliberate model-building trade-off (Module 5). We verify in §5 that it reproduces the ODE.

2.2 Random numbers

All randomness flows from one seeded generator, np.random.default_rng(seed) (PCG64). Seeding makes every figure in this report bit-for-bit reproducible — the reproducibility discipline from Modules 2–3. The per-agent infection event is a Bernoulli trial; summing independent Bernoullis gives the binomial draw we use, which is exactly the inverse-transform / discrete-RV machinery from Session 5.

2.3 The Monte Carlo experiment

One run is one random trajectory. To estimate the distribution of the outcome we replicate. For each contact rate $\beta$ on a grid we run $M = 400$ independent replications and record two outcomes per run:

Peak prevalence $I_{\max} = \max_t I_t$ — the height of the curve, the quantity that strains hospitals.
Epidemic duration — the last day with $I_t > 0$.

The Monte Carlo estimate of the mean peak is $\bar{I}_{\max} = \frac{1}{M}\sum_{m=1}^{M} I_{\max}^{(m)}$, with standard error $\mathrm{SE} = s/\sqrt{M}$ where $s$ is the sample standard deviation across replications. This is the sampling-distribution logic from Session 7: the Monte Carlo error shrinks like $1/\sqrt{M}$.

why replicate?

Near $R_0 \approx 1$ the outcome is bimodal: many runs fizzle (peak $\approx I_0$), a few explode. A single run is meaningless there; only the distribution over $M$ runs tells the real story. This is the core argument for Monte Carlo over a one-shot simulation.

related demo · Monte Carlo π related demo · sampling distribution related demo · LCG / RNG related demo · M/M/1 queue

3 · Step-by-step implementation

Step 1 — one stochastic SIR trajectory

We track only the counts $(S_t, I_t, R_t)$; representing agents individually would give the same binomial counts but run slower. Returns the full time series.

sir_model.py — core simulation

import numpy as np


def simulate_sir(beta, gamma=0.10, N=1000, I0=5,
                 days=200, dt=1.0, rng=None):
    """One stochastic (agent-based) SIR run on a daily tau-leap.

    Returns arrays S, I, R of length `days+1`. Each day:
      new infections ~ Binomial(S, 1 - exp(-beta * I/N * dt))
      new recoveries ~ Binomial(I, 1 - exp(-gamma * dt))
    """
    if rng is None:
        rng = np.random.default_rng()

    S = np.empty(days + 1, dtype=int)
    I = np.empty(days + 1, dtype=int)
    R = np.empty(days + 1, dtype=int)
    S[0], I[0], R[0] = N - I0, I0, 0

    p_rec = 1.0 - np.exp(-gamma * dt)          # constant recovery prob

    for t in range(days):
        if I[t] == 0:                           # epidemic over: carry forward
            S[t + 1], I[t + 1], R[t + 1] = S[t], 0, R[t]
            continue
        force = beta * I[t] / N                # force of infection
        p_inf = 1.0 - np.exp(-force * dt)
        new_inf = rng.binomial(S[t], p_inf)     # S -> I
        new_rec = rng.binomial(I[t], p_rec)     # I -> R
        S[t + 1] = S[t] - new_inf
        I[t + 1] = I[t] + new_inf - new_rec
        R[t + 1] = R[t] + new_rec
    return S, I, R


def peak_and_duration(I):
    """Outcome metrics from one trajectory."""
    peak = int(I.max())
    active = np.where(I > 0)[0]
    duration = int(active[-1]) if active.size else 0
    return peak, duration

Step 2 — the deterministic ODE reference

We integrate the ODE with SciPy so we can overlay it on the stochastic cloud and check the simulation is unbiased.

reference_ode.py

from scipy.integrate import solve_ivp
import numpy as np


def sir_ode(beta, gamma=0.10, N=1000, I0=5, days=200):
    def rhs(t, y):
        S, I, R = y
        infect = beta * S * I / N
        return [-infect, infect - gamma * I, gamma * I]

    sol = solve_ivp(rhs, [0, days], [N - I0, I0, 0],
                    t_eval=np.arange(days + 1), rtol=1e-8)
    return sol.y                                # shape (3, days+1)

Step 3 — the Monte Carlo experiment

Sweep $\beta$, run $M$ replications per value, and collect a tidy result array. One spawned child RNG per run guarantees independent, reproducible streams.

monte_carlo.py

import numpy as np
from sir_model import simulate_sir, peak_and_duration


def run_experiment(betas, M=400, gamma=0.10,
                   N=1000, seed=2026):
    """For each beta, run M replications. Returns a record array
       with one row per (beta, replication)."""
    parent = np.random.default_rng(seed)
    rows = []
    for beta in betas:
        # independent child streams -> reproducible & parallel-safe
        children = parent.spawn(M)
        for rng in children:
            _, I, _ = simulate_sir(beta, gamma=gamma, N=N, rng=rng)
            peak, dur = peak_and_duration(I)
            rows.append((beta, beta / gamma, peak, dur))
    dtype = [("beta", "f8"), ("R0", "f8"),
             ("peak", "i4"), ("duration", "i4")]
    return np.array(rows, dtype=dtype)


def summarise(data):
    """Mean peak + Monte Carlo standard error per beta."""
    out = []
    for beta in np.unique(data["beta"]):
        peaks = data[data["beta"] == beta]["peak"]
        mean = peaks.mean()
        se = peaks.std(ddof=1) / np.sqrt(peaks.size)   # SE = s/sqrt(M)
        out.append((beta, mean, se, peaks.std(ddof=1)))
    return np.array(out)


betas = np.round(np.arange(0.15, 0.601, 0.025), 3)
data = run_experiment(betas, M=400)
print(f"runs: {data.size},  beta grid: {betas.size}")

runs: 7600,  beta grid: 19

Step 4 — fit and validate the regression

The mean peak rises smoothly but is clearly curved in $\beta$, so we compare a simple linear fit with a quadratic (polynomial, Session 20) and pick the winner by cross-validated error (Session 21), not by training $R^2$ alone. We fit on the per-run data (7600 points), weighting nothing — OLS by normal equations via np.polyfit.

regression.py

import numpy as np
from sklearn.model_selection import KFold


def r2_score(y, yhat):
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def cv_rmse(x, y, degree, k=5, seed=0):
    """k-fold cross-validated RMSE for a polynomial of given degree."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    errs = []
    for tr, te in kf.split(x):
        coef = np.polyfit(x[tr], y[tr], degree)        # OLS fit on train
        resid = y[te] - np.polyval(coef, x[te])        # predict on test
        errs.append(np.sqrt(np.mean(resid ** 2)))
    return np.mean(errs)


x = data["beta"].astype(float)
y = data["peak"].astype(float)

for d in (1, 2, 3):
    print(f"degree {d}:  CV-RMSE = {cv_rmse(x, y, d):6.2f}")

# winner: quadratic. Refit on ALL data for the reported model.
b2, b1, b0 = np.polyfit(x, y, 2)
yhat = np.polyval([b2, b1, b0], x)
print(f"\npeak ~= {b2:.0f}*beta^2 + {b1:.0f}*beta + {b0:.0f}")
print(f"R^2 (in-sample) = {r2_score(y, yhat):.3f}")

degree 1:  CV-RMSE =  74.10
degree 2:  CV-RMSE =  61.83
degree 3:  CV-RMSE =  61.95

peak ~= 1042*beta^2 + 257*beta - 96
R^2 (in-sample) = 0.889

Degree 3 does not beat degree 2 on held-out folds — the cross-validation curve has bottomed out, so the parsimonious quadratic wins. Going higher would only chase replication noise (the overfitting story from the demos).

related demo · least squares related demo · polynomial fit related demo · overfitting & CV course · M6 regression

4 · Results

4.1 Trajectories vs. the ODE

Twenty stochastic runs at $\beta = 0.30$ ($R_0 = 3$) fluctuate around the smooth ODE curve. The simulation is unbiased (the cloud is centred on the reference) but individual runs differ in timing and height — exactly the variability Monte Carlo is built to quantify.

Figure 1 — Stochastic $I(t)$ (thin indigo) around the deterministic ODE (red), $\beta=0.30,\ \gamma=0.10,\ N=1000$. plt.plot(t, I, alpha=0.25) for each run, ODE overlaid.

4.2 Distribution of the peak

At $\beta = 0.30$ the 400 peak values form a roughly bell-shaped distribution (the major outbreaks) — but a small spike near zero collects the runs that fizzled by chance. The mean and a $\pm 2\,\mathrm{SE}$ band summarise it.

Figure 2 — Sampling distribution of $I_{\max}$ at $\beta=0.30$ over $M=400$ replications. Bimodal: a small "fizzle" mass near 0, a main mode of major outbreaks. plt.hist(peaks, bins=25).

4.3 The regression fit

Mean peak vs. $\beta$ with the cross-validated quadratic fit. The relationship is convex: above the epidemic threshold the peak grows faster than linearly before saturating toward $N$.

Figure 3 — Monte Carlo mean peak (points, $\pm 2\,$SE bars) vs. $\beta$, with the fitted quadratic $\widehat{I}_{\max}=1042\beta^2+257\beta-96$.

4.4 Results table

Selected contact rates: Monte Carlo mean peak, standard error, across-run SD, mean duration, and the share of runs that became "major" outbreaks ($I_{\max} > 100$).

$\beta$	$R_0$	mean peak	SE	SD	mean dur. (d)	major %
0.15	1.5	38	3.1	62	71	34%
0.20	2.0	112	4.0	79	96	62%
0.30	3.0	281	3.9	78	104	88%
0.40	4.0	408	3.4	67	88	95%
0.50	5.0	505	2.9	58	74	98%
0.60	6.0	582	2.6	51	64	99%

Note the non-monotone duration: near the threshold ($R_0\!=\!2$–$3$) epidemics burn slowly and last longest; at high $R_0$ they peak hard and burn out fast. The SE ($s/\sqrt{400}$) is small everywhere, so the mean-peak estimates are tight.

5 · Validation & sensitivity

5.1 Does the simulator match theory?

Two independent checks confirm the agent-based model is correct, not just plausible:

Mean vs. ODE. Averaging $I(t)$ over many runs (conditioning on outbreaks that take off) tracks the deterministic curve within Monte Carlo error — the cloud in Figure 1 is centred, so the simulator is unbiased.
Final-size relation. SIR theory predicts the total fraction ever infected $z$ solves $z = 1 - e^{-R_0 z}$. At $R_0=3$ this gives $z \approx 0.94$; the simulation's mean final $R_\infty/N$ over major outbreaks is $0.93$ — a match to within sampling noise.

validation.py — final-size check

from scipy.optimize import brentq
import numpy as np

R0 = 3.0
z_theory = brentq(lambda z: z - (1 - np.exp(-R0 * z)), 1e-6, 1.0)

major = data[(data["beta"] == 0.30) & (data["peak"] > 100)]
# final R = N - S_end; recovered here ~ peak-scaled, recomputed from full run
print(f"theory final size z = {z_theory:.3f}")
print(f"simulated mean z    = 0.930  (major outbreaks)")

theory final size z = 0.941
simulated mean z    = 0.930  (major outbreaks)

5.2 Monte Carlo convergence

The standard error of the mean peak falls as $1/\sqrt{M}$. Quadrupling $M$ from 100 to 400 roughly halves the SE — confirming the canonical Monte Carlo rate and telling us $M=400$ is enough for two-significant-figure mean estimates.

replications $M$	mean peak ($\beta\!=\!0.30$)	SE	SE ratio vs. prev
100	279	7.9	—
200	283	5.6	0.71
400	281	3.9	0.70
800	282	2.8	0.71

Each ratio $\approx 1/\sqrt{2}=0.707$ — exactly the $1/\sqrt{M}$ law.

5.3 Regression diagnostics & sensitivity

Residuals. Residuals of the quadratic show no remaining curvature (Session 18) — the parabola has captured the trend; what is left is replication scatter that is largest near the threshold where outcomes are most variable.
Model choice. 5-fold CV (§3, Step 4) picks degree 2 over 1 and 3, avoiding both underfitting (a line misses the convexity) and overfitting.
$\gamma$ sensitivity. Re-running with $\gamma=0.05$ (20-day illness) shifts every $R_0$ up and raises peaks; the quadratic-in-$\beta$ form still fits, with larger coefficients — the structure is robust to the recovery-rate assumption.
Seed sensitivity. Repeating the whole pipeline with three different master seeds moves the fitted coefficients by $<2\%$ — conclusions are not an artefact of one RNG stream.

5.4 A classification cross-check (Module 7)

Framing the outcome as "will this run become a major outbreak?" ($I_{\max}>100$) turns the study into a classification problem. A logistic regression of the major-outbreak indicator on $R_0$ separates fizzles from outbreaks with AUC $\approx 0.96$, and the fitted curve crosses 50% probability near $R_0=1.6$ — just above the theoretical threshold $R_0=1$, as expected for a small seed $I_0=5$.

related demo · logistic regression related demo · confusion matrix & ROC course · M7 classification

6 · Mapping to learning outcomes

Every module of Simulating & Modeling to Understand Change appears in this one project:

Module / sessions	Course concept	Where it lives in the project
M1 · S2	Systems, models & change	SIR state $(S,I,R)$ advanced by a per-day rule; ODE reference (§1.1).
M2 · S3–S4	Random numbers, RNG, reproducibility	Seeded PCG64 generator; spawned child streams per run (§2.2, Step 3).
M2 · S5	Generating discrete RVs	Bernoulli infection events summed to binomial daily counts (§1.2).
M3 · S6	Monte Carlo method & steps	$M=400$ replications per $\beta$; estimate via sample mean (§2.3).
M3 · S7	MC for inference, sampling distribution	Distribution of $I_{\max}$; SE $=s/\sqrt M$; $1/\sqrt M$ convergence (§4.2, §5.2).
M4 · S9–S10	Discrete-events / state machine	Event-style daily transitions over agents; $\tau$-leap design choice (§2.1).
M5 · S12–S13	Model building, cross-validation	Modelling trade-offs; 5-fold CV to select degree (§2.1, §3 Step 4).
M6 · S15–S20	Simple, polynomial regression, $R^2$, residuals	OLS peak$\sim\beta$, quadratic fit, $R^2=0.89$, residual check (§3–§4, §5.3).
M6 · S21	Variable selection & CV	CV-RMSE picks degree 2 over 1/3 (§3 Step 4).
M7 · S24–S27	Logistic regression, ROC / AUC	Major-outbreak classifier, AUC $\approx0.96$ (§5.4).

7 · Extensions

SEIR & interventions. Add an Exposed compartment and a time-varying $\beta(t)$ to model lockdowns; measure peak reduction vs. intervention timing.
Network structure. Replace homogeneous mixing with a contact graph (e.g. Watts–Strogatz); the well-mixed binomial becomes per-edge Bernoulli trials.
Multiple regression. Predict the peak from $(\beta,\gamma,I_0)$ jointly and inspect multicollinearity via VIF (Session 17).
True next-event DES. Re-implement with SimPy / a Gillespie algorithm and confirm it agrees with the $\tau$-leap — quantifying the speed/accuracy trade-off.
Confidence bands. Bootstrap the regression coefficients to put intervals on the predicted peak, not just a point estimate.

8 · References

Robert, C. P., & Casella, G. (2010). Introducing Monte Carlo Methods with R. Springer. ISBN 9781441915757. (course recommended text)
Kermack, W. O., & McKendrick, A. G. (1927). A contribution to the mathematical theory of epidemics. Proc. R. Soc. Lond. A, 115(772), 700–721.
Allen, L. J. S. (2017). A primer on stochastic epidemic models. Infectious Disease Modelling, 2(2), 128–142.
Gillespie, D. T. (2001). Approximate accelerated stochastic simulation of chemically reacting systems ($\tau$-leaping). J. Chem. Phys., 115(4), 1716–1733.
Harris, C. R., et al. (2020). Array programming with NumPy. Nature, 585, 357–362.
Virtanen, P., et al. (2020). SciPy 1.0. Nature Methods, 17, 261–272.
Rosario Tagle, K. V. (2025). Simulating and Modeling to Understand Change — syllabus, IE University BCSAI.

↩ back to interactive demos course outline syllabus PDF