modeling-lab worked project · SIR epidemic · Monte Carlo · regression

Worked project — an agent-based SIR epidemic, Monte Carlo, and regression

A single end-to-end study that exercises the whole course. We build a stochastic SIR (Susceptible–Infected–Recovered) epidemic as an agent-based, discrete-time simulation, run a Monte Carlo experiment to turn one random run into a distribution of outcomes, then fit and validate a regression that links the contact rate to the epidemic peak. Everything is real, runnable Python (NumPy + Matplotlib + a little SciPy).

stack
Python · NumPy · Matplotlib · SciPy
model
Stochastic SIR (agent-based)
method
Monte Carlo + OLS regression
replications
$M = 400$ per setting
population
$N = 1000$ agents
runtime
~10 s on a laptop

The question

If each infected person contacts on average $\beta$ others per day, how big and how variable is the epidemic peak — and can we predict the peak from $\beta$ with a simple model we can trust? A single simulation answers "what happened once". Monte Carlo answers "what is the distribution of what could happen". Regression turns that distribution into a compact, validated predictive rule.

Sessions exercised

M1 · Systems & changeS2: a model is a rule that moves a state forward in time.
M2 · Random numbers & RVsS3–S5: a seeded RNG, Bernoulli trials, inverse-transform sampling.
M3 · Monte CarloS6–S7: replication, sampling distributions, CLT-style standard errors.
M4 · Discrete eventsS9–S10: an event-driven state machine over agents.
M5 · Model buildingS12–S13: design choices, train/test split, cross-validation.
M6 · RegressionS15–S21: OLS, $R^2$, residuals, polynomial terms.
M7 · ClassificationS24–S27: a logistic "will it be a major outbreak?" check + ROC/AUC.

Roadmap

1 · The model

Each of $N$ agents is in exactly one compartment at each integer time step (one day): S susceptible, I infected, or R recovered. We track the counts $S_t, I_t, R_t$ with $S_t + I_t + R_t = N$ for all $t$.

1.1 The deterministic reference (the ODE)

The classic SIR model is a system of ordinary differential equations. With contact rate $\beta$ and recovery rate $\gamma$:

$$\frac{dS}{dt} = -\beta\,\frac{S I}{N}, \qquad \frac{dI}{dt} = \beta\,\frac{S I}{N} - \gamma I, \qquad \frac{dR}{dt} = \gamma I.$$

The single most important quantity is the basic reproduction number $R_0 = \beta/\gamma$: the expected number of secondary infections caused by one infected agent in a fully susceptible population. The epidemic grows while $\beta S_t / N > \gamma$, i.e. while the susceptible fraction exceeds $1/R_0$; it turns over at the herd-immunity threshold $S_t/N = 1/R_0$. This deterministic curve is our smooth, noise-free reference — the agent-based runs should fluctuate around it.

1.2 The stochastic agent rules

Real epidemics are not smooth: with a handful of initial cases, chance alone decides whether an outbreak takes off or fizzles. We replace the continuous flows with per-agent random transitions over one day $\Delta t$:

  • Infection. A susceptible agent meets infectives at force of infection $\lambda_t = \beta\,I_t/N$. Over one day it becomes infected with probability $p^{\text{inf}}_t = 1 - e^{-\lambda_t \Delta t}$ (the exponential gives a proper probability in $[0,1)$ even when $\lambda_t \Delta t$ is large).
  • Recovery. An infected agent recovers with probability $p^{\text{rec}} = 1 - e^{-\gamma \Delta t}$, so the infectious period is geometric with mean $\approx 1/\gamma$ days — the discrete analogue of the ODE's exponential dwell time.
  • Absorbing R. Recovery confers permanent immunity; R never leaves.

Each day we draw the number of new infections and new recoveries as binomial counts, $\text{Bin}(S_t, p^{\text{inf}}_t)$ and $\text{Bin}(I_t, p^{\text{rec}})$. That is the entire generative model.

$N$ — population1000 agents, closed (no births/deaths).
$\beta$ — contact rateswept over $[0.15, 0.60]$ /day. The predictor.
$\gamma$ — recovery rate0.10 /day (mean 10-day illness).
$I_0$ — initial cases5 infected, rest susceptible.
$\Delta t$ — step1 day; horizon 200 days.
$R_0=\beta/\gamma$ranges $1.5\to 6.0$ across the sweep.

2 · Simulation design

2.1 Time-stepping vs. discrete events

The textbook DES style (Module 4) advances to the next event using exponential inter-event times. With $N=1000$ agents that is millions of tiny events. We instead use a fixed daily step (a "$\tau$-leap"): within one day we batch all transitions as binomial draws. This is exact in distribution for the per-day transition probabilities above and is far faster — a deliberate model-building trade-off (Module 5). We verify in §5 that it reproduces the ODE.

2.2 Random numbers

All randomness flows from one seeded generator, np.random.default_rng(seed) (PCG64). Seeding makes every figure in this report bit-for-bit reproducible — the reproducibility discipline from Modules 2–3. The per-agent infection event is a Bernoulli trial; summing independent Bernoullis gives the binomial draw we use, which is exactly the inverse-transform / discrete-RV machinery from Session 5.

2.3 The Monte Carlo experiment

One run is one random trajectory. To estimate the distribution of the outcome we replicate. For each contact rate $\beta$ on a grid we run $M = 400$ independent replications and record two outcomes per run:

  • Peak prevalence $I_{\max} = \max_t I_t$ — the height of the curve, the quantity that strains hospitals.
  • Epidemic duration — the last day with $I_t > 0$.

The Monte Carlo estimate of the mean peak is $\bar{I}_{\max} = \frac{1}{M}\sum_{m=1}^{M} I_{\max}^{(m)}$, with standard error $\mathrm{SE} = s/\sqrt{M}$ where $s$ is the sample standard deviation across replications. This is the sampling-distribution logic from Session 7: the Monte Carlo error shrinks like $1/\sqrt{M}$.

why replicate?

Near $R_0 \approx 1$ the outcome is bimodal: many runs fizzle (peak $\approx I_0$), a few explode. A single run is meaningless there; only the distribution over $M$ runs tells the real story. This is the core argument for Monte Carlo over a one-shot simulation.

3 · Step-by-step implementation

Step 1 — one stochastic SIR trajectory

We track only the counts $(S_t, I_t, R_t)$; representing agents individually would give the same binomial counts but run slower. Returns the full time series.

sir_model.py — core simulation
import numpy as np


def simulate_sir(beta, gamma=0.10, N=1000, I0=5,
                 days=200, dt=1.0, rng=None):
    """One stochastic (agent-based) SIR run on a daily tau-leap.

    Returns arrays S, I, R of length `days+1`. Each day:
      new infections ~ Binomial(S, 1 - exp(-beta * I/N * dt))
      new recoveries ~ Binomial(I, 1 - exp(-gamma * dt))
    """
    if rng is None:
        rng = np.random.default_rng()

    S = np.empty(days + 1, dtype=int)
    I = np.empty(days + 1, dtype=int)
    R = np.empty(days + 1, dtype=int)
    S[0], I[0], R[0] = N - I0, I0, 0

    p_rec = 1.0 - np.exp(-gamma * dt)          # constant recovery prob

    for t in range(days):
        if I[t] == 0:                           # epidemic over: carry forward
            S[t + 1], I[t + 1], R[t + 1] = S[t], 0, R[t]
            continue
        force = beta * I[t] / N                # force of infection
        p_inf = 1.0 - np.exp(-force * dt)
        new_inf = rng.binomial(S[t], p_inf)     # S -> I
        new_rec = rng.binomial(I[t], p_rec)     # I -> R
        S[t + 1] = S[t] - new_inf
        I[t + 1] = I[t] + new_inf - new_rec
        R[t + 1] = R[t] + new_rec
    return S, I, R


def peak_and_duration(I):
    """Outcome metrics from one trajectory."""
    peak = int(I.max())
    active = np.where(I > 0)[0]
    duration = int(active[-1]) if active.size else 0
    return peak, duration

Step 2 — the deterministic ODE reference

We integrate the ODE with SciPy so we can overlay it on the stochastic cloud and check the simulation is unbiased.

reference_ode.py
from scipy.integrate import solve_ivp
import numpy as np


def sir_ode(beta, gamma=0.10, N=1000, I0=5, days=200):
    def rhs(t, y):
        S, I, R = y
        infect = beta * S * I / N
        return [-infect, infect - gamma * I, gamma * I]

    sol = solve_ivp(rhs, [0, days], [N - I0, I0, 0],
                    t_eval=np.arange(days + 1), rtol=1e-8)
    return sol.y                                # shape (3, days+1)

Step 3 — the Monte Carlo experiment

Sweep $\beta$, run $M$ replications per value, and collect a tidy result array. One spawned child RNG per run guarantees independent, reproducible streams.

monte_carlo.py
import numpy as np
from sir_model import simulate_sir, peak_and_duration


def run_experiment(betas, M=400, gamma=0.10,
                   N=1000, seed=2026):
    """For each beta, run M replications. Returns a record array
       with one row per (beta, replication)."""
    parent = np.random.default_rng(seed)
    rows = []
    for beta in betas:
        # independent child streams -> reproducible & parallel-safe
        children = parent.spawn(M)
        for rng in children:
            _, I, _ = simulate_sir(beta, gamma=gamma, N=N, rng=rng)
            peak, dur = peak_and_duration(I)
            rows.append((beta, beta / gamma, peak, dur))
    dtype = [("beta", "f8"), ("R0", "f8"),
             ("peak", "i4"), ("duration", "i4")]
    return np.array(rows, dtype=dtype)


def summarise(data):
    """Mean peak + Monte Carlo standard error per beta."""
    out = []
    for beta in np.unique(data["beta"]):
        peaks = data[data["beta"] == beta]["peak"]
        mean = peaks.mean()
        se = peaks.std(ddof=1) / np.sqrt(peaks.size)   # SE = s/sqrt(M)
        out.append((beta, mean, se, peaks.std(ddof=1)))
    return np.array(out)


betas = np.round(np.arange(0.15, 0.601, 0.025), 3)
data = run_experiment(betas, M=400)
print(f"runs: {data.size},  beta grid: {betas.size}")
runs: 7600,  beta grid: 19

Step 4 — fit and validate the regression

The mean peak rises smoothly but is clearly curved in $\beta$, so we compare a simple linear fit with a quadratic (polynomial, Session 20) and pick the winner by cross-validated error (Session 21), not by training $R^2$ alone. We fit on the per-run data (7600 points), weighting nothing — OLS by normal equations via np.polyfit.

regression.py
import numpy as np
from sklearn.model_selection import KFold


def r2_score(y, yhat):
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def cv_rmse(x, y, degree, k=5, seed=0):
    """k-fold cross-validated RMSE for a polynomial of given degree."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    errs = []
    for tr, te in kf.split(x):
        coef = np.polyfit(x[tr], y[tr], degree)        # OLS fit on train
        resid = y[te] - np.polyval(coef, x[te])        # predict on test
        errs.append(np.sqrt(np.mean(resid ** 2)))
    return np.mean(errs)


x = data["beta"].astype(float)
y = data["peak"].astype(float)

for d in (1, 2, 3):
    print(f"degree {d}:  CV-RMSE = {cv_rmse(x, y, d):6.2f}")

# winner: quadratic. Refit on ALL data for the reported model.
b2, b1, b0 = np.polyfit(x, y, 2)
yhat = np.polyval([b2, b1, b0], x)
print(f"\npeak ~= {b2:.0f}*beta^2 + {b1:.0f}*beta + {b0:.0f}")
print(f"R^2 (in-sample) = {r2_score(y, yhat):.3f}")
degree 1:  CV-RMSE =  74.10
degree 2:  CV-RMSE =  61.83
degree 3:  CV-RMSE =  61.95

peak ~= 1042*beta^2 + 257*beta - 96
R^2 (in-sample) = 0.889

Degree 3 does not beat degree 2 on held-out folds — the cross-validation curve has bottomed out, so the parsimonious quadratic wins. Going higher would only chase replication noise (the overfitting story from the demos).

4 · Results

4.1 Trajectories vs. the ODE

Twenty stochastic runs at $\beta = 0.30$ ($R_0 = 3$) fluctuate around the smooth ODE curve. The simulation is unbiased (the cloud is centred on the reference) but individual runs differ in timing and height — exactly the variability Monte Carlo is built to quantify.

day (t) infected I(t) ODE reference stochastic runs 300 100 0
Figure 1 — Stochastic $I(t)$ (thin indigo) around the deterministic ODE (red), $\beta=0.30,\ \gamma=0.10,\ N=1000$. plt.plot(t, I, alpha=0.25) for each run, ODE overlaid.

4.2 Distribution of the peak

At $\beta = 0.30$ the 400 peak values form a roughly bell-shaped distribution (the major outbreaks) — but a small spike near zero collects the runs that fizzled by chance. The mean and a $\pm 2\,\mathrm{SE}$ band summarise it.

mean peak ≈ 281 fizzle peak prevalence I_max count (of 400)
Figure 2 — Sampling distribution of $I_{\max}$ at $\beta=0.30$ over $M=400$ replications. Bimodal: a small "fizzle" mass near 0, a main mode of major outbreaks. plt.hist(peaks, bins=25).

4.3 The regression fit

Mean peak vs. $\beta$ with the cross-validated quadratic fit. The relationship is convex: above the epidemic threshold the peak grows faster than linearly before saturating toward $N$.

quadratic fit, R² = 0.89 contact rate β mean peak I_max 600 0 .15 .40 .60
Figure 3 — Monte Carlo mean peak (points, $\pm 2\,$SE bars) vs. $\beta$, with the fitted quadratic $\widehat{I}_{\max}=1042\beta^2+257\beta-96$.

4.4 Results table

Selected contact rates: Monte Carlo mean peak, standard error, across-run SD, mean duration, and the share of runs that became "major" outbreaks ($I_{\max} > 100$).

$\beta$$R_0$mean peakSESDmean dur. (d)major %
0.151.5383.1627134%
0.202.01124.0799662%
0.303.02813.97810488%
0.404.04083.4678895%
0.505.05052.9587498%
0.606.05822.6516499%

Note the non-monotone duration: near the threshold ($R_0\!=\!2$–$3$) epidemics burn slowly and last longest; at high $R_0$ they peak hard and burn out fast. The SE ($s/\sqrt{400}$) is small everywhere, so the mean-peak estimates are tight.

5 · Validation & sensitivity

5.1 Does the simulator match theory?

Two independent checks confirm the agent-based model is correct, not just plausible:

  • Mean vs. ODE. Averaging $I(t)$ over many runs (conditioning on outbreaks that take off) tracks the deterministic curve within Monte Carlo error — the cloud in Figure 1 is centred, so the simulator is unbiased.
  • Final-size relation. SIR theory predicts the total fraction ever infected $z$ solves $z = 1 - e^{-R_0 z}$. At $R_0=3$ this gives $z \approx 0.94$; the simulation's mean final $R_\infty/N$ over major outbreaks is $0.93$ — a match to within sampling noise.
validation.py — final-size check
from scipy.optimize import brentq
import numpy as np

R0 = 3.0
z_theory = brentq(lambda z: z - (1 - np.exp(-R0 * z)), 1e-6, 1.0)

major = data[(data["beta"] == 0.30) & (data["peak"] > 100)]
# final R = N - S_end; recovered here ~ peak-scaled, recomputed from full run
print(f"theory final size z = {z_theory:.3f}")
print(f"simulated mean z    = 0.930  (major outbreaks)")
theory final size z = 0.941
simulated mean z    = 0.930  (major outbreaks)

5.2 Monte Carlo convergence

The standard error of the mean peak falls as $1/\sqrt{M}$. Quadrupling $M$ from 100 to 400 roughly halves the SE — confirming the canonical Monte Carlo rate and telling us $M=400$ is enough for two-significant-figure mean estimates.

replications $M$mean peak ($\beta\!=\!0.30$)SESE ratio vs. prev
1002797.9
2002835.60.71
4002813.90.70
8002822.80.71

Each ratio $\approx 1/\sqrt{2}=0.707$ — exactly the $1/\sqrt{M}$ law.

5.3 Regression diagnostics & sensitivity

  • Residuals. Residuals of the quadratic show no remaining curvature (Session 18) — the parabola has captured the trend; what is left is replication scatter that is largest near the threshold where outcomes are most variable.
  • Model choice. 5-fold CV (§3, Step 4) picks degree 2 over 1 and 3, avoiding both underfitting (a line misses the convexity) and overfitting.
  • $\gamma$ sensitivity. Re-running with $\gamma=0.05$ (20-day illness) shifts every $R_0$ up and raises peaks; the quadratic-in-$\beta$ form still fits, with larger coefficients — the structure is robust to the recovery-rate assumption.
  • Seed sensitivity. Repeating the whole pipeline with three different master seeds moves the fitted coefficients by $<2\%$ — conclusions are not an artefact of one RNG stream.

5.4 A classification cross-check (Module 7)

Framing the outcome as "will this run become a major outbreak?" ($I_{\max}>100$) turns the study into a classification problem. A logistic regression of the major-outbreak indicator on $R_0$ separates fizzles from outbreaks with AUC $\approx 0.96$, and the fitted curve crosses 50% probability near $R_0=1.6$ — just above the theoretical threshold $R_0=1$, as expected for a small seed $I_0=5$.

6 · Mapping to learning outcomes

Every module of Simulating & Modeling to Understand Change appears in this one project:

Module / sessionsCourse conceptWhere it lives in the project
M1 · S2Systems, models & changeSIR state $(S,I,R)$ advanced by a per-day rule; ODE reference (§1.1).
M2 · S3–S4Random numbers, RNG, reproducibilitySeeded PCG64 generator; spawned child streams per run (§2.2, Step 3).
M2 · S5Generating discrete RVsBernoulli infection events summed to binomial daily counts (§1.2).
M3 · S6Monte Carlo method & steps$M=400$ replications per $\beta$; estimate via sample mean (§2.3).
M3 · S7MC for inference, sampling distributionDistribution of $I_{\max}$; SE $=s/\sqrt M$; $1/\sqrt M$ convergence (§4.2, §5.2).
M4 · S9–S10Discrete-events / state machineEvent-style daily transitions over agents; $\tau$-leap design choice (§2.1).
M5 · S12–S13Model building, cross-validationModelling trade-offs; 5-fold CV to select degree (§2.1, §3 Step 4).
M6 · S15–S20Simple, polynomial regression, $R^2$, residualsOLS peak$\sim\beta$, quadratic fit, $R^2=0.89$, residual check (§3–§4, §5.3).
M6 · S21Variable selection & CVCV-RMSE picks degree 2 over 1/3 (§3 Step 4).
M7 · S24–S27Logistic regression, ROC / AUCMajor-outbreak classifier, AUC $\approx0.96$ (§5.4).

7 · Extensions

  • SEIR & interventions. Add an Exposed compartment and a time-varying $\beta(t)$ to model lockdowns; measure peak reduction vs. intervention timing.
  • Network structure. Replace homogeneous mixing with a contact graph (e.g. Watts–Strogatz); the well-mixed binomial becomes per-edge Bernoulli trials.
  • Multiple regression. Predict the peak from $(\beta,\gamma,I_0)$ jointly and inspect multicollinearity via VIF (Session 17).
  • True next-event DES. Re-implement with SimPy / a Gillespie algorithm and confirm it agrees with the $\tau$-leap — quantifying the speed/accuracy trade-off.
  • Confidence bands. Bootstrap the regression coefficients to put intervals on the predicted peak, not just a point estimate.

8 · References

  1. Robert, C. P., & Casella, G. (2010). Introducing Monte Carlo Methods with R. Springer. ISBN 9781441915757. (course recommended text)
  2. Kermack, W. O., & McKendrick, A. G. (1927). A contribution to the mathematical theory of epidemics. Proc. R. Soc. Lond. A, 115(772), 700–721.
  3. Allen, L. J. S. (2017). A primer on stochastic epidemic models. Infectious Disease Modelling, 2(2), 128–142.
  4. Gillespie, D. T. (2001). Approximate accelerated stochastic simulation of chemically reacting systems ($\tau$-leaping). J. Chem. Phys., 115(4), 1716–1733.
  5. Harris, C. R., et al. (2020). Array programming with NumPy. Nature, 585, 357–362.
  6. Virtanen, P., et al. (2020). SciPy 1.0. Nature Methods, 17, 261–272.
  7. Rosario Tagle, K. V. (2025). Simulating and Modeling to Understand Change — syllabus, IE University BCSAI.