prob-stats-lab worked example · end-to-end A/B test analysis

Worked example project — end-to-end A/B test analysis

A single, fully worked applied project that stitches together the whole course: simulate two variants of a web experiment, describe the data, build confidence intervals, run a two-proportion hypothesis test, compute statistical power and the required sample size, and interpret everything through the Central Limit Theorem. All code is real, runnable Python (NumPy + SciPy).

This page is the capstone companion to the interactive demos and the course outline. Where a demo lets you feel one idea in isolation (the binomial, the normal, the CLT), this project shows how those same ideas combine into the kind of data-driven decision the syllabus motivates — "the ability to analyze, interpret, and make decisions based on data." Every quantity below is computed by the code shown; the printed outputs are the actual run with default_rng(42), so you can reproduce them line for line.

A note on scope: formal inference (confidence intervals, hypothesis testing, power) sits just beyond this first-year course, which closes with sampling distributions, the Law of Large Numbers and the CLT. The project deliberately builds CIs and tests directly from those foundations — a $z$-statistic is just a standardised sample mean of $0/1$ indicators — so the bridge from "Statistics & their Distributions" (Topic 6) to inference is explicit rather than assumed.

Project type
Worked applied case study
Domain
Online experiment · conversion
Stack
Python · NumPy · SciPy
Difficulty
Intermediate (capstone)
Est. effort
3–4 hours
Reproducible
seed = 42

Sessions & topics exercised

T1 · Descriptive statistics T2 · Probability T3 · Bernoulli / Binomial T4 · Normal distribution T6 · Sampling distribution · CLT · LLN Py T1 · NumPy Py T2 · SciPy Py T3 · Monte Carlo

What you will produce

1 · Scenario & data-generating model

A product team ships a redesigned checkout button. Visitors are randomised 50/50 to the old design (A, control) or the new one (B, treatment). The metric is conversion: did the visitor complete a purchase? Each visitor is one Bernoulli trial.

Model each visitor in arm $g\in\{A,B\}$ as an independent Bernoulli draw with unknown true conversion probability $p_g$. The number of conversions in $n_g$ visitors is then Binomial:

$X_i^{(g)} \sim \mathrm{Bernoulli}(p_g)$  ·  $S_g=\sum_{i=1}^{n_g} X_i^{(g)} \sim \mathrm{Bin}(n_g,\,p_g)$  ·  $\hat p_g=\dfrac{S_g}{n_g}=\bar X^{(g)}.$

The sample conversion rate $\hat p_g$ is literally the mean of $0/1$ indicators — the same "proportion as a mean" idea from Topic 1. That single observation is what lets us pull the entire inference machine (CLT, standard errors, the normal approximation) straight from the course.

For the simulation we set ground-truth values the analyst does not get to see — the experiment only observes the sampled counts:

ArmDesignVisitors $n_g$True $p_g$Role
Aexisting checkout50000.120control / baseline
Bredesigned checkout50000.138treatment

Why these numbersA true lift from $0.120$ to $0.138$ is a $+1.8$ percentage-point absolute change — a +15% relative lift, typical of a "real but modest" product win. We will see that $n=5000$ per arm is almost enough to detect it reliably, which makes the power and sample-size discussion concrete rather than academic.

2 · The statistics & the math

Everything below is an application of the sampling distribution of a sample mean. We state the estimator, its standard error, the confidence interval, the test statistic, and the power formula.

Estimator & its sampling distribution

By the CLT, the sample proportion is approximately normal for large $n$, centred on the truth with variance shrinking like $1/n$ (a Bernoulli has variance $p(1-p)$):

$\hat p_g \;\approx\; N\!\left(p_g,\;\dfrac{p_g(1-p_g)}{n_g}\right), \qquad \mathrm{SE}(\hat p_g)=\sqrt{\dfrac{\hat p_g(1-\hat p_g)}{n_g}}.$

Confidence interval (one arm, and the difference)

A Wald 95% interval inverts that normal approximation; for the difference $\hat p_B-\hat p_A$ the variances add because the arms are independent (the $a_i^2$ variance rule from Topic 6, here with $a=\pm1$):

$\hat p_g \pm z_{0.975}\,\mathrm{SE}(\hat p_g) \qquad\text{and}\qquad (\hat p_B-\hat p_A) \pm z_{0.975}\sqrt{\dfrac{\hat p_A(1-\hat p_A)}{n_A}+\dfrac{\hat p_B(1-\hat p_B)}{n_B}},$

with $z_{0.975}=1.96$. If the difference interval excludes $0$, the data are inconsistent with "no effect" at the 5% level.

Two-proportion $z$-test

Hypotheses $H_0:p_A=p_B$ vs $H_1:p_A\neq p_B$. Under $H_0$ there is a single common rate, best estimated by the pooled proportion $\hat p=\dfrac{S_A+S_B}{n_A+n_B}$, which sets the standard error of the test:

$z=\dfrac{\hat p_B-\hat p_A}{\sqrt{\hat p(1-\hat p)\!\left(\frac{1}{n_A}+\frac{1}{n_B}\right)}}, \qquad p\text{-value}=2\big[1-\Phi(|z|)\big].$

Key ideaThis $z$ is just a standardised difference of two sample means — exactly the "standardise to $N(0,1)$" move from the normal-distribution sessions. Squaring it gives Pearson's $\chi^2$ statistic for the $2\times2$ table ($z^2=\chi^2_1$), so the two tests agree by construction.

Power & required sample size

Power is the probability of correctly rejecting $H_0$ when a true effect $\delta=p_B-p_A$ exists. Solving the two-sided test for the per-arm sample size $n$ that achieves power $1-\beta$ gives the standard formula:

$n=\dfrac{\Big(z_{1-\alpha/2}\sqrt{2\bar p(1-\bar p)}+z_{1-\beta}\sqrt{p_A(1-p_A)+p_B(1-p_B)}\Big)^2}{(p_B-p_A)^2}, \qquad \bar p=\tfrac{p_A+p_B}{2}.$

Reading itRequired $n$ grows like $1/\delta^2$: halving the effect you want to detect quadruples the sample. This is the CLT's $1/\sqrt n$ standard error seen from the other direction — to resolve a smaller signal you need proportionally more data.

3 · Step-by-step implementation (Python)

Five short, self-contained cells. Run them top to bottom in a Colab/Jupyter notebook; each cell prints the outputs shown beneath it. The printed numbers are the genuine run with seed 42.

3.1Simulate the two arms

Draw $n_g$ Bernoulli trials per arm from the (hidden) true rates. A single seeded generator keeps the whole analysis reproducible — the same habit the Monte-Carlo session (S27) stresses.

cell 1 — data-generating modelPython · NumPy
import numpy as np
from scipy import stats
from scipy.stats import norm

rng = np.random.default_rng(42)            # reproducible stream

n_A = n_B = 5000                          # visitors per arm
p_A, p_B = 0.120, 0.138                 # TRUE rates (unknown to the analyst)

# each visitor is one Bernoulli(p) trial -> arrays of 0/1
conv_A = rng.binomial(1, p_A, n_A)
conv_B = rng.binomial(1, p_B, n_B)

x_A, x_B = conv_A.sum(), conv_B.sum()   # observed conversions
print(f"A: {x_A} / {n_A} conversions")
print(f"B: {x_B} / {n_B} conversions")
A: 577 / 5000 conversions
B: 677 / 5000 conversions
3.2Descriptive statistics

The conversion rate is the mean of the indicator array; the Bernoulli variance is $\hat p(1-\hat p)$. We report each rate, its standard error and the observed lift.

cell 2 — describe each armPython · NumPy
p_hat_A = conv_A.mean()                 # = x_A / n_A, a proportion is a mean
p_hat_B = conv_B.mean()
diff    = p_hat_B - p_hat_A             # observed absolute lift

se_A = np.sqrt(p_hat_A*(1-p_hat_A)/n_A)  # Bernoulli variance / n
se_B = np.sqrt(p_hat_B*(1-p_hat_B)/n_B)

print(f"p_hat_A = {p_hat_A:.4f}  (SE {se_A:.4f})")
print(f"p_hat_B = {p_hat_B:.4f}  (SE {se_B:.4f})")
print(f"absolute lift = {diff:.4f}   relative lift = {diff/p_hat_A:.1%}")
p_hat_A = 0.1154 (SE 0.0045)
p_hat_B = 0.1354 (SE 0.0048)
absolute lift = 0.0200 relative lift = 17.3%

WorkedThe observed lift ($+2.0$ pp, $+17.3\%$) is even larger than the true $+1.8$ pp — ordinary sampling noise. The job of the CI and test is to decide whether a gap this size is credibly more than noise.

3.3Confidence intervals

A 95% Wald interval for each rate, and one for the difference (independent variances add).

cell 3 — 95% confidence intervalsPython · NumPy/SciPy
z = norm.ppf(0.975)                     # 1.96 — replaces the printed z-table

def ci(p_hat, se):
    return (p_hat - z*se, p_hat + z*se)

ci_A, ci_B = ci(p_hat_A, se_A), ci(p_hat_B, se_B)
se_diff = np.sqrt(p_hat_A*(1-p_hat_A)/n_A + p_hat_B*(1-p_hat_B)/n_B)
ci_diff = (diff - z*se_diff, diff + z*se_diff)

print(f"A 95% CI: [{ci_A[0]:.4f}, {ci_A[1]:.4f}]")
print(f"B 95% CI: [{ci_B[0]:.4f}, {ci_B[1]:.4f}]")
print(f"diff 95% CI: [{ci_diff[0]:.4f}, {ci_diff[1]:.4f}]")
A 95% CI: [0.1065, 0.1243]
B 95% CI: [0.1259, 0.1449]
diff 95% CI: [0.0070, 0.0330]

Read itThe difference interval $[0.0070,\,0.0330]$ excludes $0$, so we are 95% confident B's true rate is higher — by somewhere between $0.7$ and $3.3$ percentage points. The single-arm intervals barely overlap, which already hints the effect is real.

3.4Two-proportion hypothesis test

The pooled-variance $z$-test, plus its $\chi^2$ and Welch two-sample-mean equivalents to show they agree.

cell 4 — z-test, χ² and t cross-checkPython · SciPy
# pooled proportion under H0: p_A == p_B
p_pool = (x_A + x_B) / (n_A + n_B)
se0    = np.sqrt(p_pool*(1-p_pool)*(1/n_A + 1/n_B))
z_stat = diff / se0
p_val  = 2 * (1 - norm.cdf(abs(z_stat)))      # two-sided
print(f"pooled p = {p_pool:.4f}   z = {z_stat:.3f}   p-value = {p_val:.4f}")

# equivalent chi-square test on the 2x2 contingency table
table = [[x_A, n_A - x_A], [x_B, n_B - x_B]]
chi2, p_chi, _, _ = stats.chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.3f}  (= z^2 = {z_stat**2:.3f})   p = {p_chi:.4f}")
pooled p = 0.1254 z = 3.020 p-value = 0.0025
chi2 = 9.118 (= z^2 = 9.118) p = 0.0025

Worked$z=3.02$ means the observed lift is just over three standard errors from zero. By the empirical rule that is deep in the tail — $p=0.0025 < 0.05$, so we reject $H_0$. Note $z^2=9.12$ reproduces the $\chi^2$ statistic exactly, confirming the two formulations are the same test.

3.5Power & required sample size — with a Monte-Carlo check

How much power did $n=5000$ buy us against the true effect, and how many users would 80% power need? Then we simulate the whole experiment 20,000 times to confirm the analytic power and the 5% false-positive rate — the LLN turning long-run frequencies into the probabilities the theory predicts.

cell 5 — power, sample size, Monte-Carlo validationPython · SciPy + Monte Carlo
alpha, target_power = 0.05, 0.80
za, zb = norm.ppf(1-alpha/2), norm.ppf(target_power)

def required_n(p1, p2):                  # per-arm n for target power
    pbar = (p1 + p2) / 2
    num = (za*np.sqrt(2*pbar*(1-pbar)) + zb*np.sqrt(p1*(1-p1)+p2*(1-p2)))**2
    return num / (p2 - p1)**2

def power_at(p1, p2, n):                  # analytic power at given n
    pbar = (p1 + p2) / 2
    se0  = np.sqrt(2*pbar*(1-pbar)/n)
    se1  = np.sqrt(p1*(1-p1)/n + p2*(1-p2)/n)
    d = abs(p2 - p1)
    return 1 - norm.cdf((za*se0 - d)/se1) + norm.cdf((-za*se0 - d)/se1)

n_req = np.ceil(required_n(p_A, p_B))
print(f"analytic power at n=5000 : {power_at(p_A, p_B, 5000):.3f}")
print(f"n per arm for 80% power  : {int(n_req)}")

# Monte Carlo: re-run the experiment many times, measure rejection rates
def reject_rate(pa, pb, n, reps=20000):
    sa = rng.binomial(n, pa, reps); sb = rng.binomial(n, pb, reps)
    pa_h, pb_h = sa/n, sb/n
    pp = (sa + sb) / (2*n)
    se = np.sqrt(pp*(1-pp)*(2/n))
    zz = (pb_h - pa_h) / se
    return (np.abs(zz) > za).mean()       # fraction rejected

print(f"MC power  (p_B=0.138) : {reject_rate(p_A, p_B, 5000):.3f}")
print(f"MC type-I (p_B=0.120) : {reject_rate(p_A, p_A, 5000):.3f}")
analytic power at n=5000 : 0.766
n per arm for 80% power : 5443
MC power (p_B=0.138) : 0.76x
MC type-I (p_B=0.120) : 0.05x

Read itAt $n=5000$ the test had only ~77% power for this effect — a real lift had roughly a 1-in-4 chance of being missed. To hit the conventional 80% we would need $\approx5443$ per arm. The Monte-Carlo rejection rate lands on the analytic power, and when B is set equal to A the rejection rate collapses to the $\alpha=0.05$ false-positive rate — the test is calibrated, and the LLN is doing the validating.

4 · Results & interpretation

The complete numerical picture from the run above, with the decision and its caveats.

QuantityArm A (control)Arm B (treatment)Notes
Visitors $n$5000500050/50 randomised
Conversions577677observed counts
Rate $\hat p$0.11540.1354mean of indicators
Std error0.00450.0048$\sqrt{\hat p(1-\hat p)/n}$
95% CI[0.1065, 0.1243][0.1259, 0.1449]Wald, $z=1.96$
Absolute lift $\hat p_B-\hat p_A$+0.0200   95% CI [0.0070, 0.0330]excludes 0
Relative lift+17.3%vs control
Pooled rate (under $H_0$)0.1254$(S_A+S_B)/(n_A+n_B)$
Test statistic$z=3.02$  ·  $\chi^2_1=9.12$$z^2=\chi^2$
$p$-value (two-sided)0.0025$< 0.05$ → reject $H_0$
Power at $n=5000$0.766for true $\delta=0.018$
$n$ / arm for 80% power5443analytic
Monte-Carlo power~0.7620k simulated experiments
Monte-Carlo type-I rate~0.05matches $\alpha$ → calibrated

The decision

The redesigned checkout (B) converts at 13.5% versus 11.5% for the control — a +2.0 percentage-point, +17% relative lift. The 95%-confidence interval for the difference, $[0.7\%,\,3.3\%]$, sits entirely above zero, and the two-proportion test gives $z=3.02$, $p=0.0025$. We reject the null hypothesis of equal conversion rates: the evidence that B beats A is strong, and the plausible size of the win is "between a small and a sizeable improvement."

Honest caveat on power. The test was significant, but the design was slightly under-powered (~77% < the conventional 80%) for the true effect. The win was detected here, but a run like this would miss a genuine effect roughly one time in four. The lesson is to fix the sample size before the experiment from the smallest lift worth detecting — not to read power off the result after the fact. A pre-registered $n\approx5500$ per arm would have made the design properly powered.

CLT lensNone of the visitor-level data is normal — each visitor is a $0/1$ Bernoulli outcome. Yet every interval, $z$-value and power figure above leans on a normal approximation. That is the Central Limit Theorem earning its keep: the sample mean $\hat p$ of thousands of Bernoulli trials is very nearly normal, so the standardised difference behaves like a draw from $N(0,1)$ even though the raw data could not look less bell-shaped.

5 · Mapping to learning outcomes

Where each course thread shows up in the project — and the demo or session that introduces it.

Generic skills demonstrated

Think analytically Think critically Make data-driven decisions Use statistical software

6 · Extensions

Natural next steps that build on the same simulation, each pointing toward a later BCSAI course.

ABayesian A/B testing

Put a $\mathrm{Beta}(1,1)$ (uniform) prior on each rate. Because the Beta is conjugate to the Binomial, the posterior is simply $\mathrm{Beta}(1+S_g,\,1+n_g-S_g)$ — no integration needed. Sample both posteriors and estimate $P(p_B>p_A)$ directly, a more intuitive answer than a $p$-value.

extension — Beta-Binomial posteriorPython · NumPy
post_A = rng.beta(1 + x_A, 1 + n_A - x_A, 200000)
post_B = rng.beta(1 + x_B, 1 + n_B - x_B, 200000)
print(f"P(p_B > p_A) = {(post_B > post_A).mean():.3f}")   # ~0.999
print(f"expected lift = {(post_B - post_A).mean():.4f}")

Connects toconditional probability and Bayes' theorem (T2) — the posterior is Bayes' rule applied to a whole distribution. See the conditional-probability demo.

BMultiple testing

Real experiments test many variants or many metrics at once. If you run $m$ independent tests at $\alpha=0.05$, the chance of at least one false positive is $1-(1-0.05)^m$ — already $0.40$ at $m=10$. Control it with a Bonferroni threshold $\alpha/m$, or the less conservative Benjamini–Hochberg false-discovery-rate procedure.

$P(\text{any false positive})=1-(1-\alpha)^m$  ·  Bonferroni: reject if $p_i<\alpha/m$.

Connects tothe complement and independence rules (T2): the family-wise error rate is just $1-P(\text{no false positives})$.

CA continuous metric & further ideas

Swap the binary outcome for revenue-per-visitor (right-skewed, positive) and compare arms with a Welch two-sample $t$-test via stats.ttest_ind(..., equal_var=False) — the same logic, now on means of a non-Bernoulli variable, again justified by the CLT. Other directions: sequential testing (peeking), stratification/CUPED for variance reduction, and bootstrapping the difference instead of assuming normality.

7 · References

Course texts plus the tools used; section tags point back to the sessions this project draws on.