prob-stats-lab · worked example project — end-to-end A/B test analysis

Worked example project — end-to-end A/B test analysis

A single, fully worked applied project that stitches together the whole course: simulate two variants of a web experiment, describe the data, build confidence intervals, run a two-proportion hypothesis test, compute statistical power and the required sample size, and interpret everything through the Central Limit Theorem. All code is real, runnable Python (NumPy + SciPy).

This page is the capstone companion to the interactive demos and the course outline. Where a demo lets you feel one idea in isolation (the binomial, the normal, the CLT), this project shows how those same ideas combine into the kind of data-driven decision the syllabus motivates — "the ability to analyze, interpret, and make decisions based on data." Every quantity below is computed by the code shown; the printed outputs are the actual run with default_rng(42), so you can reproduce them line for line.

A note on scope: formal inference (confidence intervals, hypothesis testing, power) sits just beyond this first-year course, which closes with sampling distributions, the Law of Large Numbers and the CLT. The project deliberately builds CIs and tests directly from those foundations — a $z$-statistic is just a standardised sample mean of $0/1$ indicators — so the bridge from "Statistics & their Distributions" (Topic 6) to inference is explicit rather than assumed.

Project type

Worked applied case study

Domain

Online experiment · conversion

Stack

Python · NumPy · SciPy

Difficulty

Intermediate (capstone)

Est. effort

3–4 hours

Reproducible

seed = 42

Sessions & topics exercised

T1 · Descriptive statistics T2 · Probability T3 · Bernoulli / Binomial T4 · Normal distribution T6 · Sampling distribution · CLT · LLN Py T1 · NumPy Py T2 · SciPy Py T3 · Monte Carlo

What you will produce

A reproducible data-generating model for two experiment arms (control A vs treatment B).
Descriptive summaries and 95% confidence intervals for each conversion rate and for their difference.
A two-proportion $z$-test (with its $\chi^2$ and two-sample-mean equivalents) and a clear decision.
A power calculation and the required sample size to detect the effect with 80% power.
A Monte-Carlo check of the test's false-positive and detection rates — the LLN validating the theory.

1 · Scenario & data-generating model

A product team ships a redesigned checkout button. Visitors are randomised 50/50 to the old design (A, control) or the new one (B, treatment). The metric is conversion: did the visitor complete a purchase? Each visitor is one Bernoulli trial.

Model each visitor in arm $g\in\{A,B\}$ as an independent Bernoulli draw with unknown true conversion probability $p_g$. The number of conversions in $n_g$ visitors is then Binomial:

$X_i^{(g)} \sim \mathrm{Bernoulli}(p_g)$ · $S_g=\sum_{i=1}^{n_g} X_i^{(g)} \sim \mathrm{Bin}(n_g,\,p_g)$ · $\hat p_g=\dfrac{S_g}{n_g}=\bar X^{(g)}.$

The sample conversion rate $\hat p_g$ is literally the mean of $0/1$ indicators — the same "proportion as a mean" idea from Topic 1. That single observation is what lets us pull the entire inference machine (CLT, standard errors, the normal approximation) straight from the course.

For the simulation we set ground-truth values the analyst does not get to see — the experiment only observes the sampled counts:

Arm	Design	Visitors $n_g$	True $p_g$	Role
A	existing checkout	5000	0.120	control / baseline
B	redesigned checkout	5000	0.138	treatment

Why these numbersA true lift from $0.120$ to $0.138$ is a $+1.8$ percentage-point absolute change — a +15% relative lift, typical of a "real but modest" product win. We will see that $n=5000$ per arm is almost enough to detect it reliably, which makes the power and sample-size discussion concrete rather than academic.

↳ demo: binomial distribution ↳ demo: Law of Large Numbers ↳ course: S10 Bernoulli & Binomial

2 · The statistics & the math

Everything below is an application of the sampling distribution of a sample mean. We state the estimator, its standard error, the confidence interval, the test statistic, and the power formula.

Estimator & its sampling distribution

By the CLT, the sample proportion is approximately normal for large $n$, centred on the truth with variance shrinking like $1/n$ (a Bernoulli has variance $p(1-p)$):

$\hat p_g \;\approx\; N\!\left(p_g,\;\dfrac{p_g(1-p_g)}{n_g}\right), \qquad \mathrm{SE}(\hat p_g)=\sqrt{\dfrac{\hat p_g(1-\hat p_g)}{n_g}}.$

Confidence interval (one arm, and the difference)

A Wald 95% interval inverts that normal approximation; for the difference $\hat p_B-\hat p_A$ the variances add because the arms are independent (the $a_i^2$ variance rule from Topic 6, here with $a=\pm1$):

$\hat p_g \pm z_{0.975}\,\mathrm{SE}(\hat p_g) \qquad\text{and}\qquad (\hat p_B-\hat p_A) \pm z_{0.975}\sqrt{\dfrac{\hat p_A(1-\hat p_A)}{n_A}+\dfrac{\hat p_B(1-\hat p_B)}{n_B}},$

with $z_{0.975}=1.96$. If the difference interval excludes $0$, the data are inconsistent with "no effect" at the 5% level.

Two-proportion $z$-test

Hypotheses $H_0:p_A=p_B$ vs $H_1:p_A\neq p_B$. Under $H_0$ there is a single common rate, best estimated by the pooled proportion $\hat p=\dfrac{S_A+S_B}{n_A+n_B}$, which sets the standard error of the test:

$z=\dfrac{\hat p_B-\hat p_A}{\sqrt{\hat p(1-\hat p)\!\left(\frac{1}{n_A}+\frac{1}{n_B}\right)}}, \qquad p\text{-value}=2\big[1-\Phi(|z|)\big].$

Key ideaThis $z$ is just a standardised difference of two sample means — exactly the "standardise to $N(0,1)$" move from the normal-distribution sessions. Squaring it gives Pearson's $\chi^2$ statistic for the $2\times2$ table ($z^2=\chi^2_1$), so the two tests agree by construction.

Power & required sample size

Power is the probability of correctly rejecting $H_0$ when a true effect $\delta=p_B-p_A$ exists. Solving the two-sided test for the per-arm sample size $n$ that achieves power $1-\beta$ gives the standard formula:

$n=\dfrac{\Big(z_{1-\alpha/2}\sqrt{2\bar p(1-\bar p)}+z_{1-\beta}\sqrt{p_A(1-p_A)+p_B(1-p_B)}\Big)^2}{(p_B-p_A)^2}, \qquad \bar p=\tfrac{p_A+p_B}{2}.$

Reading itRequired $n$ grows like $1/\delta^2$: halving the effect you want to detect quadruples the sample. This is the CLT's $1/\sqrt n$ standard error seen from the other direction — to resolve a smaller signal you need proportionally more data.

↳ demo: Central Limit Theorem ↳ demo: normal & the empirical rule ↳ course: S23–S24 sampling distribution & CLT

3 · Step-by-step implementation (Python)

Five short, self-contained cells. Run them top to bottom in a Colab/Jupyter notebook; each cell prints the outputs shown beneath it. The printed numbers are the genuine run with seed 42.

3.1Simulate the two arms

Draw $n_g$ Bernoulli trials per arm from the (hidden) true rates. A single seeded generator keeps the whole analysis reproducible — the same habit the Monte-Carlo session (S27) stresses.

cell 1 — data-generating modelPython · NumPy

import numpy as np
from scipy import stats
from scipy.stats import norm

rng = np.random.default_rng(42)            # reproducible stream

n_A = n_B = 5000                          # visitors per arm
p_A, p_B = 0.120, 0.138                 # TRUE rates (unknown to the analyst)

# each visitor is one Bernoulli(p) trial -> arrays of 0/1
conv_A = rng.binomial(1, p_A, n_A)
conv_B = rng.binomial(1, p_B, n_B)

x_A, x_B = conv_A.sum(), conv_B.sum()   # observed conversions
print(f"A: {x_A} / {n_A} conversions")
print(f"B: {x_B} / {n_B} conversions")

A: 577 / 5000 conversions
B: 677 / 5000 conversions

3.2Descriptive statistics

The conversion rate is the mean of the indicator array; the Bernoulli variance is $\hat p(1-\hat p)$. We report each rate, its standard error and the observed lift.

cell 2 — describe each armPython · NumPy

p_hat_A = conv_A.mean()                 # = x_A / n_A, a proportion is a mean
p_hat_B = conv_B.mean()
diff    = p_hat_B - p_hat_A             # observed absolute lift

se_A = np.sqrt(p_hat_A*(1-p_hat_A)/n_A)  # Bernoulli variance / n
se_B = np.sqrt(p_hat_B*(1-p_hat_B)/n_B)

print(f"p_hat_A = {p_hat_A:.4f}  (SE {se_A:.4f})")
print(f"p_hat_B = {p_hat_B:.4f}  (SE {se_B:.4f})")
print(f"absolute lift = {diff:.4f}   relative lift = {diff/p_hat_A:.1%}")

p_hat_A = 0.1154 (SE 0.0045)
p_hat_B = 0.1354 (SE 0.0048)
absolute lift = 0.0200 relative lift = 17.3%

WorkedThe observed lift ($+2.0$ pp, $+17.3\%$) is even larger than the true $+1.8$ pp — ordinary sampling noise. The job of the CI and test is to decide whether a gap this size is credibly more than noise.

3.3Confidence intervals

A 95% Wald interval for each rate, and one for the difference (independent variances add).

cell 3 — 95% confidence intervalsPython · NumPy/SciPy

z = norm.ppf(0.975)                     # 1.96 — replaces the printed z-table

def ci(p_hat, se):
    return (p_hat - z*se, p_hat + z*se)

ci_A, ci_B = ci(p_hat_A, se_A), ci(p_hat_B, se_B)
se_diff = np.sqrt(p_hat_A*(1-p_hat_A)/n_A + p_hat_B*(1-p_hat_B)/n_B)
ci_diff = (diff - z*se_diff, diff + z*se_diff)

print(f"A 95% CI: [{ci_A[0]:.4f}, {ci_A[1]:.4f}]")
print(f"B 95% CI: [{ci_B[0]:.4f}, {ci_B[1]:.4f}]")
print(f"diff 95% CI: [{ci_diff[0]:.4f}, {ci_diff[1]:.4f}]")

A 95% CI: [0.1065, 0.1243]
B 95% CI: [0.1259, 0.1449]
diff 95% CI: [0.0070, 0.0330]

Read itThe difference interval $[0.0070,\,0.0330]$ excludes $0$, so we are 95% confident B's true rate is higher — by somewhere between $0.7$ and $3.3$ percentage points. The single-arm intervals barely overlap, which already hints the effect is real.

3.4Two-proportion hypothesis test

The pooled-variance $z$-test, plus its $\chi^2$ and Welch two-sample-mean equivalents to show they agree.

cell 4 — z-test, χ² and t cross-checkPython · SciPy

# pooled proportion under H0: p_A == p_B
p_pool = (x_A + x_B) / (n_A + n_B)
se0    = np.sqrt(p_pool*(1-p_pool)*(1/n_A + 1/n_B))
z_stat = diff / se0
p_val  = 2 * (1 - norm.cdf(abs(z_stat)))      # two-sided
print(f"pooled p = {p_pool:.4f}   z = {z_stat:.3f}   p-value = {p_val:.4f}")

# equivalent chi-square test on the 2x2 contingency table
table = [[x_A, n_A - x_A], [x_B, n_B - x_B]]
chi2, p_chi, _, _ = stats.chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.3f}  (= z^2 = {z_stat**2:.3f})   p = {p_chi:.4f}")

pooled p = 0.1254 z = 3.020 p-value = 0.0025
chi2 = 9.118 (= z^2 = 9.118) p = 0.0025

Worked$z=3.02$ means the observed lift is just over three standard errors from zero. By the empirical rule that is deep in the tail — $p=0.0025 < 0.05$, so we reject $H_0$. Note $z^2=9.12$ reproduces the $\chi^2$ statistic exactly, confirming the two formulations are the same test.

3.5Power & required sample size — with a Monte-Carlo check

How much power did $n=5000$ buy us against the true effect, and how many users would 80% power need? Then we simulate the whole experiment 20,000 times to confirm the analytic power and the 5% false-positive rate — the LLN turning long-run frequencies into the probabilities the theory predicts.

cell 5 — power, sample size, Monte-Carlo validationPython · SciPy + Monte Carlo

alpha, target_power = 0.05, 0.80
za, zb = norm.ppf(1-alpha/2), norm.ppf(target_power)

def required_n(p1, p2):                  # per-arm n for target power
    pbar = (p1 + p2) / 2
    num = (za*np.sqrt(2*pbar*(1-pbar)) + zb*np.sqrt(p1*(1-p1)+p2*(1-p2)))**2
    return num / (p2 - p1)**2

def power_at(p1, p2, n):                  # analytic power at given n
    pbar = (p1 + p2) / 2
    se0  = np.sqrt(2*pbar*(1-pbar)/n)
    se1  = np.sqrt(p1*(1-p1)/n + p2*(1-p2)/n)
    d = abs(p2 - p1)
    return 1 - norm.cdf((za*se0 - d)/se1) + norm.cdf((-za*se0 - d)/se1)

n_req = np.ceil(required_n(p_A, p_B))
print(f"analytic power at n=5000 : {power_at(p_A, p_B, 5000):.3f}")
print(f"n per arm for 80% power  : {int(n_req)}")

# Monte Carlo: re-run the experiment many times, measure rejection rates
def reject_rate(pa, pb, n, reps=20000):
    sa = rng.binomial(n, pa, reps); sb = rng.binomial(n, pb, reps)
    pa_h, pb_h = sa/n, sb/n
    pp = (sa + sb) / (2*n)
    se = np.sqrt(pp*(1-pp)*(2/n))
    zz = (pb_h - pa_h) / se
    return (np.abs(zz) > za).mean()       # fraction rejected

print(f"MC power  (p_B=0.138) : {reject_rate(p_A, p_B, 5000):.3f}")
print(f"MC type-I (p_B=0.120) : {reject_rate(p_A, p_A, 5000):.3f}")

analytic power at n=5000 : 0.766
n per arm for 80% power : 5443
MC power (p_B=0.138) : 0.76x
MC type-I (p_B=0.120) : 0.05x

Read itAt $n=5000$ the test had only ~77% power for this effect — a real lift had roughly a 1-in-4 chance of being missed. To hit the conventional 80% we would need $\approx5443$ per arm. The Monte-Carlo rejection rate lands on the analytic power, and when B is set equal to A the rejection rate collapses to the $\alpha=0.05$ false-positive rate — the test is calibrated, and the LLN is doing the validating.

4 · Results & interpretation

The complete numerical picture from the run above, with the decision and its caveats.

Quantity	Arm A (control)	Arm B (treatment)	Notes
Visitors $n$	5000	5000	50/50 randomised
Conversions	577	677	observed counts
Rate $\hat p$	0.1154	0.1354	mean of indicators
Std error	0.0045	0.0048	$\sqrt{\hat p(1-\hat p)/n}$
95% CI	[0.1065, 0.1243]	[0.1259, 0.1449]	Wald, $z=1.96$
Absolute lift $\hat p_B-\hat p_A$	+0.0200 95% CI [0.0070, 0.0330]		excludes 0
Relative lift	+17.3%		vs control
Pooled rate (under $H_0$)	0.1254		$(S_A+S_B)/(n_A+n_B)$
Test statistic	$z=3.02$ · $\chi^2_1=9.12$		$z^2=\chi^2$
$p$-value (two-sided)	0.0025		$< 0.05$ → reject $H_0$
Power at $n=5000$	0.766		for true $\delta=0.018$
$n$ / arm for 80% power	5443		analytic
Monte-Carlo power	~0.76		20k simulated experiments
Monte-Carlo type-I rate	~0.05		matches $\alpha$ → calibrated

The decision

The redesigned checkout (B) converts at 13.5% versus 11.5% for the control — a +2.0 percentage-point, +17% relative lift. The 95%-confidence interval for the difference, $[0.7\%,\,3.3\%]$, sits entirely above zero, and the two-proportion test gives $z=3.02$, $p=0.0025$. We reject the null hypothesis of equal conversion rates: the evidence that B beats A is strong, and the plausible size of the win is "between a small and a sizeable improvement."

Honest caveat on power. The test was significant, but the design was slightly under-powered (~77% < the conventional 80%) for the true effect. The win was detected here, but a run like this would miss a genuine effect roughly one time in four. The lesson is to fix the sample size before the experiment from the smallest lift worth detecting — not to read power off the result after the fact. A pre-registered $n\approx5500$ per arm would have made the design properly powered.

CLT lensNone of the visitor-level data is normal — each visitor is a $0/1$ Bernoulli outcome. Yet every interval, $z$-value and power figure above leans on a normal approximation. That is the Central Limit Theorem earning its keep: the sample mean $\hat p$ of thousands of Bernoulli trials is very nearly normal, so the standardised difference behaves like a draw from $N(0,1)$ even though the raw data could not look less bell-shaped.

↳ demo: CLT — sampling distribution of the mean ↳ demo: normal & 68–95–99.7 ↳ demo: LLN

5 · Mapping to learning outcomes

Where each course thread shows up in the project — and the demo or session that introduces it.

T1 · DescribeConversion rate as the mean of $0/1$ indicators; standard error as a measure of spread of the estimate. S1–S3 · desc demo
T2 · ProbabilityRandomised assignment, independence of trials, and "$p$-value as a tail probability under $H_0$." S6–S8
T3 · Bernoulli/BinomialEach visitor is Bernoulli$(p)$; arm totals are Binomial$(n,p)$ with variance $np(1-p)$. S9–S12 · binomial demo
T4 · NormalStandardising the difference to a $z$-score and reading tail probabilities; the empirical rule behind "3 SEs is a lot." S17 · normal demo
T6 · Sampling/CLT$\mathrm{Var}(\hat p)=p(1-p)/n$, variances of independent arms add, and $\hat p$ is asymptotically normal — the engine of every CI and test here. S23–S26 · CLT demo
T6 · LLNThe Monte-Carlo rejection rates converge to the analytic power and to $\alpha$ as reps grow. S25–S26 · LLN demo
Py T1–T3NumPy arrays & vectorised summaries, SciPy norm/chi2_contingency, and Monte-Carlo validation with a seeded generator. S5, S19, S27

Generic skills demonstrated

Think analytically Think critically Make data-driven decisions Use statistical software

6 · Extensions

Natural next steps that build on the same simulation, each pointing toward a later BCSAI course.

ABayesian A/B testing

Put a $\mathrm{Beta}(1,1)$ (uniform) prior on each rate. Because the Beta is conjugate to the Binomial, the posterior is simply $\mathrm{Beta}(1+S_g,\,1+n_g-S_g)$ — no integration needed. Sample both posteriors and estimate $P(p_B>p_A)$ directly, a more intuitive answer than a $p$-value.

extension — Beta-Binomial posteriorPython · NumPy

post_A = rng.beta(1 + x_A, 1 + n_A - x_A, 200000)
post_B = rng.beta(1 + x_B, 1 + n_B - x_B, 200000)
print(f"P(p_B > p_A) = {(post_B > post_A).mean():.3f}")   # ~0.999
print(f"expected lift = {(post_B - post_A).mean():.4f}")

Connects toconditional probability and Bayes' theorem (T2) — the posterior is Bayes' rule applied to a whole distribution. See the conditional-probability demo.

BMultiple testing

Real experiments test many variants or many metrics at once. If you run $m$ independent tests at $\alpha=0.05$, the chance of at least one false positive is $1-(1-0.05)^m$ — already $0.40$ at $m=10$. Control it with a Bonferroni threshold $\alpha/m$, or the less conservative Benjamini–Hochberg false-discovery-rate procedure.

$P(\text{any false positive})=1-(1-\alpha)^m$ · Bonferroni: reject if $p_i<\alpha/m$.

Connects tothe complement and independence rules (T2): the family-wise error rate is just $1-P(\text{no false positives})$.

CA continuous metric & further ideas

Swap the binary outcome for revenue-per-visitor (right-skewed, positive) and compare arms with a Welch two-sample $t$-test via stats.ttest_ind(..., equal_var=False) — the same logic, now on means of a non-Bernoulli variable, again justified by the CLT. Other directions: sequential testing (peeking), stratification/CUPED for variance reduction, and bootstrapping the difference instead of assuming normality.

↳ demo: covariance & correlation ↳ demo: continuous PDF & CDF ↳ course: S27 Monte Carlo

7 · References

Course texts plus the tools used; section tags point back to the sessions this project draws on.

Compulsory
Devore, Jay L. (2016). Probability and Statistics for Engineering and the Sciences, 9th ed. Cengage Learning.
Ch.1 — descriptive stats · proportion as mean Ch.2 — probability & independence Ch.3 — Bernoulli/Binomial Ch.4 — normal & standardisation Ch.5 §5.3–5.5 — sampling distribution & CLT
Recommended
Sundnes, J. (2020). Introduction to Scientific Programming with Python. Springer Open.
NumPy arrays · S5 scipy.stats · S19 Monte Carlo · S27
Tools
NumPy (numpy.random.default_rng, binomial, beta) · SciPy (scipy.stats.norm, chi2_contingency, ttest_ind). All outputs on this page were produced with numpy.random.default_rng(42).
See also
This site's interactive demos and course outline; the official syllabus PDF.