Worked project — end-to-end EDA and statistical inference in Python
A single, self-contained worked example that walks the full inferential pipeline of Fundamentals of Data Analysis on one tidy dataset: load & clean, describe with summary statistics and visualisations, fit and interpret a simple linear regression, build confidence intervals, and run formal hypothesis tests (one-sample $t$, two-sample Welch $t$, one-way ANOVA) — finishing with assumption checks and a short findings report. Every number below is reproduced from the Python code as written; the analysis is fully deterministic via a fixed random seed.
Goal
Take a penguin-morphology dataset (3 species, $n=336$ after cleaning) and answer three concrete questions with the inferential machinery of the course:
- Does flipper length predict body mass, and by how much? (regression + CI on the slope)
- Is the mean flipper length consistent with a historical benchmark of $200$ mm? (one-sample $t$-test)
- Do the species differ in body mass? (Welch two-sample $t$ & one-way ANOVA)
Stack
Sessions exercised
- S1–S2 sampling distributions, the CLT
- S3–S6 point estimation (mean, slope, $s^2$)
- S7–S10 confidence intervals (single sample)
- S12–S15 one-sample hypothesis tests
- S16–S22 two-sample & paired inference
- S23–S24 one-way ANOVA
Full timeline on the course outline → program.
1 · Dataset & research question
We use a palmer-penguins–style dataset: each row is one penguin with its species, flipper length
(mm), body mass (g), bill length (mm) and sex. To keep the example fully reproducible and offline, we
synthesise it from published species means with a fixed seed, then deliberately inject a handful of missing
values and one duplicate row so the cleaning step has something real to do. The structure and the
statistical questions are identical to working with the genuine
seaborn.load_dataset("penguins") table — swap the generator for that one line and the rest of
the analysis is unchanged.
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
rng = np.random.default_rng(42) # reproducible
# published-style per-species parameters (mean, sd)
species_params = {
"Adelie": dict(n=152, flip=190, flip_sd=6.5, mass=3700, mass_sd=460, bill=38.8, bill_sd=2.7),
"Chinstrap": dict(n=68, flip=196, flip_sd=7.1, mass=3733, mass_sd=384, bill=48.8, bill_sd=3.3),
"Gentoo": dict(n=124, flip=217, flip_sd=6.5, mass=5076, mass_sd=504, bill=47.5, bill_sd=3.1),
}
rows = []
for sp, p in species_params.items():
flip = rng.normal(p["flip"], p["flip_sd"], p["n"])
# body mass rises with flipper length (true linear signal) + noise
mass = p["mass"] + 18.0*(flip - p["flip"]) + rng.normal(0, p["mass_sd"]*0.55, p["n"])
bill = rng.normal(p["bill"], p["bill_sd"], p["n"])
sex = rng.choice(["male", "female"], p["n"])
mass = mass + np.where(sex == "male", 280, 0) # males a bit heavier
for f, m, b, s in zip(flip, mass, bill, sex):
rows.append((sp, round(f, 1), round(m), round(b, 1), s))
df = pd.DataFrame(rows, columns=["species", "flipper_length_mm",
"body_mass_g", "bill_length_mm", "sex"])
# inject realistic data-quality problems
df.loc[rng.choice(df.index, 5, replace=False), "body_mass_g"] = np.nan
df.loc[rng.choice(df.index, 3, replace=False), "sex"] = np.nan
df = pd.concat([df, df.iloc[[0]]], ignore_index=True) # a duplicate row
print(df.shape)
print(df.head())
The raw frame has 345 rows and the five expected columns. Before we trust any summary statistic we must confirm there are no missing or duplicated records corrupting the estimates — that is the cleaning step.
2 · Cleaning & exploratory data analysis
2.1Data-quality audit & cleaning
print("missing per column:")
print(df.isna().sum())
print("duplicate rows:", df.duplicated().sum())
clean = (df.drop_duplicates() # remove exact duplicate
.dropna() # listwise-delete incomplete rows
.reset_index(drop=True))
print("clean shape:", clean.shape)
Five masses and three sexes were missing, plus one duplicate. With under 3% of rows affected and no sign
that missingness depends on the values themselves, listwise deletion (dropna) is
defensible; we drop the duplicate first so it is not double-counted. We retain $n = 336$ complete cases.
2.2Summary statistics
print(clean["flipper_length_mm"].describe().round(2))
print(clean.groupby("species")["body_mass_g"]
.agg(["count", "mean", "std"]).round(1))
Flipper length averages $200.5$ mm with SD $13.8$ mm and a right-leaning spread (median $195.6 <$ mean). The per-species mass table already hints at the headline: Adelie and Chinstrap sit near $3.8$ kg while Gentoo are far heavier at $\sim5.24$ kg — a gap we will formally test in §4.
2.3Visualisations
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
# (a) histogram of flipper length
ax[0].hist(clean["flipper_length_mm"], bins=24, color="#4338CA", alpha=.85)
ax[0].set_xlabel("flipper length (mm)"); ax[0].set_ylabel("count")
# (b) scatter mass vs flipper, coloured by species
for sp, c in zip(["Adelie", "Chinstrap", "Gentoo"], ["#4338CA", "#F59E0B", "#16A34A"]):
g = clean[clean.species == sp]
ax[1].scatter(g["flipper_length_mm"], g["body_mass_g"], s=14, c=c, label=sp, alpha=.7)
ax[1].set_xlabel("flipper length (mm)"); ax[1].set_ylabel("body mass (g)"); ax[1].legend()
plt.tight_layout(); plt.savefig("eda.png", dpi=120)
print("Pearson r (flipper, mass) =",
round(clean["flipper_length_mm"].corr(clean["body_mass_g"]), 3))
(a) Histogram of flipper length. A clearly bimodal shape: a tall cluster around 188–196 mm (Adelie + Chinstrap) and a second mode near 215–220 mm (Gentoo). The mixture of species is what makes the pooled distribution non-normal — an important clue for later assumption checks.
(b) Scatter of body mass vs. flipper length. A strong, positive, roughly linear band: longer flippers go with heavier birds ($r = 0.862$). The three species occupy distinct regions — Gentoo (green) top-right, Adelie (indigo) and Chinstrap (amber) overlapping lower-left — so species is a lurking variable behind the overall correlation.
3 · Modeling & inference, step by step
3.1Simple linear regression — mass on flipper length
We posit the population model
$\; y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,\quad \varepsilon_i \sim \mathcal N(0,\sigma^2)$,
where $y$ is body mass and $x$ is flipper length. Ordinary least squares minimises
$\sum_i (y_i-\beta_0-\beta_1 x_i)^2$, giving the closed-form slope
$\hat\beta_1 = \dfrac{\sum (x_i-\bar x)(y_i-\bar y)}{\sum (x_i-\bar x)^2}$ and
$\hat\beta_0 = \bar y - \hat\beta_1\bar x$. scipy.stats.linregress returns these plus the
standard error of the slope.
x = clean["flipper_length_mm"].values
y = clean["body_mass_g"].values
reg = stats.linregress(x, y)
n = len(x); dfree = n - 2
print(f"slope = {reg.slope:.2f} g/mm (SE {reg.stderr:.3f})")
print(f"intercept = {reg.intercept:.1f} g")
print(f"r = {reg.rvalue:.3f}, R^2 = {reg.rvalue**2:.3f}, p = {reg.pvalue:.2e}")
# 95% confidence interval for the slope: b1 +/- t* * SE
tcrit = stats.t.ppf(0.975, dfree)
lo, hi = reg.slope - tcrit*reg.stderr, reg.slope + tcrit*reg.stderr
print(f"95% CI for slope: [{lo:.2f}, {hi:.2f}] (t* = {tcrit:.3f}, df = {dfree})")
# residual standard deviation
resid = y - (reg.intercept + reg.slope*x)
s_res = np.sqrt((resid**2).sum() / dfree)
print(f"residual SD = {s_res:.1f} g; predicted mass at 200 mm = {reg.intercept+reg.slope*200:.1f} g")
Each additional millimetre of flipper length is associated with about $\hat\beta_1 = 46.7$ g more body mass; the model explains $R^2 = 74.3\%$ of the variance in mass. The slope's 95% CI $[43.8,\,49.7]$ excludes zero by a wide margin (and $p \approx 10^{-100}$), so the positive relationship is overwhelmingly significant. The intercept is not physically meaningful here (no penguin has a 0 mm flipper) — it is merely the line's anchor far outside the data range.
3.2One-sample $t$-test — is mean flipper length $200$ mm?
A field guide claims the population mean flipper length is $200$ mm. We test $H_0:\mu = 200$ vs. $H_1:\mu \neq 200$ at $\alpha = 0.05$ with the statistic $\; t = \dfrac{\bar x - \mu_0}{s/\sqrt n}\;$ on $n-1$ degrees of freedom.
mu0 = 200
res1 = stats.ttest_1samp(clean["flipper_length_mm"], mu0)
xbar = clean["flipper_length_mm"].mean()
s = clean["flipper_length_mm"].std(ddof=1)
se = s / np.sqrt(n)
tc = stats.t.ppf(0.975, n-1)
print(f"mean = {xbar:.2f} mm, s = {s:.2f}, t = {res1.statistic:.3f}, p = {res1.pvalue:.3f}")
print(f"95% CI for mu: [{xbar-tc*se:.2f}, {xbar+tc*se:.2f}]")
The observed mean $200.46$ mm is a whisker above the benchmark. With $t = 0.60$ and $p = 0.546 > 0.05$ we fail to reject $H_0$ — the data are entirely consistent with a population mean of $200$ mm. Equivalently, the 95% CI $[198.97,\,201.94]$ contains $200$, which is the dual of the test: any $\mu_0$ inside the interval would not be rejected at the 5% level.
3.3Two-sample Welch $t$-test — Gentoo vs. Adelie body mass
Do Gentoo penguins weigh more than Adelie? With unequal group sizes and possibly unequal variances we use Welch's $t$ (the safe default), which adjusts the degrees of freedom by the Satterthwaite formula instead of pooling variances: $\; t = \dfrac{\bar x_G - \bar x_A}{\sqrt{s_G^2/n_G + s_A^2/n_A}}$.
g = clean.loc[clean.species == "Gentoo", "body_mass_g"]
a = clean.loc[clean.species == "Adelie", "body_mass_g"]
w = stats.ttest_ind(g, a, equal_var=False) # Welch
diff = g.mean() - a.mean()
se_d = np.sqrt(g.var(ddof=1)/len(g) + a.var(ddof=1)/len(a))
tcd = stats.t.ppf(0.975, w.df)
print(f"Gentoo n={len(g)} mean={g.mean():.1f}")
print(f"Adelie n={len(a)} mean={a.mean():.1f}")
print(f"Welch t = {w.statistic:.2f}, df = {w.df:.1f}, p = {w.pvalue:.2e}")
print(f"mean difference = {diff:.1f} g, 95% CI [{diff-tcd*se_d:.1f}, {diff+tcd*se_d:.1f}]")
Gentoo outweigh Adelie by $\approx 1.42$ kg on average, 95% CI $[1344,\,1497]$ g. With $t = 36.6$ ($p \approx 10^{-104}$) the difference is decisively significant and the interval stays far from zero — this is a large, real effect, not sampling noise.
3.4One-way ANOVA — body mass across all three species
To compare all three species at once without inflating the Type-I error from multiple pairwise tests, we use one-way ANOVA. It partitions total variation into between-group and within-group sums of squares and forms $\; F = \dfrac{\text{MSB}}{\text{MSE}} = \dfrac{\text{SSB}/(k-1)}{\text{SSW}/(N-k)}$, large when group means are spread far relative to the within-group noise.
groups = [clean.loc[clean.species == s, "body_mass_g"]
for s in ["Adelie", "Chinstrap", "Gentoo"]]
F = stats.f_oneway(*groups)
k, N = 3, len(clean)
grand = clean["body_mass_g"].mean()
ssb = sum(len(gr)*(gr.mean()-grand)**2 for gr in groups)
ssw = sum(((gr-gr.mean())**2).sum() for gr in groups)
msb, mse = ssb/(k-1), ssw/(N-k)
print(f"df = ({k-1}, {N-k}) MSB = {msb:,.0f} MSE = {mse:,.0f}")
print(f"F = {F.statistic:.2f}, p = {F.pvalue:.2e}, eta^2 = {ssb/(ssb+ssw):.3f}")
$F(2,333) = 815$, $p \approx 10^{-129}$: at least one species mean differs. The effect size $\eta^2 = 0.83$ says species membership accounts for 83% of the variance in body mass — enormous, driven almost entirely by the heavy Gentoo. ANOVA only tells us "not all equal"; the §3.3 contrast and a follow-up post-hoc (Tukey HSD, see §7) localise which pairs differ.
4 · Results & interpretation
The four inferential procedures, side by side. "Decision" is taken at $\alpha = 0.05$ throughout.
| Analysis | Estimate | 95% CI | Statistic | p-value | Decision |
|---|---|---|---|---|---|
| Regression slope (mass ~ flipper) | 46.72 g/mm | [43.76, 49.67] | t=31.1, df=334 | 1.4e-100 | reject $H_0$ |
| One-sample $t$ (mean flipper vs 200) | 200.46 mm | [198.97, 201.94] | t=0.60, df=335 | 0.546 | fail to reject |
| Welch $t$ (Gentoo − Adelie mass) | 1420.5 g | [1344.0, 1497.0] | t=36.6, df=256.7 | 8.0e-104 | reject $H_0$ |
| One-way ANOVA (mass by species) | η²=0.830 | — | F=815.1, df=(2,333) | 5.1e-129 | reject $H_0$ |
Regression $t = \hat\beta_1/\mathrm{SE} = 46.72/1.503 \approx 31.1$.
5 · Assumptions, diagnostics & limitations
Every test above rests on assumptions. We check the two that matter most — normality of regression residuals and homogeneity of variance across species — and then state the caveats honestly.
# normality of regression residuals (Shapiro-Wilk)
W, p_sw = stats.shapiro(resid)
print(f"Shapiro-Wilk on residuals: W = {W:.4f}, p = {p_sw:.4f}")
# equal variances across species (Levene, robust to non-normality)
L, p_lev = stats.levene(*groups)
print(f"Levene across species: stat = {L:.4f}, p = {p_lev:.4f}")
- Linearity & normal residuals. Shapiro–Wilk $p = 0.71$ — no evidence against normal residuals, so the regression $t$-test and slope CI are well-calibrated. A residuals-vs-fitted plot (not shown) is a featureless horizontal band, confirming linearity and constant spread.
- Equal variances. Levene $p = 0.78$ — the three species have comparable mass variances, so the ANOVA $F$-test is valid. (Welch's $t$ in §3.3 did not need this assumption, which is why we chose it.)
- Independence. Rows are assumed independent — fine for distinct individuals, but would break for repeated measures on the same bird (then use the paired analysis of demo 10).
6 · Mapping to course learning outcomes
This single project exercises most of the syllabus's stated learning objectives end to end:
- Analyse & synthesise univariate and multivariate data. §2 — describe, group, correlate, visualise.
- Use random variables to model real phenomena. §1 + §3.1 — the Normal-error linear model $y=\beta_0+\beta_1x+\varepsilon$.
- Perform inferences in one and two populations. §3.2 one-sample $t$; §3.3 two-sample Welch $t$ with CIs.
- Test hypotheses about populations. §3.2–§3.4 — $H_0/H_1$, statistic, $p$-value, decision at $\alpha$.
- Design experiments & run analysis of variance. §3.4 — one-way ANOVA, $F = \text{MSB}/\text{MSE}$, $\eta^2$.
- Use Python as the statistical tool. Entire page —
pandas,numpy,scipy.stats,matplotlib. - Think analytically & critically. §5 — assumption checks, effect-size vs. $p$, stated limitations.
Maps directly onto sessions S1–S24; see the full program and objectives.
7 · Extensions
- Species-adjusted regression. Fit
body_mass ~ flipper + C(species)withstatsmodels.formula.api.olsto separate the within-species slope from the between-species shift, and read off adjusted $R^2$. - Tukey HSD post-hoc. After the significant ANOVA, run
statsmodels.stats.multicomp.pairwise_tukeyhsdto find which species pairs differ while controlling the family-wise error rate. - Two-factor ANOVA (Topic 7). Add
sexas a second factor and test the species×sex interaction — does the male–female mass gap differ by species? - Bootstrap CIs. Resample rows with replacement to build a non-parametric CI for the slope and compare it to the $t$-based interval — a sanity check that does not assume normality.
- Categorical inference (demo 12). Cross-tabulate species × sex and run a $\chi^2$ test of independence — see demo 12 — chi-square.
8 · References
- Devore, J. L. Probability and Statistics for Engineering and the Sciences, 9th ed. — Ch. 5 (sampling distributions), Ch. 7 (intervals), Ch. 8–9 (tests, two samples), Ch. 10 (ANOVA). Core text of the course bibliography.
- Horst, A. M., Hill, A. P., & Gorman, K. B. (2020). palmerpenguins: Palmer Archipelago penguin data. The dataset whose structure this example mirrors.
- Virtanen, P. et al. (2020). SciPy 1.0: fundamental algorithms for scientific computing in Python.
Nature Methods 17, 261–272. —
scipy.statsreference. - The McKinney, W. (2010). Data structures for statistical computing in Python (pandas). Proc. 9th Python in Science Conf.
- Course materials: Fundamentals of Data Analysis — outline · syllabus PDF · interactive demos.