data-analysis-lab worked project · EDA → regression → inference

Worked project — end-to-end EDA and statistical inference in Python

A single, self-contained worked example that walks the full inferential pipeline of Fundamentals of Data Analysis on one tidy dataset: load & clean, describe with summary statistics and visualisations, fit and interpret a simple linear regression, build confidence intervals, and run formal hypothesis tests (one-sample $t$, two-sample Welch $t$, one-way ANOVA) — finishing with assumption checks and a short findings report. Every number below is reproduced from the Python code as written; the analysis is fully deterministic via a fixed random seed.

Goal

Take a penguin-morphology dataset (3 species, $n=336$ after cleaning) and answer three concrete questions with the inferential machinery of the course:

  1. Does flipper length predict body mass, and by how much? (regression + CI on the slope)
  2. Is the mean flipper length consistent with a historical benchmark of $200$ mm? (one-sample $t$-test)
  3. Do the species differ in body mass? (Welch two-sample $t$ & one-way ANOVA)

Stack

Python 3pandasnumpy scipy.statsmatplotlib

Sessions exercised

  • S1–S2 sampling distributions, the CLT
  • S3–S6 point estimation (mean, slope, $s^2$)
  • S7–S10 confidence intervals (single sample)
  • S12–S15 one-sample hypothesis tests
  • S16–S22 two-sample & paired inference
  • S23–S24 one-way ANOVA

Full timeline on the course outline → program.

How to read this page. Each step shows the real Python first, then its console output in a monospaced box, then a one-paragraph interpretation. Plots are described rather than embedded so the page stays a single static file; the plotting code that produces them is included verbatim. The matching live, draggable versions of each concept live in the interactive demos.

1 · Dataset & research question

We use a palmer-penguins–style dataset: each row is one penguin with its species, flipper length (mm), body mass (g), bill length (mm) and sex. To keep the example fully reproducible and offline, we synthesise it from published species means with a fixed seed, then deliberately inject a handful of missing values and one duplicate row so the cleaning step has something real to do. The structure and the statistical questions are identical to working with the genuine seaborn.load_dataset("penguins") table — swap the generator for that one line and the rest of the analysis is unchanged.

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)            # reproducible

# published-style per-species parameters (mean, sd)
species_params = {
    "Adelie":    dict(n=152, flip=190, flip_sd=6.5, mass=3700, mass_sd=460, bill=38.8, bill_sd=2.7),
    "Chinstrap": dict(n=68,  flip=196, flip_sd=7.1, mass=3733, mass_sd=384, bill=48.8, bill_sd=3.3),
    "Gentoo":    dict(n=124, flip=217, flip_sd=6.5, mass=5076, mass_sd=504, bill=47.5, bill_sd=3.1),
}

rows = []
for sp, p in species_params.items():
    flip = rng.normal(p["flip"], p["flip_sd"], p["n"])
    # body mass rises with flipper length (true linear signal) + noise
    mass = p["mass"] + 18.0*(flip - p["flip"]) + rng.normal(0, p["mass_sd"]*0.55, p["n"])
    bill = rng.normal(p["bill"], p["bill_sd"], p["n"])
    sex  = rng.choice(["male", "female"], p["n"])
    mass = mass + np.where(sex == "male", 280, 0)      # males a bit heavier
    for f, m, b, s in zip(flip, mass, bill, sex):
        rows.append((sp, round(f, 1), round(m), round(b, 1), s))

df = pd.DataFrame(rows, columns=["species", "flipper_length_mm",
                                 "body_mass_g", "bill_length_mm", "sex"])

# inject realistic data-quality problems
df.loc[rng.choice(df.index, 5, replace=False), "body_mass_g"] = np.nan
df.loc[rng.choice(df.index, 3, replace=False), "sex"] = np.nan
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)   # a duplicate row

print(df.shape)
print(df.head())
(345, 5) species flipper_length_mm body_mass_g bill_length_mm sex 0 Adelie 192.4 3893.0 36.7 male 1 Adelie 184.1 3265.0 41.2 female 2 Adelie 188.0 3624.0 37.9 male 3 Adelie 196.6 4263.0 38.0 male 4 Adelie 183.2 3279.0 42.1 female

The raw frame has 345 rows and the five expected columns. Before we trust any summary statistic we must confirm there are no missing or duplicated records corrupting the estimates — that is the cleaning step.

2 · Cleaning & exploratory data analysis

2.1Data-quality audit & cleaning

print("missing per column:")
print(df.isna().sum())
print("duplicate rows:", df.duplicated().sum())

clean = (df.drop_duplicates()      # remove exact duplicate
           .dropna()               # listwise-delete incomplete rows
           .reset_index(drop=True))
print("clean shape:", clean.shape)
missing per column: species 0 flipper_length_mm 0 body_mass_g 5 bill_length_mm 0 sex 3 dtype: int64 duplicate rows: 1 clean shape: (336, 5)

Five masses and three sexes were missing, plus one duplicate. With under 3% of rows affected and no sign that missingness depends on the values themselves, listwise deletion (dropna) is defensible; we drop the duplicate first so it is not double-counted. We retain $n = 336$ complete cases.

2.2Summary statistics

print(clean["flipper_length_mm"].describe().round(2))
print(clean.groupby("species")["body_mass_g"]
            .agg(["count", "mean", "std"]).round(1))
count 336.00 mean 200.46 std 13.83 min 176.10 25% 189.00 50% 195.55 75% 213.82 max 237.70 Name: flipper_length_mm, dtype: float64 count mean std species Adelie 146 3819.8 312.4 Chinstrap 67 3835.8 280.0 Gentoo 123 5240.3 321.3

Flipper length averages $200.5$ mm with SD $13.8$ mm and a right-leaning spread (median $195.6 <$ mean). The per-species mass table already hints at the headline: Adelie and Chinstrap sit near $3.8$ kg while Gentoo are far heavier at $\sim5.24$ kg — a gap we will formally test in §4.

2.3Visualisations

fig, ax = plt.subplots(1, 2, figsize=(10, 4))

# (a) histogram of flipper length
ax[0].hist(clean["flipper_length_mm"], bins=24, color="#4338CA", alpha=.85)
ax[0].set_xlabel("flipper length (mm)"); ax[0].set_ylabel("count")

# (b) scatter mass vs flipper, coloured by species
for sp, c in zip(["Adelie", "Chinstrap", "Gentoo"], ["#4338CA", "#F59E0B", "#16A34A"]):
    g = clean[clean.species == sp]
    ax[1].scatter(g["flipper_length_mm"], g["body_mass_g"], s=14, c=c, label=sp, alpha=.7)
ax[1].set_xlabel("flipper length (mm)"); ax[1].set_ylabel("body mass (g)"); ax[1].legend()
plt.tight_layout(); plt.savefig("eda.png", dpi=120)

print("Pearson r (flipper, mass) =",
      round(clean["flipper_length_mm"].corr(clean["body_mass_g"]), 3))
Pearson r (flipper, mass) = 0.862
Figure 1 — described

(a) Histogram of flipper length. A clearly bimodal shape: a tall cluster around 188–196 mm (Adelie + Chinstrap) and a second mode near 215–220 mm (Gentoo). The mixture of species is what makes the pooled distribution non-normal — an important clue for later assumption checks.

(b) Scatter of body mass vs. flipper length. A strong, positive, roughly linear band: longer flippers go with heavier birds ($r = 0.862$). The three species occupy distinct regions — Gentoo (green) top-right, Adelie (indigo) and Chinstrap (amber) overlapping lower-left — so species is a lurking variable behind the overall correlation.

Cross-link. The shape of a sampling distribution and the meaning of $r$-driven scatter are explored interactively in demo 1 — CLT.

3 · Modeling & inference, step by step

3.1Simple linear regression — mass on flipper length

We posit the population model $\; y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,\quad \varepsilon_i \sim \mathcal N(0,\sigma^2)$, where $y$ is body mass and $x$ is flipper length. Ordinary least squares minimises $\sum_i (y_i-\beta_0-\beta_1 x_i)^2$, giving the closed-form slope $\hat\beta_1 = \dfrac{\sum (x_i-\bar x)(y_i-\bar y)}{\sum (x_i-\bar x)^2}$ and $\hat\beta_0 = \bar y - \hat\beta_1\bar x$. scipy.stats.linregress returns these plus the standard error of the slope.

x = clean["flipper_length_mm"].values
y = clean["body_mass_g"].values
reg = stats.linregress(x, y)

n  = len(x); dfree = n - 2
print(f"slope     = {reg.slope:.2f} g/mm  (SE {reg.stderr:.3f})")
print(f"intercept = {reg.intercept:.1f} g")
print(f"r = {reg.rvalue:.3f},  R^2 = {reg.rvalue**2:.3f},  p = {reg.pvalue:.2e}")

# 95% confidence interval for the slope: b1 +/- t* * SE
tcrit = stats.t.ppf(0.975, dfree)
lo, hi = reg.slope - tcrit*reg.stderr, reg.slope + tcrit*reg.stderr
print(f"95% CI for slope: [{lo:.2f}, {hi:.2f}]   (t* = {tcrit:.3f}, df = {dfree})")

# residual standard deviation
resid = y - (reg.intercept + reg.slope*x)
s_res = np.sqrt((resid**2).sum() / dfree)
print(f"residual SD = {s_res:.1f} g;  predicted mass at 200 mm = {reg.intercept+reg.slope*200:.1f} g")
slope = 46.72 g/mm (SE 1.503) intercept = -5021.6 g r = 0.862, R^2 = 0.743, p = 1.36e-100 95% CI for slope: [43.76, 49.67] (t* = 1.967, df = 334) residual SD = 380.4 g; predicted mass at 200 mm = 4321.7 g

Each additional millimetre of flipper length is associated with about $\hat\beta_1 = 46.7$ g more body mass; the model explains $R^2 = 74.3\%$ of the variance in mass. The slope's 95% CI $[43.8,\,49.7]$ excludes zero by a wide margin (and $p \approx 10^{-100}$), so the positive relationship is overwhelmingly significant. The intercept is not physically meaningful here (no penguin has a 0 mm flipper) — it is merely the line's anchor far outside the data range.

Cross-link. The width of this slope CI shrinking like $1/\sqrt n$ is the same mechanism in demo 5 — CI width; the meaning of "95% confident" is in demo 4 — CI coverage.

3.2One-sample $t$-test — is mean flipper length $200$ mm?

A field guide claims the population mean flipper length is $200$ mm. We test $H_0:\mu = 200$ vs. $H_1:\mu \neq 200$ at $\alpha = 0.05$ with the statistic $\; t = \dfrac{\bar x - \mu_0}{s/\sqrt n}\;$ on $n-1$ degrees of freedom.

mu0  = 200
res1 = stats.ttest_1samp(clean["flipper_length_mm"], mu0)
xbar = clean["flipper_length_mm"].mean()
s    = clean["flipper_length_mm"].std(ddof=1)
se   = s / np.sqrt(n)
tc   = stats.t.ppf(0.975, n-1)
print(f"mean = {xbar:.2f} mm,  s = {s:.2f},  t = {res1.statistic:.3f},  p = {res1.pvalue:.3f}")
print(f"95% CI for mu: [{xbar-tc*se:.2f}, {xbar+tc*se:.2f}]")
mean = 200.46 mm, s = 13.83, t = 0.604, p = 0.546 95% CI for mu: [198.97, 201.94]

The observed mean $200.46$ mm is a whisker above the benchmark. With $t = 0.60$ and $p = 0.546 > 0.05$ we fail to reject $H_0$ — the data are entirely consistent with a population mean of $200$ mm. Equivalently, the 95% CI $[198.97,\,201.94]$ contains $200$, which is the dual of the test: any $\mu_0$ inside the interval would not be rejected at the 5% level.

Cross-link. Drag $\bar x-\mu_0$ and watch the decision flip in demo 6 — one-sample test; the $p$-value as a tail area is demo 7.

3.3Two-sample Welch $t$-test — Gentoo vs. Adelie body mass

Do Gentoo penguins weigh more than Adelie? With unequal group sizes and possibly unequal variances we use Welch's $t$ (the safe default), which adjusts the degrees of freedom by the Satterthwaite formula instead of pooling variances: $\; t = \dfrac{\bar x_G - \bar x_A}{\sqrt{s_G^2/n_G + s_A^2/n_A}}$.

g = clean.loc[clean.species == "Gentoo", "body_mass_g"]
a = clean.loc[clean.species == "Adelie", "body_mass_g"]
w = stats.ttest_ind(g, a, equal_var=False)        # Welch

diff = g.mean() - a.mean()
se_d = np.sqrt(g.var(ddof=1)/len(g) + a.var(ddof=1)/len(a))
tcd  = stats.t.ppf(0.975, w.df)
print(f"Gentoo  n={len(g)}  mean={g.mean():.1f}")
print(f"Adelie  n={len(a)}  mean={a.mean():.1f}")
print(f"Welch t = {w.statistic:.2f},  df = {w.df:.1f},  p = {w.pvalue:.2e}")
print(f"mean difference = {diff:.1f} g,  95% CI [{diff-tcd*se_d:.1f}, {diff+tcd*se_d:.1f}]")
Gentoo n=123 mean=5240.3 Adelie n=146 mean=3819.8 Welch t = 36.59, df = 256.7, p = 7.96e-104 mean difference = 1420.5 g, 95% CI [1344.0, 1497.0]

Gentoo outweigh Adelie by $\approx 1.42$ kg on average, 95% CI $[1344,\,1497]$ g. With $t = 36.6$ ($p \approx 10^{-104}$) the difference is decisively significant and the interval stays far from zero — this is a large, real effect, not sampling noise.

Cross-link. Pooled vs. Welch behaviour under unequal variances is demo 9 — two-sample $t$; the paired alternative is demo 10.

3.4One-way ANOVA — body mass across all three species

To compare all three species at once without inflating the Type-I error from multiple pairwise tests, we use one-way ANOVA. It partitions total variation into between-group and within-group sums of squares and forms $\; F = \dfrac{\text{MSB}}{\text{MSE}} = \dfrac{\text{SSB}/(k-1)}{\text{SSW}/(N-k)}$, large when group means are spread far relative to the within-group noise.

groups = [clean.loc[clean.species == s, "body_mass_g"]
          for s in ["Adelie", "Chinstrap", "Gentoo"]]
F = stats.f_oneway(*groups)

k, N  = 3, len(clean)
grand = clean["body_mass_g"].mean()
ssb = sum(len(gr)*(gr.mean()-grand)**2 for gr in groups)
ssw = sum(((gr-gr.mean())**2).sum() for gr in groups)
msb, mse = ssb/(k-1), ssw/(N-k)
print(f"df = ({k-1}, {N-k})   MSB = {msb:,.0f}   MSE = {mse:,.0f}")
print(f"F = {F.statistic:.2f},  p = {F.pvalue:.2e},  eta^2 = {ssb/(ssb+ssw):.3f}")
df = (2, 333) MSB = 78,118,240 MSE = 95,841 F = 815.08, p = 5.13e-129, eta^2 = 0.830

$F(2,333) = 815$, $p \approx 10^{-129}$: at least one species mean differs. The effect size $\eta^2 = 0.83$ says species membership accounts for 83% of the variance in body mass — enormous, driven almost entirely by the heavy Gentoo. ANOVA only tells us "not all equal"; the §3.3 contrast and a follow-up post-hoc (Tukey HSD, see §7) localise which pairs differ.

Cross-link. Move group means apart and watch $F$ react in demo 11 — one-way ANOVA.

4 · Results & interpretation

The four inferential procedures, side by side. "Decision" is taken at $\alpha = 0.05$ throughout.

AnalysisEstimate95% CIStatisticp-valueDecision
Regression slope (mass ~ flipper) 46.72 g/mm [43.76, 49.67] t=31.1, df=334 1.4e-100 reject $H_0$
One-sample $t$ (mean flipper vs 200) 200.46 mm [198.97, 201.94] t=0.60, df=335 0.546 fail to reject
Welch $t$ (Gentoo − Adelie mass) 1420.5 g [1344.0, 1497.0] t=36.6, df=256.7 8.0e-104 reject $H_0$
One-way ANOVA (mass by species) η²=0.830 F=815.1, df=(2,333) 5.1e-129 reject $H_0$

Regression $t = \hat\beta_1/\mathrm{SE} = 46.72/1.503 \approx 31.1$.

Findings report (plain language). In this sample of 336 penguins, flipper length is a strong linear predictor of body mass: every extra millimetre of flipper corresponds to about 47 g more mass, and flipper length alone explains roughly three-quarters of the variation in mass. The average flipper length (200.5 mm) is statistically indistinguishable from the historical 200 mm benchmark. Body mass differs sharply across species: Gentoo penguins are about 1.4 kg heavier than Adelie, and species membership explains 83% of the mass variance overall. Practically, if you can measure a penguin's flipper you can estimate its mass to within roughly $\pm 380$ g (one residual SD), and species is the single most informative grouping variable in the dataset.

5 · Assumptions, diagnostics & limitations

Every test above rests on assumptions. We check the two that matter most — normality of regression residuals and homogeneity of variance across species — and then state the caveats honestly.

# normality of regression residuals (Shapiro-Wilk)
W, p_sw = stats.shapiro(resid)
print(f"Shapiro-Wilk on residuals: W = {W:.4f},  p = {p_sw:.4f}")

# equal variances across species (Levene, robust to non-normality)
L, p_lev = stats.levene(*groups)
print(f"Levene across species:     stat = {L:.4f},  p = {p_lev:.4f}")
Shapiro-Wilk on residuals: W = 0.9966, p = 0.7085 Levene across species: stat = 0.2456, p = 0.7824
Limitations. (1) The data are synthetic and Gaussian by construction; real penguin data are mildly skewed and the pooled flipper histogram is bimodal, so pooled normality should not be over-trusted. (2) The flipper→mass regression ignores species; a species-adjusted model (§7) would shrink the slope because part of the raw correlation is the between-species mean difference (Simpson-style confounding). (3) Listwise deletion is safe only under missing-completely-at-random; with informative missingness, imputation would be preferable. (4) $p$-values this extreme reflect both a true effect and a large $n$ — always report effect sizes (slope, mean difference, $\eta^2$) alongside them, as we do in §4.

6 · Mapping to course learning outcomes

This single project exercises most of the syllabus's stated learning objectives end to end:

Maps directly onto sessions S1–S24; see the full program and objectives.

7 · Extensions

8 · References