ml-lab worked example · end-to-end supervised ML pipeline

Worked example — predicting customer churn end-to-end

A single, complete supervised-learning project carried from raw data to an evaluated, tuned classifier. We predict which telecom customers will churn (cancel their subscription) — a canonical binary-classification business problem. Every step below is real, runnable Python (scikit-learn, pandas, numpy); the underlying mathematics is written out in KaTeX so the code and the theory line up.

Goal

Given a customer's contract, billing and usage attributes, output a calibrated probability $\hat p = P(\text{churn}=1 \mid \mathbf{x})$ and a class decision. The business cares more about recall on churners (catching at-risk customers before they leave) than raw accuracy, so we evaluate with the full confusion matrix, precision–recall and ROC-AUC rather than accuracy alone.

Stack

Language
Python 3.11
Core libs
scikit-learn · pandas · numpy
Task
Binary classification
Models
Baseline · Logistic · Tree · Random Forest
Validation
Stratified 5-fold CV
Tuning
GridSearchCV

Sessions exercised

This project deliberately threads together most of the course's first half. It maps onto the syllabus as follows (see the program):

S3 · data prep S4 · feature engineering S5 · fundamental algorithms S6 · supervised training S7 · first end-to-end practice S9 · imbalanced classes S11 · pipelines & hyper-tuning S13 · model evaluation & CV S14 · performance metrics

It is the concrete companion to session 7's “A First Basic Practice — end to end practical case, from EDA to model training and validation” and session 12's “A First Practical Overview of Typical Applications.”

1 · Dataset & problem framing

We use the well-known Telco Customer Churn table (7,043 rows, ~20 features): a mix of categorical attributes (Contract, InternetService, PaymentMethod, …) and numeric ones (tenure, MonthlyCharges, TotalCharges). The target is Churn ∈ {Yes, No}, which we encode as $y \in \{1, 0\}$. The code reads from a local CSV; if it is absent it synthesises a structurally identical frame so the notebook always runs.

Class imbalance. Roughly 27% of customers churn. A model that always predicts “no churn” would already score ~73% accuracy while catching zero churners — the first reason accuracy is the wrong headline metric here.
import numpy as np
import pandas as pd
from pathlib import Path

RANDOM_STATE = 42

def load_churn(path="telco_churn.csv"):
    """Load the Telco churn table, or synthesise a stand-in with the same schema."""
    if Path(path).exists():
        df = pd.read_csv(path)
    else:
        rng = np.random.default_rng(RANDOM_STATE)
        n = 7043
        tenure = rng.integers(0, 73, n)
        monthly = rng.uniform(18, 119, n).round(2)
        contract = rng.choice(["Month-to-month", "One year", "Two year"],
                              size=n, p=[0.55, 0.21, 0.24])
        internet = rng.choice(["DSL", "Fiber optic", "No"], size=n, p=[0.34, 0.44, 0.22])
        # churn risk rises for short tenure, high bill, month-to-month, fiber
        logit = (-2.4
                 - 0.045 * tenure
                 + 0.018 * monthly
                 + (contract == "Month-to-month") * 1.6
                 + (internet == "Fiber optic") * 0.7)
        p = 1 / (1 + np.exp(-logit))
        churn = (rng.random(n) < p).astype(int)
        df = pd.DataFrame({
            "tenure": tenure,
            "MonthlyCharges": monthly,
            "TotalCharges": (tenure * monthly).round(2),
            "Contract": contract,
            "InternetService": internet,
            "PaymentMethod": rng.choice(
                ["Electronic check", "Mailed check",
                 "Bank transfer", "Credit card"], size=n),
            "Churn": np.where(churn == 1, "Yes", "No"),
        })
    return df

df = load_churn()
# TotalCharges sometimes arrives as blank strings in the real file:
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna(subset=["TotalCharges"]).reset_index(drop=True)

y = (df["Churn"] == "Yes").astype(int)
X = df.drop(columns=["Churn"])
print(X.shape, "churn rate:", round(y.mean(), 3))

# (7043, 6) churn rate: ~0.27

Why this framing

2 · Pipeline design — split, leakage, scaling

Train / test split

We hold out 20% of the data as a final, untouched test set, and stratify on $y$ so the churn rate is identical in both folds. The held-out set is touched exactly once, at the very end. (See the interactive train/test split & overfitting demo.)

Avoiding data leakage

The single most common project-killing bug: fitting any transformer (scaler, encoder, imputer) on the whole dataset before splitting. The test set's statistics then leak into training and inflate the score. The fix is to wrap every transform inside a Pipeline / ColumnTransformer so that .fit() only ever sees training rows, and cross-validation re-fits the transforms inside each fold.

Rule. Fit on train, transform on both. Standardisation uses $z = \dfrac{x - \mu_{\text{train}}}{\sigma_{\text{train}}}$ where $\mu_{\text{train}}, \sigma_{\text{train}}$ come from the training fold only — never the global mean. (See the feature scaling demo.)

Feature scaling & encoding

Logistic regression and distance-based models need comparable feature ranges, so numerics are standardised. Tree ensembles are scale-invariant, but we keep one preprocessor for all models so the comparison is apples-to-apples. Categoricals are one-hot encoded with handle_unknown="ignore" so unseen categories at test time do not crash.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
cat_cols = ["Contract", "InternetService", "PaymentMethod"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=RANDOM_STATE)

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, num_cols),
    ("cat", categorical, cat_cols),
])  # fit happens ONLY on training data, inside the pipeline

# preprocess is now a reusable, leakage-safe first stage for every model

3 · Step-by-step implementation

3.1 Baseline — always predict the majority class

Before any real model, establish the floor. A DummyClassifier that always predicts “no churn” fixes the number every later model must beat. If a fancy model can't beat the baseline, the features carry no signal.

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
pred = baseline.predict(X_test)
print("baseline accuracy:", round(accuracy_score(y_test, pred), 3))  # ~0.73
print("baseline F1 (churn):", round(f1_score(y_test, pred), 3))      # 0.0 — catches no churners

3.2 Logistic regression — the linear probabilistic model

Logistic regression models the log-odds as a linear function of the features and squashes it through the sigmoid:

$$\hat p = \sigma(\mathbf{w}^\top\mathbf{x} + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.$$

Parameters are fit by minimising the average binary cross-entropy (log loss):

$$J(\mathbf{w}, b) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\,y_i \log \hat p_i + (1 - y_i)\log(1 - \hat p_i)\,\Big].$$

The gradient with respect to a weight is strikingly simple — the error times the feature — which is exactly what gradient descent steps along:

$$\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}(\hat p_i - y_i)\,x_{ij}, \qquad w_j \leftarrow w_j - \eta\,\frac{\partial J}{\partial w_j}.$$

scikit-learn solves this with L-BFGS rather than vanilla gradient descent, but the objective is identical. See the interactive gradient descent and logistic regression demos for the geometry. We pass class_weight="balanced" to up-weight the rare churn class — the imbalanced-data lever from session 9.

from sklearn.linear_model import LogisticRegression

logit = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000,
                               class_weight="balanced",
                               random_state=RANDOM_STATE)),
])
logit.fit(X_train, y_train)

3.3 Decision tree — recursive impurity reduction

A decision tree greedily picks the split that most reduces node impurity. For Gini impurity on a node with class proportions $p_k$:

$$G = 1 - \sum_{k} p_k^2, \qquad \text{(entropy variant) } \; H = -\sum_{k} p_k \log_2 p_k.$$

The chosen split maximises the impurity drop $\Delta = G_{\text{parent}} - \big(\tfrac{n_L}{n}G_L + \tfrac{n_R}{n}G_R\big)$. Unconstrained trees drive training impurity to zero and overfit, so we cap depth. (See the decision tree demo.)

from sklearn.tree import DecisionTreeClassifier

tree = Pipeline([
    ("prep", preprocess),
    ("clf", DecisionTreeClassifier(max_depth=5,
                                   class_weight="balanced",
                                   random_state=RANDOM_STATE)),
])
tree.fit(X_train, y_train)

3.4 Random forest — bagging many trees

A single tree is high-variance. A random forest averages many de-correlated trees (each trained on a bootstrap sample, each split considering a random feature subset), trading a little bias for a large drop in variance — the ensembling idea from session 11.

from sklearn.ensemble import RandomForestClassifier

forest = Pipeline([
    ("prep", preprocess),
    ("clf", RandomForestClassifier(n_estimators=300,
                                   class_weight="balanced",
                                   random_state=RANDOM_STATE,
                                   n_jobs=-1)),
])
forest.fit(X_train, y_train)

3.5 Cross-validation — honest, low-variance estimates

A single train/test split gives a noisy estimate. Stratified $k$-fold CV splits the training data into $k$ folds, trains on $k-1$ and validates on the held-out fold, rotating through all $k$. The reported score is the mean $\pm$ standard deviation across folds. Because the whole pipeline (including the scaler/encoder) sits inside the estimator, each fold re-fits its preprocessing — no leakage.

This is session 13's “Cross-validation, Leave-one-out… bias-variance tradeoffs.”

from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

for name, model in [("logistic", logit), ("tree", tree), ("forest", forest)]:
    scores = cross_val_score(model, X_train, y_train,
                             cv=cv, scoring="roc_auc")
    print(f"{name:9s} CV ROC-AUC = {scores.mean():.3f} ± {scores.std():.3f}")

3.6 Hyperparameter tuning — GridSearchCV

Hyperparameters (regularisation strength $C$, tree depth, number of estimators) are not learned by fitting; they are searched. GridSearchCV wraps cross-validation around every candidate combination and refits the best on the full training set. We optimise for ROC-AUC, which is threshold-independent and robust to imbalance.

from sklearn.model_selection import GridSearchCV

param_grid = {
    "clf__n_estimators": [200, 400],
    "clf__max_depth": [None, 8, 14],
    "clf__min_samples_leaf": [1, 4, 10],
}
search = GridSearchCV(forest, param_grid, cv=cv,
                      scoring="roc_auc", n_jobs=-1, refit=True)
search.fit(X_train, y_train)

best = search.best_estimator_
print("best params:", search.best_params_)
print("best CV ROC-AUC:", round(search.best_score_, 3))

# refit=True → `best` is already retrained on all of X_train with the winning params

3.7 Final evaluation on the held-out test set

Only now do we touch X_test — once. We pull predicted probabilities, pick an operating threshold, and compute every metric.

from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, average_precision_score,
                             RocCurveDisplay, PrecisionRecallDisplay)

proba = best.predict_proba(X_test)[:, 1]
pred  = (proba >= 0.50).astype(int)

print(classification_report(y_test, pred, target_names=["stay", "churn"]))
print("ROC-AUC :", round(roc_auc_score(y_test, proba), 3))
print("PR-AUC  :", round(average_precision_score(y_test, proba), 3))

cm = confusion_matrix(y_test, pred)   # rows = actual, cols = predicted
print(cm)

4 · Results

Representative numbers on the held-out test set (your exact figures will vary with the seed and whether the real CSV is present). The shape of the story is what matters: every model beats the baseline on the metrics that count, and the tuned forest leads on ranking quality (AUC).

Metrics comparison

ModelAccuracyPrecision RecallF1ROC-AUC
Baseline (majority)0.7340.0000.0000.500
Logistic regression0.7450.510.790.620.843
Decision tree (d=5)0.7620.540.680.600.812
Random forest (tuned)0.7810.580.720.640.857

Precision/recall reported for the positive (churn) class. The baseline's perfect 73% accuracy with 0% recall is the cautionary tale.

Confusion matrix — tuned random forest

For the threshold $\hat p \ge 0.5$ on the 1,409-row test set:

pred · stay
pred · churn
actual · stay
880
TN
155
FP
actual · churn
104
FN
270
TP

From these four counts every scalar metric follows: $\text{precision} = \tfrac{TP}{TP+FP}$, $\text{recall} = \tfrac{TP}{TP+FN}$, $F_1 = \tfrac{2\,PR}{P+R}$. Here recall $\approx 0.72$ (we catch ~72% of churners) at precision $\approx 0.64$. See the live confusion matrix & threshold demo.

Precision–recall & the threshold

The 0.5 threshold is a default, not a law. Lowering it raises recall (catch more churners) at the cost of precision (more false alarms). Because retention outreach is cheap relative to a lost customer, the business may push the threshold down to, say, 0.35 to lift recall toward 0.85 — a deliberate, cost-driven choice read straight off the precision–recall curve.

ROC curve & AUC

Sweeping the threshold from 1 to 0 traces the ROC curve of true-positive rate $\text{TPR} = \tfrac{TP}{TP+FN}$ against false-positive rate $\text{FPR} = \tfrac{FP}{FP+TN}$. The area under it,

$$\text{AUC} = \int_0^1 \text{TPR}\,\mathrm{d}(\text{FPR}) = P\big(\hat p_{+} > \hat p_{-}\big),$$

is the probability the model ranks a random churner above a random stayer. The tuned forest's AUC $\approx 0.857$ means it gets that ranking right ~86% of the time — well above the diagonal 0.5 of a coin flip. See the ROC / AUC demo. In code the whole curve is one call:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
RocCurveDisplay.from_estimator(best, X_test, y_test, ax=ax[0])
PrecisionRecallDisplay.from_estimator(best, X_test, y_test, ax=ax[1])
ax[0].plot([0, 1], [0, 1], "k--", lw=1)   # chance diagonal
ax[0].set_title("ROC"); ax[1].set_title("Precision–Recall")
plt.tight_layout(); plt.show()

5 · Bias–variance & error analysis

Expected test error of a model decomposes into three parts:

$$\mathbb{E}\big[(y - \hat f(\mathbf{x}))^2\big] = \underbrace{\big(\text{Bias}[\hat f]\big)^2}_{\text{under-fit}} + \underbrace{\text{Var}[\hat f]}_{\text{over-fit}} + \underbrace{\sigma^2}_{\text{irreducible}}.$$

The three models sit at different points on this curve:

The interactive bias–variance and overfitting demos make this tradeoff tangible. Diagnose it in code by comparing train vs CV scores: a large gap ⇒ variance (regularise / get more data); both low ⇒ bias (richer model / more features).

from sklearn.model_selection import learning_curve

# A widening train-vs-validation gap is the fingerprint of variance (overfitting).
train_sizes, train_sc, val_sc = learning_curve(
    best, X_train, y_train, cv=cv, scoring="roc_auc",
    train_sizes=np.linspace(0.1, 1.0, 6), n_jobs=-1)
print("train AUC:", train_sc.mean(axis=1).round(3))
print("valid AUC:", val_sc.mean(axis=1).round(3))

Error analysis

Beyond aggregate metrics, inspect where the model errs. Feature importances (or logistic coefficients) show Contract=Month-to-month, short tenure and Fiber optic service dominate churn risk — consistent with the data-generating story. Slicing recall by contract type often reveals the model is weakest exactly where churn is rarest, pointing to the next round of feature work.

6 · Mapping to learning outcomes

Each project step lands on a concrete syllabus outcome and an interactive demo:

Project stepSyllabus session / outcomeDemo
Load, clean, impute, encodeS3 data prep · S4 feature engineeringscaling
Train/test split, leakage avoidanceS6 generalization & overfittingtrain/test
Logistic regression + cross-entropyS5 loss & gradient descentgrad. descent · logistic
Decision tree / random forestS5 algorithms · S11 ensemblestree
Stratified k-fold cross-validationS13 model evaluationbias–variance
GridSearchCV tuningS11 pipelines & hyper-tuning
Confusion matrix, P/R, F1S14 performance metricsconfusion
ROC curve & AUCS14 sensitivity / specificityROC/AUC
Class weighting / imbalanceS9 imbalanced classes
Bias–variance diagnosisS13 bias-variance tradeoffsbias–variance

7 · Extensions

  1. Probability calibration. Wrap the forest in CalibratedClassifierCV so $\hat p$ can be read as a true probability — essential when a downstream business rule thresholds on it.
  2. Gradient-boosted trees. Swap in HistGradientBoostingClassifier or XGBoost; usually the single biggest AUC jump on tabular data.
  3. Resampling for imbalance. Compare class_weight against SMOTE oversampling (imbalanced-learn) inside the CV pipeline.
  4. Dimensionality reduction. Add a PCA step to visualise class separation and test whether it helps the linear model. (See the PCA demo.)
  5. Explainability. Use permutation importance or SHAP values for per-customer reason codes — the interpretability theme of session 22.
  6. Cost-sensitive thresholding. Choose the operating point that minimises expected business cost given the price of a false negative vs a false positive.

8 · References