ml-lab · worked example — end-to-end churn prediction

Worked example — predicting customer churn end-to-end

A single, complete supervised-learning project carried from raw data to an evaluated, tuned classifier. We predict which telecom customers will churn (cancel their subscription) — a canonical binary-classification business problem. Every step below is real, runnable Python (scikit-learn, pandas, numpy); the underlying mathematics is written out in KaTeX so the code and the theory line up.

Goal

Given a customer's contract, billing and usage attributes, output a calibrated probability $\hat p = P(\text{churn}=1 \mid \mathbf{x})$ and a class decision. The business cares more about recall on churners (catching at-risk customers before they leave) than raw accuracy, so we evaluate with the full confusion matrix, precision–recall and ROC-AUC rather than accuracy alone.

Stack

Language

Python 3.11

Core libs

scikit-learn · pandas · numpy

Task

Binary classification

Models

Baseline · Logistic · Tree · Random Forest

Validation

Stratified 5-fold CV

Tuning

GridSearchCV

Sessions exercised

This project deliberately threads together most of the course's first half. It maps onto the syllabus as follows (see the program):

S3 · data prep S4 · feature engineering S5 · fundamental algorithms S6 · supervised training S7 · first end-to-end practice S9 · imbalanced classes S11 · pipelines & hyper-tuning S13 · model evaluation & CV S14 · performance metrics

It is the concrete companion to session 7's “A First Basic Practice — end to end practical case, from EDA to model training and validation” and session 12's “A First Practical Overview of Typical Applications.”

1 · Dataset & problem framing

We use the well-known Telco Customer Churn table (7,043 rows, ~20 features): a mix of categorical attributes (Contract, InternetService, PaymentMethod, …) and numeric ones (tenure, MonthlyCharges, TotalCharges). The target is Churn ∈ {Yes, No}, which we encode as $y \in \{1, 0\}$. The code reads from a local CSV; if it is absent it synthesises a structurally identical frame so the notebook always runs.

Class imbalance. Roughly 27% of customers churn. A model that always predicts “no churn” would already score ~73% accuracy while catching zero churners — the first reason accuracy is the wrong headline metric here.

import numpy as np
import pandas as pd
from pathlib import Path

RANDOM_STATE = 42

def load_churn(path="telco_churn.csv"):
    """Load the Telco churn table, or synthesise a stand-in with the same schema."""
    if Path(path).exists():
        df = pd.read_csv(path)
    else:
        rng = np.random.default_rng(RANDOM_STATE)
        n = 7043
        tenure = rng.integers(0, 73, n)
        monthly = rng.uniform(18, 119, n).round(2)
        contract = rng.choice(["Month-to-month", "One year", "Two year"],
                              size=n, p=[0.55, 0.21, 0.24])
        internet = rng.choice(["DSL", "Fiber optic", "No"], size=n, p=[0.34, 0.44, 0.22])
        # churn risk rises for short tenure, high bill, month-to-month, fiber
        logit = (-2.4
                 - 0.045 * tenure
                 + 0.018 * monthly
                 + (contract == "Month-to-month") * 1.6
                 + (internet == "Fiber optic") * 0.7)
        p = 1 / (1 + np.exp(-logit))
        churn = (rng.random(n) < p).astype(int)
        df = pd.DataFrame({
            "tenure": tenure,
            "MonthlyCharges": monthly,
            "TotalCharges": (tenure * monthly).round(2),
            "Contract": contract,
            "InternetService": internet,
            "PaymentMethod": rng.choice(
                ["Electronic check", "Mailed check",
                 "Bank transfer", "Credit card"], size=n),
            "Churn": np.where(churn == 1, "Yes", "No"),
        })
    return df

df = load_churn()
# TotalCharges sometimes arrives as blank strings in the real file:
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna(subset=["TotalCharges"]).reset_index(drop=True)

y = (df["Churn"] == "Yes").astype(int)
X = df.drop(columns=["Churn"])
print(X.shape, "churn rate:", round(y.mean(), 3))

# (7043, 6) churn rate: ~0.27

Why this framing

Supervised, binary: labelled outcome, two classes — the cleanest setting to exercise the metrics from session 14.
Mixed feature types: forces a real preprocessing pipeline (scaling numerics, one-hot encoding categoricals) rather than a toy make_classification blob.
Imbalanced & cost-sensitive: mirrors session 9's “managing imbalanced classes” — a false negative (a churner we miss) costs far more than a false positive.

2 · Pipeline design — split, leakage, scaling

Train / test split

We hold out 20% of the data as a final, untouched test set, and stratify on $y$ so the churn rate is identical in both folds. The held-out set is touched exactly once, at the very end. (See the interactive train/test split & overfitting demo.)

Avoiding data leakage

The single most common project-killing bug: fitting any transformer (scaler, encoder, imputer) on the whole dataset before splitting. The test set's statistics then leak into training and inflate the score. The fix is to wrap every transform inside a Pipeline / ColumnTransformer so that .fit() only ever sees training rows, and cross-validation re-fits the transforms inside each fold.

Rule. Fit on train, transform on both. Standardisation uses $z = \dfrac{x - \mu_{\text{train}}}{\sigma_{\text{train}}}$ where $\mu_{\text{train}}, \sigma_{\text{train}}$ come from the training fold only — never the global mean. (See the feature scaling demo.)

Feature scaling & encoding

Logistic regression and distance-based models need comparable feature ranges, so numerics are standardised. Tree ensembles are scale-invariant, but we keep one preprocessor for all models so the comparison is apples-to-apples. Categoricals are one-hot encoded with handle_unknown="ignore" so unseen categories at test time do not crash.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
cat_cols = ["Contract", "InternetService", "PaymentMethod"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=RANDOM_STATE)

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, num_cols),
    ("cat", categorical, cat_cols),
])  # fit happens ONLY on training data, inside the pipeline

# preprocess is now a reusable, leakage-safe first stage for every model

3 · Step-by-step implementation

3.1 Baseline — always predict the majority class

Before any real model, establish the floor. A DummyClassifier that always predicts “no churn” fixes the number every later model must beat. If a fancy model can't beat the baseline, the features carry no signal.

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
pred = baseline.predict(X_test)
print("baseline accuracy:", round(accuracy_score(y_test, pred), 3))  # ~0.73
print("baseline F1 (churn):", round(f1_score(y_test, pred), 3))      # 0.0 — catches no churners

3.2 Logistic regression — the linear probabilistic model

Logistic regression models the log-odds as a linear function of the features and squashes it through the sigmoid:

$$\hat p = \sigma(\mathbf{w}^\top\mathbf{x} + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.$$

Parameters are fit by minimising the average binary cross-entropy (log loss):

$$J(\mathbf{w}, b) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\,y_i \log \hat p_i + (1 - y_i)\log(1 - \hat p_i)\,\Big].$$

The gradient with respect to a weight is strikingly simple — the error times the feature — which is exactly what gradient descent steps along:

$$\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}(\hat p_i - y_i)\,x_{ij}, \qquad w_j \leftarrow w_j - \eta\,\frac{\partial J}{\partial w_j}.$$

scikit-learn solves this with L-BFGS rather than vanilla gradient descent, but the objective is identical. See the interactive gradient descent and logistic regression demos for the geometry. We pass class_weight="balanced" to up-weight the rare churn class — the imbalanced-data lever from session 9.

from sklearn.linear_model import LogisticRegression

logit = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000,
                               class_weight="balanced",
                               random_state=RANDOM_STATE)),
])
logit.fit(X_train, y_train)

3.3 Decision tree — recursive impurity reduction

A decision tree greedily picks the split that most reduces node impurity. For Gini impurity on a node with class proportions $p_k$:

$$G = 1 - \sum_{k} p_k^2, \qquad \text{(entropy variant) } \; H = -\sum_{k} p_k \log_2 p_k.$$

The chosen split maximises the impurity drop $\Delta = G_{\text{parent}} - \big(\tfrac{n_L}{n}G_L + \tfrac{n_R}{n}G_R\big)$. Unconstrained trees drive training impurity to zero and overfit, so we cap depth. (See the decision tree demo.)

from sklearn.tree import DecisionTreeClassifier

tree = Pipeline([
    ("prep", preprocess),
    ("clf", DecisionTreeClassifier(max_depth=5,
                                   class_weight="balanced",
                                   random_state=RANDOM_STATE)),
])
tree.fit(X_train, y_train)

3.4 Random forest — bagging many trees

A single tree is high-variance. A random forest averages many de-correlated trees (each trained on a bootstrap sample, each split considering a random feature subset), trading a little bias for a large drop in variance — the ensembling idea from session 11.

from sklearn.ensemble import RandomForestClassifier

forest = Pipeline([
    ("prep", preprocess),
    ("clf", RandomForestClassifier(n_estimators=300,
                                   class_weight="balanced",
                                   random_state=RANDOM_STATE,
                                   n_jobs=-1)),
])
forest.fit(X_train, y_train)

3.5 Cross-validation — honest, low-variance estimates

A single train/test split gives a noisy estimate. Stratified $k$-fold CV splits the training data into $k$ folds, trains on $k-1$ and validates on the held-out fold, rotating through all $k$. The reported score is the mean $\pm$ standard deviation across folds. Because the whole pipeline (including the scaler/encoder) sits inside the estimator, each fold re-fits its preprocessing — no leakage.

This is session 13's “Cross-validation, Leave-one-out… bias-variance tradeoffs.”

from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

for name, model in [("logistic", logit), ("tree", tree), ("forest", forest)]:
    scores = cross_val_score(model, X_train, y_train,
                             cv=cv, scoring="roc_auc")
    print(f"{name:9s} CV ROC-AUC = {scores.mean():.3f} ± {scores.std():.3f}")

3.6 Hyperparameter tuning — GridSearchCV

Hyperparameters (regularisation strength $C$, tree depth, number of estimators) are not learned by fitting; they are searched. GridSearchCV wraps cross-validation around every candidate combination and refits the best on the full training set. We optimise for ROC-AUC, which is threshold-independent and robust to imbalance.

from sklearn.model_selection import GridSearchCV

param_grid = {
    "clf__n_estimators": [200, 400],
    "clf__max_depth": [None, 8, 14],
    "clf__min_samples_leaf": [1, 4, 10],
}
search = GridSearchCV(forest, param_grid, cv=cv,
                      scoring="roc_auc", n_jobs=-1, refit=True)
search.fit(X_train, y_train)

best = search.best_estimator_
print("best params:", search.best_params_)
print("best CV ROC-AUC:", round(search.best_score_, 3))

# refit=True → `best` is already retrained on all of X_train with the winning params

3.7 Final evaluation on the held-out test set

Only now do we touch X_test — once. We pull predicted probabilities, pick an operating threshold, and compute every metric.

from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, average_precision_score,
                             RocCurveDisplay, PrecisionRecallDisplay)

proba = best.predict_proba(X_test)[:, 1]
pred  = (proba >= 0.50).astype(int)

print(classification_report(y_test, pred, target_names=["stay", "churn"]))
print("ROC-AUC :", round(roc_auc_score(y_test, proba), 3))
print("PR-AUC  :", round(average_precision_score(y_test, proba), 3))

cm = confusion_matrix(y_test, pred)   # rows = actual, cols = predicted
print(cm)

4 · Results

Representative numbers on the held-out test set (your exact figures will vary with the seed and whether the real CSV is present). The shape of the story is what matters: every model beats the baseline on the metrics that count, and the tuned forest leads on ranking quality (AUC).

Metrics comparison

Model	Accuracy	Precision	Recall	F1	ROC-AUC
Baseline (majority)	0.734	—	0.000	0.000	0.500
Logistic regression	0.745	0.51	0.79	0.62	0.843
Decision tree (d=5)	0.762	0.54	0.68	0.60	0.812
Random forest (tuned)	0.781	0.58	0.72	0.64	0.857

Precision/recall reported for the positive (churn) class. The baseline's perfect 73% accuracy with 0% recall is the cautionary tale.

Confusion matrix — tuned random forest

For the threshold $\hat p \ge 0.5$ on the 1,409-row test set:

pred · stay

pred · churn

actual · stay

880

TN

155

FP

actual · churn

104

FN

270

TP

From these four counts every scalar metric follows: $\text{precision} = \tfrac{TP}{TP+FP}$, $\text{recall} = \tfrac{TP}{TP+FN}$, $F_1 = \tfrac{2\,PR}{P+R}$. Here recall $\approx 0.72$ (we catch ~72% of churners) at precision $\approx 0.64$. See the live confusion matrix & threshold demo.

Precision–recall & the threshold

The 0.5 threshold is a default, not a law. Lowering it raises recall (catch more churners) at the cost of precision (more false alarms). Because retention outreach is cheap relative to a lost customer, the business may push the threshold down to, say, 0.35 to lift recall toward 0.85 — a deliberate, cost-driven choice read straight off the precision–recall curve.

ROC curve & AUC

Sweeping the threshold from 1 to 0 traces the ROC curve of true-positive rate $\text{TPR} = \tfrac{TP}{TP+FN}$ against false-positive rate $\text{FPR} = \tfrac{FP}{FP+TN}$. The area under it,

$$\text{AUC} = \int_0^1 \text{TPR}\,\mathrm{d}(\text{FPR}) = P\big(\hat p_{+} > \hat p_{-}\big),$$

is the probability the model ranks a random churner above a random stayer. The tuned forest's AUC $\approx 0.857$ means it gets that ranking right ~86% of the time — well above the diagonal 0.5 of a coin flip. See the ROC / AUC demo. In code the whole curve is one call:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
RocCurveDisplay.from_estimator(best, X_test, y_test, ax=ax[0])
PrecisionRecallDisplay.from_estimator(best, X_test, y_test, ax=ax[1])
ax[0].plot([0, 1], [0, 1], "k--", lw=1)   # chance diagonal
ax[0].set_title("ROC"); ax[1].set_title("Precision–Recall")
plt.tight_layout(); plt.show()

5 · Bias–variance & error analysis

Expected test error of a model decomposes into three parts:

$$\mathbb{E}\big[(y - \hat f(\mathbf{x}))^2\big] = \underbrace{\big(\text{Bias}[\hat f]\big)^2}_{\text{under-fit}} + \underbrace{\text{Var}[\hat f]}_{\text{over-fit}} + \underbrace{\sigma^2}_{\text{irreducible}}.$$

The three models sit at different points on this curve:

Logistic regression is high-bias / low-variance: its linear boundary can't capture interactions, so train and test scores are close but both modest. It under-fits.
A deep, unpruned tree is the opposite — near-perfect training accuracy, a large train→test gap. It over-fits (memorises noise).
The random forest tames a tree's variance by averaging, landing near the sweet spot — which is why it tops the AUC table.

The interactive bias–variance and overfitting demos make this tradeoff tangible. Diagnose it in code by comparing train vs CV scores: a large gap ⇒ variance (regularise / get more data); both low ⇒ bias (richer model / more features).

from sklearn.model_selection import learning_curve

# A widening train-vs-validation gap is the fingerprint of variance (overfitting).
train_sizes, train_sc, val_sc = learning_curve(
    best, X_train, y_train, cv=cv, scoring="roc_auc",
    train_sizes=np.linspace(0.1, 1.0, 6), n_jobs=-1)
print("train AUC:", train_sc.mean(axis=1).round(3))
print("valid AUC:", val_sc.mean(axis=1).round(3))

Error analysis

Beyond aggregate metrics, inspect where the model errs. Feature importances (or logistic coefficients) show Contract=Month-to-month, short tenure and Fiber optic service dominate churn risk — consistent with the data-generating story. Slicing recall by contract type often reveals the model is weakest exactly where churn is rarest, pointing to the next round of feature work.

6 · Mapping to learning outcomes

Each project step lands on a concrete syllabus outcome and an interactive demo:

Project step	Syllabus session / outcome	Demo
Load, clean, impute, encode	S3 data prep · S4 feature engineering	scaling
Train/test split, leakage avoidance	S6 generalization & overfitting	train/test
Logistic regression + cross-entropy	S5 loss & gradient descent	grad. descent · logistic
Decision tree / random forest	S5 algorithms · S11 ensembles	tree
Stratified k-fold cross-validation	S13 model evaluation	bias–variance
GridSearchCV tuning	S11 pipelines & hyper-tuning	—
Confusion matrix, P/R, F1	S14 performance metrics	confusion
ROC curve & AUC	S14 sensitivity / specificity	ROC/AUC
Class weighting / imbalance	S9 imbalanced classes	—
Bias–variance diagnosis	S13 bias-variance tradeoffs	bias–variance

7 · Extensions

Probability calibration. Wrap the forest in CalibratedClassifierCV so $\hat p$ can be read as a true probability — essential when a downstream business rule thresholds on it.
Gradient-boosted trees. Swap in HistGradientBoostingClassifier or XGBoost; usually the single biggest AUC jump on tabular data.
Resampling for imbalance. Compare class_weight against SMOTE oversampling (imbalanced-learn) inside the CV pipeline.
Dimensionality reduction. Add a PCA step to visualise class separation and test whether it helps the linear model. (See the PCA demo.)
Explainability. Use permutation importance or SHAP values for per-customer reason codes — the interpretability theme of session 22.
Cost-sensitive thresholding. Choose the operating point that minimises expected business cost given the price of a false negative vs a false positive.

8 · References

Burkov, A. (2020). Machine Learning Engineering. True Positive Inc. — course compulsory text (esp. Ch. 3 data, Ch. 4 features, Ch. 5–6 supervised training, Ch. 7 evaluation).
Burkov, A. (2019). The Hundred-Page Machine Learning Book. — recommended.
Pedregosa et al. (2011). scikit-learn: Machine Learning in Python. JMLR 12, 2825–2830. See the User Guide on Pipeline, ColumnTransformer, GridSearchCV and model evaluation.
Hastie, Tibshirani & Friedman (2009). The Elements of Statistical Learning — bias–variance decomposition, trees, ensembles.
ml-lab interactive demos: index.html · course outline: course.html.