Worked example — predicting customer churn end-to-end
A single, complete supervised-learning project carried from raw data to an evaluated, tuned classifier. We predict which telecom customers will churn (cancel their subscription) — a canonical binary-classification business problem. Every step below is real, runnable Python (scikit-learn, pandas, numpy); the underlying mathematics is written out in KaTeX so the code and the theory line up.
Goal
Given a customer's contract, billing and usage attributes, output a calibrated probability $\hat p = P(\text{churn}=1 \mid \mathbf{x})$ and a class decision. The business cares more about recall on churners (catching at-risk customers before they leave) than raw accuracy, so we evaluate with the full confusion matrix, precision–recall and ROC-AUC rather than accuracy alone.
Stack
Sessions exercised
This project deliberately threads together most of the course's first half. It maps onto the syllabus as follows (see the program):
S3 · data prep S4 · feature engineering S5 · fundamental algorithms S6 · supervised training S7 · first end-to-end practice S9 · imbalanced classes S11 · pipelines & hyper-tuning S13 · model evaluation & CV S14 · performance metrics
It is the concrete companion to session 7's “A First Basic Practice — end to end practical case, from EDA to model training and validation” and session 12's “A First Practical Overview of Typical Applications.”
1 · Dataset & problem framing
We use the well-known Telco Customer Churn table (7,043 rows, ~20 features): a mix of categorical attributes (Contract, InternetService, PaymentMethod, …) and numeric ones (tenure, MonthlyCharges, TotalCharges). The target is Churn ∈ {Yes, No}, which we encode as $y \in \{1, 0\}$. The code reads from a local CSV; if it is absent it synthesises a structurally identical frame so the notebook always runs.
import numpy as np
import pandas as pd
from pathlib import Path
RANDOM_STATE = 42
def load_churn(path="telco_churn.csv"):
"""Load the Telco churn table, or synthesise a stand-in with the same schema."""
if Path(path).exists():
df = pd.read_csv(path)
else:
rng = np.random.default_rng(RANDOM_STATE)
n = 7043
tenure = rng.integers(0, 73, n)
monthly = rng.uniform(18, 119, n).round(2)
contract = rng.choice(["Month-to-month", "One year", "Two year"],
size=n, p=[0.55, 0.21, 0.24])
internet = rng.choice(["DSL", "Fiber optic", "No"], size=n, p=[0.34, 0.44, 0.22])
# churn risk rises for short tenure, high bill, month-to-month, fiber
logit = (-2.4
- 0.045 * tenure
+ 0.018 * monthly
+ (contract == "Month-to-month") * 1.6
+ (internet == "Fiber optic") * 0.7)
p = 1 / (1 + np.exp(-logit))
churn = (rng.random(n) < p).astype(int)
df = pd.DataFrame({
"tenure": tenure,
"MonthlyCharges": monthly,
"TotalCharges": (tenure * monthly).round(2),
"Contract": contract,
"InternetService": internet,
"PaymentMethod": rng.choice(
["Electronic check", "Mailed check",
"Bank transfer", "Credit card"], size=n),
"Churn": np.where(churn == 1, "Yes", "No"),
})
return df
df = load_churn()
# TotalCharges sometimes arrives as blank strings in the real file:
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna(subset=["TotalCharges"]).reset_index(drop=True)
y = (df["Churn"] == "Yes").astype(int)
X = df.drop(columns=["Churn"])
print(X.shape, "churn rate:", round(y.mean(), 3))
# (7043, 6) churn rate: ~0.27
Why this framing
- Supervised, binary: labelled outcome, two classes — the cleanest setting to exercise the metrics from session 14.
- Mixed feature types: forces a real preprocessing pipeline (scaling numerics, one-hot encoding categoricals) rather than a toy make_classification blob.
- Imbalanced & cost-sensitive: mirrors session 9's “managing imbalanced classes” — a false negative (a churner we miss) costs far more than a false positive.
2 · Pipeline design — split, leakage, scaling
Train / test split
We hold out 20% of the data as a final, untouched test set, and stratify on $y$ so the churn rate is identical in both folds. The held-out set is touched exactly once, at the very end. (See the interactive train/test split & overfitting demo.)
Avoiding data leakage
The single most common project-killing bug: fitting any transformer (scaler, encoder, imputer)
on the whole dataset before splitting. The test set's statistics then leak into training and
inflate the score. The fix is to wrap every transform inside a
Pipeline / ColumnTransformer so that
.fit() only ever sees training rows, and cross-validation re-fits the
transforms inside each fold.
Feature scaling & encoding
Logistic regression and distance-based models need comparable feature ranges, so numerics are
standardised. Tree ensembles are scale-invariant, but we keep one preprocessor for all models so the
comparison is apples-to-apples. Categoricals are one-hot encoded with
handle_unknown="ignore" so unseen categories at test time do not crash.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
cat_cols = ["Contract", "InternetService", "PaymentMethod"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, stratify=y, random_state=RANDOM_STATE)
numeric = Pipeline([
("impute", SimpleImputer(strategy="median")),
("scale", StandardScaler()),
])
categorical = Pipeline([
("impute", SimpleImputer(strategy="most_frequent")),
("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
("num", numeric, num_cols),
("cat", categorical, cat_cols),
]) # fit happens ONLY on training data, inside the pipeline
# preprocess is now a reusable, leakage-safe first stage for every model
3 · Step-by-step implementation
3.1 Baseline — always predict the majority class
Before any real model, establish the floor. A DummyClassifier that
always predicts “no churn” fixes the number every later model must beat. If a fancy model can't
beat the baseline, the features carry no signal.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
pred = baseline.predict(X_test)
print("baseline accuracy:", round(accuracy_score(y_test, pred), 3)) # ~0.73
print("baseline F1 (churn):", round(f1_score(y_test, pred), 3)) # 0.0 — catches no churners
3.2 Logistic regression — the linear probabilistic model
Logistic regression models the log-odds as a linear function of the features and squashes it through the sigmoid:
$$\hat p = \sigma(\mathbf{w}^\top\mathbf{x} + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.$$
Parameters are fit by minimising the average binary cross-entropy (log loss):
$$J(\mathbf{w}, b) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\,y_i \log \hat p_i + (1 - y_i)\log(1 - \hat p_i)\,\Big].$$
The gradient with respect to a weight is strikingly simple — the error times the feature — which is exactly what gradient descent steps along:
$$\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}(\hat p_i - y_i)\,x_{ij}, \qquad w_j \leftarrow w_j - \eta\,\frac{\partial J}{\partial w_j}.$$
scikit-learn solves this with L-BFGS rather than vanilla gradient descent, but the
objective is identical. See the interactive
gradient descent and
logistic regression demos for the geometry. We pass
class_weight="balanced" to up-weight the rare churn class — the
imbalanced-data lever from session 9.
from sklearn.linear_model import LogisticRegression
logit = Pipeline([
("prep", preprocess),
("clf", LogisticRegression(max_iter=1000,
class_weight="balanced",
random_state=RANDOM_STATE)),
])
logit.fit(X_train, y_train)
3.3 Decision tree — recursive impurity reduction
A decision tree greedily picks the split that most reduces node impurity. For Gini impurity on a node with class proportions $p_k$:
$$G = 1 - \sum_{k} p_k^2, \qquad \text{(entropy variant) } \; H = -\sum_{k} p_k \log_2 p_k.$$
The chosen split maximises the impurity drop $\Delta = G_{\text{parent}} - \big(\tfrac{n_L}{n}G_L + \tfrac{n_R}{n}G_R\big)$. Unconstrained trees drive training impurity to zero and overfit, so we cap depth. (See the decision tree demo.)
from sklearn.tree import DecisionTreeClassifier
tree = Pipeline([
("prep", preprocess),
("clf", DecisionTreeClassifier(max_depth=5,
class_weight="balanced",
random_state=RANDOM_STATE)),
])
tree.fit(X_train, y_train)
3.4 Random forest — bagging many trees
A single tree is high-variance. A random forest averages many de-correlated trees (each trained on a bootstrap sample, each split considering a random feature subset), trading a little bias for a large drop in variance — the ensembling idea from session 11.
from sklearn.ensemble import RandomForestClassifier
forest = Pipeline([
("prep", preprocess),
("clf", RandomForestClassifier(n_estimators=300,
class_weight="balanced",
random_state=RANDOM_STATE,
n_jobs=-1)),
])
forest.fit(X_train, y_train)
3.5 Cross-validation — honest, low-variance estimates
A single train/test split gives a noisy estimate. Stratified $k$-fold CV splits the training data into $k$ folds, trains on $k-1$ and validates on the held-out fold, rotating through all $k$. The reported score is the mean $\pm$ standard deviation across folds. Because the whole pipeline (including the scaler/encoder) sits inside the estimator, each fold re-fits its preprocessing — no leakage.
This is session 13's “Cross-validation, Leave-one-out… bias-variance tradeoffs.”
from sklearn.model_selection import StratifiedKFold, cross_val_score
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
for name, model in [("logistic", logit), ("tree", tree), ("forest", forest)]:
scores = cross_val_score(model, X_train, y_train,
cv=cv, scoring="roc_auc")
print(f"{name:9s} CV ROC-AUC = {scores.mean():.3f} ± {scores.std():.3f}")
3.6 Hyperparameter tuning — GridSearchCV
Hyperparameters (regularisation strength $C$, tree depth, number of estimators) are not
learned by fitting; they are searched. GridSearchCV wraps cross-validation
around every candidate combination and refits the best on the full training set. We optimise for
ROC-AUC, which is threshold-independent and robust to imbalance.
from sklearn.model_selection import GridSearchCV
param_grid = {
"clf__n_estimators": [200, 400],
"clf__max_depth": [None, 8, 14],
"clf__min_samples_leaf": [1, 4, 10],
}
search = GridSearchCV(forest, param_grid, cv=cv,
scoring="roc_auc", n_jobs=-1, refit=True)
search.fit(X_train, y_train)
best = search.best_estimator_
print("best params:", search.best_params_)
print("best CV ROC-AUC:", round(search.best_score_, 3))
# refit=True → `best` is already retrained on all of X_train with the winning params
3.7 Final evaluation on the held-out test set
Only now do we touch X_test — once. We pull predicted probabilities,
pick an operating threshold, and compute every metric.
from sklearn.metrics import (classification_report, confusion_matrix,
roc_auc_score, average_precision_score,
RocCurveDisplay, PrecisionRecallDisplay)
proba = best.predict_proba(X_test)[:, 1]
pred = (proba >= 0.50).astype(int)
print(classification_report(y_test, pred, target_names=["stay", "churn"]))
print("ROC-AUC :", round(roc_auc_score(y_test, proba), 3))
print("PR-AUC :", round(average_precision_score(y_test, proba), 3))
cm = confusion_matrix(y_test, pred) # rows = actual, cols = predicted
print(cm)
4 · Results
Representative numbers on the held-out test set (your exact figures will vary with the seed and whether the real CSV is present). The shape of the story is what matters: every model beats the baseline on the metrics that count, and the tuned forest leads on ranking quality (AUC).
Metrics comparison
| Model | Accuracy | Precision | Recall | F1 | ROC-AUC |
|---|---|---|---|---|---|
| Baseline (majority) | 0.734 | — | 0.000 | 0.000 | 0.500 |
| Logistic regression | 0.745 | 0.51 | 0.79 | 0.62 | 0.843 |
| Decision tree (d=5) | 0.762 | 0.54 | 0.68 | 0.60 | 0.812 |
| Random forest (tuned) | 0.781 | 0.58 | 0.72 | 0.64 | 0.857 |
Precision/recall reported for the positive (churn) class. The baseline's perfect 73% accuracy with 0% recall is the cautionary tale.
Confusion matrix — tuned random forest
For the threshold $\hat p \ge 0.5$ on the 1,409-row test set:
From these four counts every scalar metric follows: $\text{precision} = \tfrac{TP}{TP+FP}$, $\text{recall} = \tfrac{TP}{TP+FN}$, $F_1 = \tfrac{2\,PR}{P+R}$. Here recall $\approx 0.72$ (we catch ~72% of churners) at precision $\approx 0.64$. See the live confusion matrix & threshold demo.
Precision–recall & the threshold
The 0.5 threshold is a default, not a law. Lowering it raises recall (catch more churners) at the cost of precision (more false alarms). Because retention outreach is cheap relative to a lost customer, the business may push the threshold down to, say, 0.35 to lift recall toward 0.85 — a deliberate, cost-driven choice read straight off the precision–recall curve.
ROC curve & AUC
Sweeping the threshold from 1 to 0 traces the ROC curve of true-positive rate $\text{TPR} = \tfrac{TP}{TP+FN}$ against false-positive rate $\text{FPR} = \tfrac{FP}{FP+TN}$. The area under it,
$$\text{AUC} = \int_0^1 \text{TPR}\,\mathrm{d}(\text{FPR}) = P\big(\hat p_{+} > \hat p_{-}\big),$$
is the probability the model ranks a random churner above a random stayer. The tuned forest's AUC $\approx 0.857$ means it gets that ranking right ~86% of the time — well above the diagonal 0.5 of a coin flip. See the ROC / AUC demo. In code the whole curve is one call:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
RocCurveDisplay.from_estimator(best, X_test, y_test, ax=ax[0])
PrecisionRecallDisplay.from_estimator(best, X_test, y_test, ax=ax[1])
ax[0].plot([0, 1], [0, 1], "k--", lw=1) # chance diagonal
ax[0].set_title("ROC"); ax[1].set_title("Precision–Recall")
plt.tight_layout(); plt.show()
5 · Bias–variance & error analysis
Expected test error of a model decomposes into three parts:
$$\mathbb{E}\big[(y - \hat f(\mathbf{x}))^2\big] = \underbrace{\big(\text{Bias}[\hat f]\big)^2}_{\text{under-fit}} + \underbrace{\text{Var}[\hat f]}_{\text{over-fit}} + \underbrace{\sigma^2}_{\text{irreducible}}.$$
The three models sit at different points on this curve:
- Logistic regression is high-bias / low-variance: its linear boundary can't capture interactions, so train and test scores are close but both modest. It under-fits.
- A deep, unpruned tree is the opposite — near-perfect training accuracy, a large train→test gap. It over-fits (memorises noise).
- The random forest tames a tree's variance by averaging, landing near the sweet spot — which is why it tops the AUC table.
The interactive bias–variance and overfitting demos make this tradeoff tangible. Diagnose it in code by comparing train vs CV scores: a large gap ⇒ variance (regularise / get more data); both low ⇒ bias (richer model / more features).
from sklearn.model_selection import learning_curve
# A widening train-vs-validation gap is the fingerprint of variance (overfitting).
train_sizes, train_sc, val_sc = learning_curve(
best, X_train, y_train, cv=cv, scoring="roc_auc",
train_sizes=np.linspace(0.1, 1.0, 6), n_jobs=-1)
print("train AUC:", train_sc.mean(axis=1).round(3))
print("valid AUC:", val_sc.mean(axis=1).round(3))
Error analysis
Beyond aggregate metrics, inspect where the model errs. Feature importances (or logistic coefficients) show Contract=Month-to-month, short tenure and Fiber optic service dominate churn risk — consistent with the data-generating story. Slicing recall by contract type often reveals the model is weakest exactly where churn is rarest, pointing to the next round of feature work.
6 · Mapping to learning outcomes
Each project step lands on a concrete syllabus outcome and an interactive demo:
| Project step | Syllabus session / outcome | Demo |
|---|---|---|
| Load, clean, impute, encode | S3 data prep · S4 feature engineering | scaling |
| Train/test split, leakage avoidance | S6 generalization & overfitting | train/test |
| Logistic regression + cross-entropy | S5 loss & gradient descent | grad. descent · logistic |
| Decision tree / random forest | S5 algorithms · S11 ensembles | tree |
| Stratified k-fold cross-validation | S13 model evaluation | bias–variance |
| GridSearchCV tuning | S11 pipelines & hyper-tuning | — |
| Confusion matrix, P/R, F1 | S14 performance metrics | confusion |
| ROC curve & AUC | S14 sensitivity / specificity | ROC/AUC |
| Class weighting / imbalance | S9 imbalanced classes | — |
| Bias–variance diagnosis | S13 bias-variance tradeoffs | bias–variance |
7 · Extensions
- Probability calibration. Wrap the forest in
CalibratedClassifierCVso $\hat p$ can be read as a true probability — essential when a downstream business rule thresholds on it. - Gradient-boosted trees. Swap in
HistGradientBoostingClassifieror XGBoost; usually the single biggest AUC jump on tabular data. - Resampling for imbalance. Compare class_weight against SMOTE oversampling (imbalanced-learn) inside the CV pipeline.
- Dimensionality reduction. Add a PCA step to visualise class separation and test whether it helps the linear model. (See the PCA demo.)
- Explainability. Use permutation importance or SHAP values for per-customer reason codes — the interpretability theme of session 22.
- Cost-sensitive thresholding. Choose the operating point that minimises expected business cost given the price of a false negative vs a false positive.
8 · References
- Burkov, A. (2020). Machine Learning Engineering. True Positive Inc. — course compulsory text (esp. Ch. 3 data, Ch. 4 features, Ch. 5–6 supervised training, Ch. 7 evaluation).
- Burkov, A. (2019). The Hundred-Page Machine Learning Book. — recommended.
- Pedregosa et al. (2011). scikit-learn: Machine Learning in Python. JMLR 12, 2825–2830. See the User Guide on Pipeline, ColumnTransformer, GridSearchCV and model evaluation.
- Hastie, Tibshirani & Friedman (2009). The Elements of Statistical Learning — bias–variance decomposition, trees, ensembles.
- ml-lab interactive demos: index.html · course outline: course.html.