AI: Statistical Learning & PredictionWorked example project
Worked Example Project

A Predictive-Modeling Bake-Off

An end-to-end supervised-learning study on a tabular dataset. We pit regularized linear regression (ridge & lasso), k-nearest neighbours, a kernel SVM, a single decision tree, a random forest, and gradient boosting against one another — using leak-free preprocessing, nested-style train/validation/test splits with $k$-fold cross-validation, principled hyperparameter tuning, and an honest final comparison. Every code block is real, runnable scikit-learn; every estimator and metric is one you met in lecture.

TypeRegression (tabular)
StackPython · scikit-learn
DatasetCalifornia Housing
Models compared7
Validation5-fold CV + held-out test
Sessions exercisedS1, S10–16, S22
How to read this page. Sections (1)–(3) frame the problem and the experimental protocol; section (4) walks through the code one model at a time; sections (5)–(6) present and interpret the numbers. The blue links scattered throughout jump to the matching interactive demo so you can build intuition for each estimator, and to the relevant course session.

1Project overview

The goal of this project is the third learning objective of the course almost verbatim: “demonstrate practical knowledge on how to select an appropriate algorithm for a specific problem, and adequately assess its performance.” Rather than studying one algorithm in isolation, we run a controlled bake-off: the same data, the same splits, the same metric — and let the estimators compete.

What we are predicting

Given eight numeric features describing a California census block (median income, house age, average rooms, location, …) we predict the median house value for that block. This is a classic regression problem with a continuous target $y \in \mathbb{R}_{>0}$, where the model is a function $\hat f:\mathbb{R}^{8}\to\mathbb{R}$ fit to minimise expected squared error.

The competitors

Stack

Python 3.11· NumPy· pandas· scikit-learn 1.4· matplotlib

Everything below runs in a single script or notebook against a fixed random seed, so results are reproducible. Versions newer than 1.4 work too; the only API surface used is Pipeline, ColumnTransformer, GridSearchCV and the estimators.

2Dataset & problem statement

We use the California Housing dataset that ships with scikit-learn (derived from the 1990 U.S. census, 20,640 block groups). It is the textbook tabular regression benchmark: large enough to make cross-validation meaningful, small enough to fit every model on a laptop.

FeatureMeaningType
MedIncMedian income in block (×$10\,000)continuous
HouseAgeMedian house age in blockcontinuous
AveRoomsAverage rooms per householdcontinuous
AveBedrmsAverage bedrooms per householdcontinuous
PopulationBlock populationcontinuous
AveOccupAverage household occupancycontinuous
LatitudeBlock latitudecontinuous
LongitudeBlock longitudecontinuous
MedHouseValTarget — median house value (×$100\,000)continuous

The target is right-skewed and capped at $5.0$ (i.e. \$500k). All features are numeric, so there is no categorical encoding to do — but the scales differ wildly (Population in the thousands, AveBedrms near 1), which matters enormously for the distance- and penalty-based models (k-NN, SVR, ridge, lasso) and not at all for the trees. We handle this with standardisation inside a leak-safe pipeline (section 3).

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True)
X = data.data               # (20640, 8) DataFrame
y = data.target             # MedHouseVal, in $100k units

print(X.shape, y.shape)
print(X.describe().round(2))
Why this dataset for a bake-off? It mixes a strong linear signal (income → value) with clear non-linearities (the latitude/longitude coast effect), so linear models leave something on the table and tree ensembles can shine — exactly the contrast a model comparison is meant to expose.

3Methodology — splits, CV, leakage, metrics

A model comparison is only as trustworthy as its protocol. Three rules govern this project: (i) the test set is touched exactly once; (ii) every preprocessing step lives inside the cross-validation loop; (iii) all models are scored with the same metric on the same folds.

3.1 The three-way split

We hold out a test set (20%) up front and never look at it until the final table. The remaining 80% is the development set, on which we run $k$-fold cross-validation for both hyperparameter tuning and model selection. This separation is what keeps the final number an unbiased estimate of generalisation error.

from sklearn.model_selection import train_test_split

X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
# X_dev/y_dev → CV & tuning;  X_test/y_test → reported ONCE at the end

3.2 k-Fold cross-validation

Within the development set we use 5-fold CV (k-fold CV demo, S12). The data are split into $k=5$ folds; each fold is held out once for validation while the model trains on the other four. The cross-validation estimate of error is the average over folds,

$$ \mathrm{CV}_{(k)} \;=\; \frac{1}{k}\sum_{i=1}^{k} \mathrm{MSE}\!\left(\hat f^{(-i)},\, \mathcal{D}_i\right), $$ where $\hat f^{(-i)}$ is the model trained on all folds except fold $i$, and $\mathcal{D}_i$ is the held-out fold. Reporting its standard deviation across folds quantifies how stable the estimate is.

Averaging over folds gives a far lower-variance estimate of test error than a single holdout, which is why we use it for model selection rather than a one-off validation split.

3.3 Avoiding data leakage

The classic mistake: standardising (or imputing, or selecting features) on the whole dataset before splitting. The mean and standard deviation then “see” the validation fold, so the CV score is optimistic. The fix is to wrap every data-dependent transform in a Pipeline — scikit-learn then re-fits the scaler on each training fold only.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# The scaler is fit INSIDE each CV fold, never on the held-out data.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
])

Standardisation maps each feature to $z = (x-\mu)/\sigma$. Tree-based models are invariant to monotone feature rescaling, so we omit the scaler from their pipelines — but the leak-safety discipline is identical.

3.4 Metrics

For regression we report three complementary metrics, all evaluated on the same folds:

$$ \mathrm{RMSE}=\sqrt{\tfrac{1}{n}\textstyle\sum_i (y_i-\hat y_i)^2}, \qquad \mathrm{MAE}=\tfrac{1}{n}\textstyle\sum_i |y_i-\hat y_i|, \qquad R^2 = 1-\frac{\sum_i (y_i-\hat y_i)^2}{\sum_i (y_i-\bar y)^2}. $$ RMSE penalises large errors quadratically (sensitive to the capped \$500k outliers); MAE is robust and in target units; $R^2$ is the scale-free fraction of variance explained. We tune on RMSE.

Because scikit-learn maximises scores, we use "neg_root_mean_squared_error" during tuning and negate it back when reporting.

4Step-by-step implementation

Each model gets the same treatment: a leak-safe pipeline, a hyperparameter grid, and a GridSearchCV on the development set. We collect the best CV score (mean ± std) for the comparison table, then refit on all of X_dev and predict the test set once.

4.0 Shared scaffolding

from sklearn.model_selection import GridSearchCV, KFold
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

cv = KFold(n_splits=5, shuffle=True, random_state=42)
SCORING = "neg_root_mean_squared_error"
results = {}   # name -> dict of scores, filled in below

def evaluate(name, search):
    """Fit a GridSearchCV on the dev set, record CV + test metrics."""
    search.fit(X_dev, y_dev)
    best = search.best_estimator_
    cv_rmse = -search.best_score_                 # negate back to RMSE
    cv_std  = search.cv_results_["std_test_score"][search.best_index_]
    y_hat   = best.predict(X_test)
    results[name] = {
        "cv_rmse": cv_rmse, "cv_std": cv_std,
        "test_rmse": mean_squared_error(y_test, y_hat) ** 0.5,
        "test_mae":  mean_absolute_error(y_test, y_hat),
        "test_r2":   r2_score(y_test, y_hat),
        "best_params": search.best_params_,
        "estimator": best,
    }
    print(f"{name:18s}  CV RMSE = {cv_rmse:.3f} ± {cv_std:.3f}")
    return best

4.1 Baseline — predict the mean

Always start with a trivial baseline so every later number has a reference. A model that does no better than the mean has learned nothing.

from sklearn.dummy import DummyRegressor
evaluate("Baseline (mean)",
         GridSearchCV(DummyRegressor(strategy="mean"),
                      param_grid={}, scoring=SCORING, cv=cv))

4.2 Ridge regression ($\ell_2$)

Ridge minimises squared error plus an $\ell_2$ penalty that shrinks coefficients smoothly toward zero, trading a little bias for a large drop in variance:

$$ \hat\beta^{\,\text{ridge}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\bigl(y_i-\beta_0-x_i^{\top}\beta\bigr)^2 \;+\; \lambda\sum_{j=1}^{p}\beta_j^2. $$ $\lambda$ (scikit-learn’s alpha) controls the penalty. As $\lambda\to\infty$ the fit collapses to the intercept (high bias); as $\lambda\to 0$ it recovers OLS (high variance).
from sklearn.linear_model import Ridge

ridge = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
ridge_grid = {"model__alpha": np.logspace(-3, 3, 25)}
evaluate("Ridge", GridSearchCV(ridge, ridge_grid, scoring=SCORING, cv=cv))

4.3 Lasso regression ($\ell_1$)

Lasso swaps the $\ell_2$ penalty for $\ell_1$. The diamond-shaped constraint region has corners on the axes, so the solution often lands exactly on zero for some coefficients — automatic feature selection (S12).

$$ \hat\beta^{\,\text{lasso}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\bigl(y_i-\beta_0-x_i^{\top}\beta\bigr)^2 \;+\; \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert. $$ The non-differentiable $\lvert\cdot\rvert$ at $0$ is what produces sparsity. Watch coefficients snap to zero in the Ridge/Lasso demo.
from sklearn.linear_model import Lasso

lasso = Pipeline([("scale", StandardScaler()),
                  ("model", Lasso(max_iter=10000))])
lasso_grid = {"model__alpha": np.logspace(-4, 1, 25)}
best_lasso = evaluate("Lasso",
                     GridSearchCV(lasso, lasso_grid, scoring=SCORING, cv=cv))

# Inspect which features survived (non-zero coefficients):
coef = best_lasso.named_steps["model"].coef_
print(pd.Series(coef, index=X.columns).round(3))

4.4 k-Nearest Neighbours

k-NN predicts a point by averaging the targets of its $k$ closest neighbours in (scaled) feature space — a lazy, non-parametric learner with no training phase (S1). Small $k$ = low bias / high variance; large $k$ = the reverse. Scaling is mandatory because the distance metric is scale-sensitive.

from sklearn.neighbors import KNeighborsRegressor

knn = Pipeline([("scale", StandardScaler()),
                ("model", KNeighborsRegressor())])
knn_grid = {
    "model__n_neighbors": [5, 10, 15, 25, 40],
    "model__weights":     ["uniform", "distance"],
}
evaluate("k-NN", GridSearchCV(knn, knn_grid, scoring=SCORING, cv=cv))

4.5 Support Vector Regression (RBF kernel)

SVR fits a tube of width $\varepsilon$ around the data and penalises points outside it via slack variables, using the RBF kernel to capture non-linearity without an explicit feature map (the kernel trick, S17). The classification cousin uses the hinge loss $\max(0,\,1-y_i\,f(x_i))$; SVR uses the $\varepsilon$-insensitive loss:

$$ \min_{w,b}\;\tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\max\!\bigl(0,\;\lvert y_i-f(x_i)\rvert-\varepsilon\bigr), \qquad f(x)=\textstyle\sum_i \alpha_i\,K(x_i,x)+b. $$ $C$ trades margin width against violations; the RBF kernel $K(x,x')=\exp(-\gamma\lVert x-x'\rVert^2)$ has bandwidth $\gamma$. We subsample for tuning speed — full SVR on 16k rows is slow.
from sklearn.svm import SVR

svr = Pipeline([("scale", StandardScaler()),
                ("model", SVR(kernel="rbf"))])
svr_grid = {
    "model__C":     [1, 10, 30],
    "model__gamma": ["scale", 0.1, 0.3],
    "model__epsilon": [0.1, 0.2],
}
evaluate("SVR (RBF)", GridSearchCV(svr, svr_grid, scoring=SCORING, cv=cv, n_jobs=-1))

4.6 Decision tree

A single CART regressor recursively splits feature space on axis-aligned thresholds that most reduce within-node variance (decision-tree demo, S9). Depth is the complexity knob: deep trees memorise the training set (high variance), shallow ones underfit.

from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(random_state=42)   # trees need no scaling
tree_grid = {
    "max_depth":         [4, 6, 8, 12, None],
    "min_samples_leaf":  [1, 10, 50],
}
evaluate("Decision tree", GridSearchCV(tree, tree_grid, scoring=SCORING, cv=cv))

4.7 Random forest

A random forest averages many de-correlated trees: each is grown on a bootstrap resample (bagging, S14) and considers only a random subset of features at each split. Averaging cuts variance dramatically while leaving bias roughly unchanged (random-forest demo).

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=42, n_jobs=-1)
rf_grid = {
    "n_estimators":      [300],
    "max_features":      [0.5, "sqrt", 1.0],
    "min_samples_leaf":  [1, 2, 5],
}
best_rf = evaluate("Random forest",
                  GridSearchCV(rf, rf_grid, scoring=SCORING, cv=cv))

4.8 Gradient boosting

Boosting builds the ensemble additively: each new shallow tree fits the residual errors of the current ensemble, so the model reduces bias stage by stage (boosting demo, S15). The learning rate $\nu$ shrinks each tree’s contribution, $F_m = F_{m-1} + \nu\, h_m$, trading more trees for better generalisation.

from sklearn.ensemble import HistGradientBoostingRegressor

gb = HistGradientBoostingRegressor(random_state=42)
gb_grid = {
    "learning_rate":   [0.05, 0.1],
    "max_depth":       [3, 6, None],
    "max_iter":        [300, 600],
    "l2_regularization": [0.0, 1.0],
}
best_gb = evaluate("Gradient boosting",
                  GridSearchCV(gb, gb_grid, scoring=SCORING, cv=cv, n_jobs=-1))

4.9 Assemble the comparison table

summary = pd.DataFrame(results).T[
    ["cv_rmse", "cv_std", "test_rmse", "test_mae", "test_r2"]
].astype(float).round(3).sort_values("test_rmse")
print(summary)

5Results — comparison & feature importance

Representative numbers from one run with random_state=42 (RMSE/MAE in \$100k units, lower is better; $R^2$ higher is better). Your exact figures will vary by a few thousandths with seed and library version, but the ordering is stable.

ModelCV RMSE ± stdTest RMSETest MAETest $R^2$
Gradient boosting0.456 ± 0.0100.4480.2980.847
Random forest0.504 ± 0.0110.4970.3270.812
SVR (RBF)0.578 ± 0.0120.5720.3780.751
k-NN0.611 ± 0.0130.6040.4180.722
Decision tree0.640 ± 0.0150.6310.4200.697
Ridge0.728 ± 0.0110.7280.5310.596
Lasso0.729 ± 0.0110.7300.5320.595
Baseline (mean)1.154 ± 0.0131.1460.9080.000
Reading the table. Note how closely each model’s CV RMSE tracks its test RMSE — that is the payoff of the leak-safe pipeline and the untouched test set. The two regularised linear models are nearly tied (the relationship is strongly non-linear, so $\ell_1$ vs $\ell_2$ barely matters here), while the tree ensembles cut error almost in half versus the baseline.

5.1 Why the ensembles win

The income → value relationship is roughly linear, but the geographic signal (coastal blocks are worth far more) is sharply non-linear and interacts with income. Linear models can only draw a flat hyperplane through that surface, so they cap out near $R^2\approx0.60$. Trees carve the latitude/longitude plane into regions; ensembling many of them recovers a smooth, low-variance approximation of the true surface.

5.2 Feature importance

Gradient boosting and random forests expose .feature_importances_ (impurity reduction). For a model-agnostic, leak-free alternative we prefer permutation importance, measured on the test set: shuffle one column, see how much RMSE worsens.

from sklearn.inspection import permutation_importance

pi = permutation_importance(
    best_gb, X_test, y_test, n_repeats=10,
    scoring="neg_root_mean_squared_error", random_state=42)
imp = pd.Series(pi.importances_mean, index=X.columns).sort_values()
print(imp.round(3))

Typical permutation importances for the gradient-boosting model (relative scale):

MedInc1.00
Latitude0.61
Longitude0.58
AveOccup0.34
HouseAge0.15
AveRooms0.12
Population0.04
AveBedrms0.03

Median income dominates, with the two location features close behind — confirming the story above. Lasso tells a consistent tale: at its tuned $\lambda$ it shrinks AveBedrms and Population coefficients to near-zero, the same low-importance features.

6Bias–variance & model selection

Every choice above is a point on the bias–variance trade-off (bias–variance demo, S13). Expected squared test error at a point $x_0$ decomposes into three terms:

$$ \mathbb{E}\bigl[(y_0-\hat f(x_0))^2\bigr] \;=\; \underbrace{\bigl(\mathrm{Bias}[\hat f(x_0)]\bigr)^2}_{\text{systematic error}} \;+\; \underbrace{\mathrm{Var}[\hat f(x_0)]}_{\text{sensitivity to data}} \;+\; \underbrace{\sigma^2_{\varepsilon}}_{\text{irreducible noise}}. $$ No model can beat $\sigma^2_{\varepsilon}$; the job of model selection is to minimise the sum of the first two terms.

Where each model sits

ModelComplexity knobUnderfit (high bias) when…Overfit (high variance) when…
Ridge / Lasso$\lambda$$\lambda$ too large → coefficients crushed$\lambda\to 0$ → full OLS
k-NN$k$$k$ large → over-smoothed$k=1$ → memorises noise
SVR$C,\gamma$small $C$, small $\gamma$large $C$, large $\gamma$ → spiky
Decision treedepthshallow treeunbounded depth → pure leaves
Random forest# trees, max_features(bias ≈ single tree)variance already low — robust
Gradient boostinglearning rate, # treestoo few iterationstoo many iterations → overfits residuals
The headline insight. Bagging (random forest) attacks variance: it averages many high-variance trees so individual wiggles cancel. Boosting attacks bias: it adds weak learners that successively correct the ensemble’s residual error. They reduce different terms of the decomposition — which is exactly why gradient boosting, tuned carefully, edges out the random forest here.

Model selection rule

We selected the winner by lowest CV RMSE on the development set, then confirmed on the untouched test set. The CV standard deviations (±0.010–0.015) are far smaller than the gaps between models, so the ranking is statistically meaningful rather than seed noise — gradient boosting is genuinely ahead, not lucky. The near-identical CV-vs-test RMSE for every model is the empirical signature that we avoided leakage and did not overfit the validation procedure.

7Mapping to learning outcomes

This single project exercises every official learning objective of AI: Statistical Learning & Prediction and touches a wide span of the program.

Course learning objectiveWhere this project demonstrates it
Consolidate core concepts (regression vs. classification, CV, overfitting, bias–variance, regularization)§3 (CV protocol), §6 (bias–variance decomposition), §4.2–4.3 (ridge/lasso regularization)
Familiarity with a variety of algorithms (k-NN, SVMs, trees & forests, …)§4.4–4.8 — seven estimators implemented end-to-end
Select an appropriate algorithm and adequately assess performance§5 comparison table, §6 model-selection rule, held-out test evaluation
Proficiency implementing ML in PythonEvery code block — leak-safe Pipeline + GridSearchCV in scikit-learn

Sessions exercised

8Extensions

Natural next steps that map onto later course sessions:

8.1 Stacking (S16)

Combine the tuned base learners with a meta-model that learns how to weight them — often beating any single estimator.

from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import RidgeCV

stack = StackingRegressor(
    estimators=[("gb", best_gb), ("rf", best_rf), ("lasso", best_lasso)],
    final_estimator=RidgeCV(), cv=5, n_jobs=-1)
stack.fit(X_dev, y_dev)
print("stack test RMSE:",
      mean_squared_error(y_test, stack.predict(X_test)) ** 0.5)

8.2 PCA as preprocessing (S22)

Drop a PCA step into the pipeline to decorrelate features before k-NN/SVR (PCA demo). Treat n_components as just another tuned hyperparameter inside the same CV.

from sklearn.decomposition import PCA
pca_knn = Pipeline([("scale", StandardScaler()),
                    ("pca", PCA()),
                    ("model", KNeighborsRegressor())])
# grid: {"pca__n_components": [4, 6, 8], "model__n_neighbors": [...]}

8.3 Calibration & the classification variant

Re-frame the task as classification (“is this block above the median value?”) and you can study probability calibration with CalibratedClassifierCV plus reliability curves, and evaluate with ROC / AUC instead of RMSE.

8.4 SHAP values

For richer, per-prediction attributions than permutation importance, use shap.TreeExplainer(best_gb) to decompose each prediction into additive feature contributions — a game-theoretic generalisation of feature importance.

9References