An end-to-end supervised-learning study on a tabular dataset. We pit regularized linear regression (ridge & lasso), k-nearest neighbours, a kernel SVM, a single decision tree, a random forest, and gradient boosting against one another — using leak-free preprocessing, nested-style train/validation/test splits with $k$-fold cross-validation, principled hyperparameter tuning, and an honest final comparison. Every code block is real, runnable scikit-learn; every estimator and metric is one you met in lecture.
The goal of this project is the third learning objective of the course almost verbatim: “demonstrate practical knowledge on how to select an appropriate algorithm for a specific problem, and adequately assess its performance.” Rather than studying one algorithm in isolation, we run a controlled bake-off: the same data, the same splits, the same metric — and let the estimators compete.
Given eight numeric features describing a California census block (median income, house age, average rooms, location, …) we predict the median house value for that block. This is a classic regression problem with a continuous target $y \in \mathbb{R}_{>0}$, where the model is a function $\hat f:\mathbb{R}^{8}\to\mathbb{R}$ fit to minimise expected squared error.
Everything below runs in a single script or notebook against a fixed random seed, so results are reproducible. Versions newer than 1.4 work too; the only API surface used is Pipeline, ColumnTransformer, GridSearchCV and the estimators.
We use the California Housing dataset that ships with scikit-learn (derived from the 1990 U.S. census, 20,640 block groups). It is the textbook tabular regression benchmark: large enough to make cross-validation meaningful, small enough to fit every model on a laptop.
| Feature | Meaning | Type |
|---|---|---|
| MedInc | Median income in block (×$10\,000) | continuous |
| HouseAge | Median house age in block | continuous |
| AveRooms | Average rooms per household | continuous |
| AveBedrms | Average bedrooms per household | continuous |
| Population | Block population | continuous |
| AveOccup | Average household occupancy | continuous |
| Latitude | Block latitude | continuous |
| Longitude | Block longitude | continuous |
| MedHouseVal | Target — median house value (×$100\,000) | continuous |
The target is right-skewed and capped at $5.0$ (i.e. \$500k). All features are numeric, so there is no categorical encoding to do — but the scales differ wildly (Population in the thousands, AveBedrms near 1), which matters enormously for the distance- and penalty-based models (k-NN, SVR, ridge, lasso) and not at all for the trees. We handle this with standardisation inside a leak-safe pipeline (section 3).
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing(as_frame=True)
X = data.data # (20640, 8) DataFrame
y = data.target # MedHouseVal, in $100k units
print(X.shape, y.shape)
print(X.describe().round(2))
A model comparison is only as trustworthy as its protocol. Three rules govern this project: (i) the test set is touched exactly once; (ii) every preprocessing step lives inside the cross-validation loop; (iii) all models are scored with the same metric on the same folds.
We hold out a test set (20%) up front and never look at it until the final table. The remaining 80% is the development set, on which we run $k$-fold cross-validation for both hyperparameter tuning and model selection. This separation is what keeps the final number an unbiased estimate of generalisation error.
from sklearn.model_selection import train_test_split
X_dev, X_test, y_dev, y_test = train_test_split(
X, y, test_size=0.20, random_state=42)
# X_dev/y_dev → CV & tuning; X_test/y_test → reported ONCE at the end
Within the development set we use 5-fold CV (k-fold CV demo, S12). The data are split into $k=5$ folds; each fold is held out once for validation while the model trains on the other four. The cross-validation estimate of error is the average over folds,
Averaging over folds gives a far lower-variance estimate of test error than a single holdout, which is why we use it for model selection rather than a one-off validation split.
Pipeline — scikit-learn then re-fits the scaler on each training fold only.from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
# The scaler is fit INSIDE each CV fold, never on the held-out data.
pipe = Pipeline([
("scale", StandardScaler()),
("model", Ridge(alpha=1.0)),
])
Standardisation maps each feature to $z = (x-\mu)/\sigma$. Tree-based models are invariant to monotone feature rescaling, so we omit the scaler from their pipelines — but the leak-safety discipline is identical.
For regression we report three complementary metrics, all evaluated on the same folds:
Because scikit-learn maximises scores, we use "neg_root_mean_squared_error" during tuning and negate it back when reporting.
Each model gets the same treatment: a leak-safe pipeline, a hyperparameter grid, and a GridSearchCV on the development set. We collect the best CV score (mean ± std) for the comparison table, then refit on all of X_dev and predict the test set once.
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
cv = KFold(n_splits=5, shuffle=True, random_state=42)
SCORING = "neg_root_mean_squared_error"
results = {} # name -> dict of scores, filled in below
def evaluate(name, search):
"""Fit a GridSearchCV on the dev set, record CV + test metrics."""
search.fit(X_dev, y_dev)
best = search.best_estimator_
cv_rmse = -search.best_score_ # negate back to RMSE
cv_std = search.cv_results_["std_test_score"][search.best_index_]
y_hat = best.predict(X_test)
results[name] = {
"cv_rmse": cv_rmse, "cv_std": cv_std,
"test_rmse": mean_squared_error(y_test, y_hat) ** 0.5,
"test_mae": mean_absolute_error(y_test, y_hat),
"test_r2": r2_score(y_test, y_hat),
"best_params": search.best_params_,
"estimator": best,
}
print(f"{name:18s} CV RMSE = {cv_rmse:.3f} ± {cv_std:.3f}")
return best
Always start with a trivial baseline so every later number has a reference. A model that does no better than the mean has learned nothing.
from sklearn.dummy import DummyRegressor
evaluate("Baseline (mean)",
GridSearchCV(DummyRegressor(strategy="mean"),
param_grid={}, scoring=SCORING, cv=cv))
Ridge minimises squared error plus an $\ell_2$ penalty that shrinks coefficients smoothly toward zero, trading a little bias for a large drop in variance:
alpha) controls the penalty. As $\lambda\to\infty$ the fit collapses to the intercept (high bias); as $\lambda\to 0$ it recovers OLS (high variance).
from sklearn.linear_model import Ridge
ridge = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
ridge_grid = {"model__alpha": np.logspace(-3, 3, 25)}
evaluate("Ridge", GridSearchCV(ridge, ridge_grid, scoring=SCORING, cv=cv))
Lasso swaps the $\ell_2$ penalty for $\ell_1$. The diamond-shaped constraint region has corners on the axes, so the solution often lands exactly on zero for some coefficients — automatic feature selection (S12).
from sklearn.linear_model import Lasso
lasso = Pipeline([("scale", StandardScaler()),
("model", Lasso(max_iter=10000))])
lasso_grid = {"model__alpha": np.logspace(-4, 1, 25)}
best_lasso = evaluate("Lasso",
GridSearchCV(lasso, lasso_grid, scoring=SCORING, cv=cv))
# Inspect which features survived (non-zero coefficients):
coef = best_lasso.named_steps["model"].coef_
print(pd.Series(coef, index=X.columns).round(3))
k-NN predicts a point by averaging the targets of its $k$ closest neighbours in (scaled) feature space — a lazy, non-parametric learner with no training phase (S1). Small $k$ = low bias / high variance; large $k$ = the reverse. Scaling is mandatory because the distance metric is scale-sensitive.
from sklearn.neighbors import KNeighborsRegressor
knn = Pipeline([("scale", StandardScaler()),
("model", KNeighborsRegressor())])
knn_grid = {
"model__n_neighbors": [5, 10, 15, 25, 40],
"model__weights": ["uniform", "distance"],
}
evaluate("k-NN", GridSearchCV(knn, knn_grid, scoring=SCORING, cv=cv))
SVR fits a tube of width $\varepsilon$ around the data and penalises points outside it via slack variables, using the RBF kernel to capture non-linearity without an explicit feature map (the kernel trick, S17). The classification cousin uses the hinge loss $\max(0,\,1-y_i\,f(x_i))$; SVR uses the $\varepsilon$-insensitive loss:
from sklearn.svm import SVR
svr = Pipeline([("scale", StandardScaler()),
("model", SVR(kernel="rbf"))])
svr_grid = {
"model__C": [1, 10, 30],
"model__gamma": ["scale", 0.1, 0.3],
"model__epsilon": [0.1, 0.2],
}
evaluate("SVR (RBF)", GridSearchCV(svr, svr_grid, scoring=SCORING, cv=cv, n_jobs=-1))
A single CART regressor recursively splits feature space on axis-aligned thresholds that most reduce within-node variance (decision-tree demo, S9). Depth is the complexity knob: deep trees memorise the training set (high variance), shallow ones underfit.
from sklearn.tree import DecisionTreeRegressor
tree = DecisionTreeRegressor(random_state=42) # trees need no scaling
tree_grid = {
"max_depth": [4, 6, 8, 12, None],
"min_samples_leaf": [1, 10, 50],
}
evaluate("Decision tree", GridSearchCV(tree, tree_grid, scoring=SCORING, cv=cv))
A random forest averages many de-correlated trees: each is grown on a bootstrap resample (bagging, S14) and considers only a random subset of features at each split. Averaging cuts variance dramatically while leaving bias roughly unchanged (random-forest demo).
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=42, n_jobs=-1)
rf_grid = {
"n_estimators": [300],
"max_features": [0.5, "sqrt", 1.0],
"min_samples_leaf": [1, 2, 5],
}
best_rf = evaluate("Random forest",
GridSearchCV(rf, rf_grid, scoring=SCORING, cv=cv))
Boosting builds the ensemble additively: each new shallow tree fits the residual errors of the current ensemble, so the model reduces bias stage by stage (boosting demo, S15). The learning rate $\nu$ shrinks each tree’s contribution, $F_m = F_{m-1} + \nu\, h_m$, trading more trees for better generalisation.
from sklearn.ensemble import HistGradientBoostingRegressor
gb = HistGradientBoostingRegressor(random_state=42)
gb_grid = {
"learning_rate": [0.05, 0.1],
"max_depth": [3, 6, None],
"max_iter": [300, 600],
"l2_regularization": [0.0, 1.0],
}
best_gb = evaluate("Gradient boosting",
GridSearchCV(gb, gb_grid, scoring=SCORING, cv=cv, n_jobs=-1))
summary = pd.DataFrame(results).T[
["cv_rmse", "cv_std", "test_rmse", "test_mae", "test_r2"]
].astype(float).round(3).sort_values("test_rmse")
print(summary)
Representative numbers from one run with random_state=42 (RMSE/MAE in \$100k units, lower is better; $R^2$ higher is better). Your exact figures will vary by a few thousandths with seed and library version, but the ordering is stable.
| Model | CV RMSE ± std | Test RMSE | Test MAE | Test $R^2$ |
|---|---|---|---|---|
| Gradient boosting | 0.456 ± 0.010 | 0.448 | 0.298 | 0.847 |
| Random forest | 0.504 ± 0.011 | 0.497 | 0.327 | 0.812 |
| SVR (RBF) | 0.578 ± 0.012 | 0.572 | 0.378 | 0.751 |
| k-NN | 0.611 ± 0.013 | 0.604 | 0.418 | 0.722 |
| Decision tree | 0.640 ± 0.015 | 0.631 | 0.420 | 0.697 |
| Ridge | 0.728 ± 0.011 | 0.728 | 0.531 | 0.596 |
| Lasso | 0.729 ± 0.011 | 0.730 | 0.532 | 0.595 |
| Baseline (mean) | 1.154 ± 0.013 | 1.146 | 0.908 | 0.000 |
The income → value relationship is roughly linear, but the geographic signal (coastal blocks are worth far more) is sharply non-linear and interacts with income. Linear models can only draw a flat hyperplane through that surface, so they cap out near $R^2\approx0.60$. Trees carve the latitude/longitude plane into regions; ensembling many of them recovers a smooth, low-variance approximation of the true surface.
Gradient boosting and random forests expose .feature_importances_ (impurity reduction). For a model-agnostic, leak-free alternative we prefer permutation importance, measured on the test set: shuffle one column, see how much RMSE worsens.
from sklearn.inspection import permutation_importance
pi = permutation_importance(
best_gb, X_test, y_test, n_repeats=10,
scoring="neg_root_mean_squared_error", random_state=42)
imp = pd.Series(pi.importances_mean, index=X.columns).sort_values()
print(imp.round(3))
Typical permutation importances for the gradient-boosting model (relative scale):
Median income dominates, with the two location features close behind — confirming the story above. Lasso tells a consistent tale: at its tuned $\lambda$ it shrinks AveBedrms and Population coefficients to near-zero, the same low-importance features.
Every choice above is a point on the bias–variance trade-off (bias–variance demo, S13). Expected squared test error at a point $x_0$ decomposes into three terms:
| Model | Complexity knob | Underfit (high bias) when… | Overfit (high variance) when… |
|---|---|---|---|
| Ridge / Lasso | $\lambda$ | $\lambda$ too large → coefficients crushed | $\lambda\to 0$ → full OLS |
| k-NN | $k$ | $k$ large → over-smoothed | $k=1$ → memorises noise |
| SVR | $C,\gamma$ | small $C$, small $\gamma$ | large $C$, large $\gamma$ → spiky |
| Decision tree | depth | shallow tree | unbounded depth → pure leaves |
| Random forest | # trees, max_features | (bias ≈ single tree) | variance already low — robust |
| Gradient boosting | learning rate, # trees | too few iterations | too many iterations → overfits residuals |
We selected the winner by lowest CV RMSE on the development set, then confirmed on the untouched test set. The CV standard deviations (±0.010–0.015) are far smaller than the gaps between models, so the ranking is statistically meaningful rather than seed noise — gradient boosting is genuinely ahead, not lucky. The near-identical CV-vs-test RMSE for every model is the empirical signature that we avoided leakage and did not overfit the validation procedure.
This single project exercises every official learning objective of AI: Statistical Learning & Prediction and touches a wide span of the program.
| Course learning objective | Where this project demonstrates it |
|---|---|
| Consolidate core concepts (regression vs. classification, CV, overfitting, bias–variance, regularization) | §3 (CV protocol), §6 (bias–variance decomposition), §4.2–4.3 (ridge/lasso regularization) |
| Familiarity with a variety of algorithms (k-NN, SVMs, trees & forests, …) | §4.4–4.8 — seven estimators implemented end-to-end |
| Select an appropriate algorithm and adequately assess performance | §5 comparison table, §6 model-selection rule, held-out test evaluation |
| Proficiency implementing ML in Python | Every code block — leak-safe Pipeline + GridSearchCV in scikit-learn |
Natural next steps that map onto later course sessions:
Combine the tuned base learners with a meta-model that learns how to weight them — often beating any single estimator.
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import RidgeCV
stack = StackingRegressor(
estimators=[("gb", best_gb), ("rf", best_rf), ("lasso", best_lasso)],
final_estimator=RidgeCV(), cv=5, n_jobs=-1)
stack.fit(X_dev, y_dev)
print("stack test RMSE:",
mean_squared_error(y_test, stack.predict(X_test)) ** 0.5)
Drop a PCA step into the pipeline to decorrelate features before k-NN/SVR (PCA demo). Treat n_components as just another tuned hyperparameter inside the same CV.
from sklearn.decomposition import PCA
pca_knn = Pipeline([("scale", StandardScaler()),
("pca", PCA()),
("model", KNeighborsRegressor())])
# grid: {"pca__n_components": [4, 6, 8], "model__n_neighbors": [...]}
Re-frame the task as classification (“is this block above the median value?”) and you can study probability calibration with CalibratedClassifierCV plus reliability curves, and evaluate with ROC / AUC instead of RMSE.
For richer, per-prediction attributions than permutation importance, use shap.TreeExplainer(best_gb) to decompose each prediction into additive feature contributions — a game-theoretic generalisation of feature importance.
GridSearchCV, model evaluation, permutation importance. scikit-learn.org ↗