Statistical Learning & Prediction

An interactive companion to the SLP course. Every section below is a live, click-and-drag visualisation of a core concept — from k-NN and bias/variance up to kernel SVMs, ensembles, and PCA. Drag points, move sliders, watch the model react.

k-NN · Linear Reg · Logistic Reg · Ridge / Lasso · Bias–Variance · k-Fold CV · ROC / AUC · Gradient Descent · Backprop · Decision Trees · Random Forest · AdaBoost · Kernels · SVM · PCA · k-Means · Hierarchical · Anomaly

How to use

Click on any canvas to add data points.
Drag sliders to change hyper-parameters; the model re-fits live.
Use Reset / Resample buttons in each demo.
Two-class problems use blue and pink.

Topics covered

k-NN · linear/ridge/lasso regression · bias–variance · gradient descent · softmax + SGD · activation functions · MLP + backprop · decision trees · bagging · random forest · AdaBoost · kernel trick · soft-margin SVM · PCA · MDS · k-means

k-Nearest Neighbours

A lazy, non-parametric classifier: to predict a point's class, look at its k nearest training points and take a (weighted) majority vote. No training — all the work happens at query time.

k = 5

Distance

Weighted vote

Add as class

Click on the canvas to add training points. The shaded background is the k-NN decision region for the current k.

Intuition

Small k ⇒ jagged, high-variance decision boundary that fits noise. Large k ⇒ smoother but biased boundary. Plot a validation curve over k to choose. Distance metric and feature scaling matter a lot.
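The vote described above fits in a few lines. This is a minimal NumPy sketch, not the demo's own code: `knn_predict` and its arguments are hypothetical names, Euclidean distance is assumed, and the weighted vote uses inverse distance as one common choice.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5, weighted=False):
    """Classify x_query by a (optionally distance-weighted) vote of its k nearest neighbours."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                     # indices of the k closest points
    # Assumption: inverse-distance weights; the small constant avoids division by zero.
    weights = 1.0 / (dists[nearest] + 1e-9) if weighted else np.ones(len(nearest))
    votes = {}
    for label, w in zip(y_train[nearest], weights):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

# Toy data: two small clusters, classes 0 and 1.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
label = knn_predict(X, y, np.array([0.15, 0.15]), k=3)
```

Because distances are computed on raw coordinates, rescaling one feature changes which neighbours count as "nearest", which is why feature scaling matters.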

Linear Regression

Fit ŷ = w·x + b by minimising mean-squared error. Drag points around — the OLS solution updates in closed form.

Degree = 1

—train MSE

—R²

Click to add points; drag existing points to move them. Increasing polynomial degree shows over-fitting — high degree memorises noise.
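A least-squares polynomial fit and the two readouts (train MSE and R²) can be sketched as follows. The function name `fit_polynomial` is hypothetical, and `np.linalg.lstsq` is used here in place of whatever solver the demo runs.

```python
import numpy as np

def fit_polynomial(x, y, degree=1):
    """Least-squares polynomial fit; returns coefficients (lowest power first), train MSE and R^2."""
    A = np.vander(x, degree + 1, increasing=True)       # design matrix with columns 1, x, x^2, ...
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)      # minimises ||A c - y||^2
    residuals = y - A @ coeffs
    mse = np.mean(residuals ** 2)
    r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
    return coeffs, mse, r2

# Noise-free line y = 2x + 1: a degree-1 fit should recover b = 1, w = 2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
coeffs, mse, r2 = fit_polynomial(x, y, degree=1)
```

Raising `degree` can only lower the train MSE, which is exactly why train error alone cannot reveal over-fitting.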

Loss Functions

A loss measures how wrong a single prediction is. Different losses give very different sensitivities to outliers and large errors.

δ (Huber) = 1.0

L2 (squared) L1 (absolute) Huber 0/1 (classification) hinge log-loss

L2 punishes large errors quadratically, so it is sensitive to outliers. L1 is robust but non-differentiable at 0. Huber blends both: quadratic for errors below δ, linear above. Hinge and log-loss are the losses behind the SVM and logistic regression, respectively.
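For reference, here is one way to write the plotted losses. Regression losses take a residual r = y − ŷ; the classification losses take a margin m = y·f(x) with y ∈ {−1, +1}. The function names are illustrative, and the Huber loss uses the common ½r² convention for its quadratic part, which is an assumption about the demo.

```python
import numpy as np

def l2(r):
    return r ** 2

def l1(r):
    return np.abs(r)

def huber(r, delta=1.0):
    # Quadratic near zero, linear in the tails; the two pieces meet smoothly at |r| = delta.
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))

def zero_one(m):
    return (m <= 0).astype(float)          # 1 when the prediction is wrong (or on the boundary)

def hinge(m):
    return np.maximum(0.0, 1.0 - m)        # zero once the margin exceeds 1

def log_loss(m):
    return np.log1p(np.exp(-m))            # logistic loss written in terms of the margin
```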

Regularization — Ridge & Lasso

Adding a penalty on the coefficient norm shrinks the weights, accepting a little extra bias in exchange for lower variance. L2 (ridge) shrinks all coefficients smoothly; L1 (lasso) can drive some of them exactly to zero, so it doubles as a feature selector.

log₁₀ λ = -3

Penalty

Degree = 9

Coefficient bar chart

Watch coefficients shrink as λ grows. With lasso, many bars snap to exactly zero — that's sparsity.
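The difference between the two penalties shows up in two small formulas. The sketch below gives the closed-form ridge solution and the soft-thresholding operator that lasso solvers such as coordinate descent or ISTA apply. Both function names are hypothetical, and for brevity the ridge version penalises every column, including any intercept column.

```python
import numpy as np

def ridge_fit(A, y, lam):
    """Closed-form ridge solution w = (A^T A + lam I)^(-1) A^T y."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

def soft_threshold(w, t):
    """Shrink every entry toward zero by t; entries with |w| <= t become exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)
```

Ridge rescales coefficients but never zeroes them, whereas the thresholding step is what makes lasso bars snap to exactly zero.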

Bias–Variance Tradeoff

For squared-error loss, expected test error decomposes into bias² + variance + irreducible noise. Simple models have high bias; complex ones have high variance. The sweet spot minimises their sum.

Each thin line is a polynomial fit on a different random sample. Wider spread ⇒ higher variance.

Train vs. test error as model complexity grows.

Degree = 3

Noise σ = 0.30

Samples / fit = 25
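The decomposition can be estimated by simulation: refit the same model on many noisy samples and look at the mean and the spread of its predictions. The sketch below makes several assumptions that are not taken from the demo: a sine ground truth (`true_f`), a fixed x grid with fresh noise for each fit, and `np.polyfit` as the fitting routine.

```python
import numpy as np

def true_f(x):
    return np.sin(2 * np.pi * x)            # assumed ground-truth function

def bias_variance(degree, n_fits=200, n_samples=25, sigma=0.3, seed=0):
    """Monte Carlo estimate of (bias^2, variance), averaged over a test grid."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_samples)     # fixed design; only the noise is resampled
    x_test = np.linspace(0.05, 0.95, 50)
    preds = np.empty((n_fits, x_test.size))
    for i in range(n_fits):
        y = true_f(x) + rng.normal(0.0, sigma, n_samples)
        preds[i] = np.polyval(np.polyfit(x, y, degree), x_test)
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

bias2_low, var_low = bias_variance(degree=1)
bias2_high, var_high = bias_variance(degree=9)
```

Under these assumptions the straight line shows the larger bias² and the degree-9 polynomial the larger variance, mirroring the spread of the thin lines in the plot.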

Gradient Descent

Walk downhill on a loss surface by stepping opposite to the gradient. Step-size (learning rate) makes or breaks it.

Learning rate η = 0.10

Momentum β = 0.0

Surface

Click on the surface to set the starting point. Too large an η makes the iterates diverge; momentum damps the zig-zag in narrow ravines.
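The update rule behind the two sliders, sketched with heavy-ball momentum. The elongated quadratic bowl is an assumed stand-in for the demo's surfaces, and `gradient_descent` is a hypothetical name.

```python
import numpy as np

def gradient_descent(grad, start, lr=0.1, momentum=0.0, steps=100):
    """Return the path of iterates; momentum=0 gives plain gradient descent."""
    x = np.asarray(start, dtype=float)
    v = np.zeros_like(x)
    path = [x.copy()]
    for _ in range(steps):
        v = momentum * v - lr * grad(x)     # velocity: decayed history plus the new downhill step
        x = x + v
        path.append(x.copy())
    return np.array(path)

def bowl_grad(p):
    """Gradient of f(x, y) = 0.5 * (x^2 + 10 y^2), a ravine that is steep along y."""
    return np.array([p[0], 10.0 * p[1]])

converged = gradient_descent(bowl_grad, [2.0, 1.0], lr=0.10, steps=200)
diverged = gradient_descent(bowl_grad, [2.0, 1.0], lr=0.25, steps=50)
```

On this bowl the steep direction has curvature 10, so plain gradient descent is only stable for η < 0.2; at η = 0.25 each step overshoots by more than it corrects.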

Softmax Regression with SGD

Multinomial logistic regression: outputs a probability over K classes via softmax, trained by minimising cross-entropy with mini-batch SGD.

Classes = 3

η = 0.10

Batch = 16

—cross-entropy

—accuracy

Click to add a point of the currently selected class. The decision boundaries are linear: softmax regression on raw features can only separate classes with straight lines.
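A compact sketch of the training loop, assuming zero initialisation and a fresh shuffle each epoch. `train_softmax` and its defaults are illustrative rather than the demo's actual settings.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)    # subtract the row max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def train_softmax(X, y, n_classes, lr=0.1, batch=16, epochs=200, seed=0):
    """Mini-batch SGD on mean cross-entropy; y holds integer class labels."""
    rng = np.random.default_rng(seed)
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                # one-hot targets
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch):
            idx = order[start:start + batch]
            P = softmax(X[idx] @ W + b)
            G = (P - Y[idx]) / len(idx)     # gradient of the mean cross-entropy w.r.t. the logits
            W -= lr * X[idx].T @ G
            b -= lr * G.sum(axis=0)
    return W, b
```

The predicted class is `np.argmax(X @ W + b, axis=1)`. Since the logits are linear in x, every boundary between two classes is a straight line.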

Activation Functions

The non-linearity inside each neuron. Choosing the right one controls saturation, dead units, and gradient flow.

f(x) · f'(x)

Sigmoid & tanh saturate ⇒ vanishing gradients. ReLU is cheap but can die. LeakyReLU/ELU/GELU/Swish are common modern choices.

MLP & Backpropagation

A small fully-connected net trained live on a 2-D classification task. Backprop is just chained partial derivatives — watch the boundary curve as gradients flow.

Hidden layers

Activation

η = 0.05

Dataset

—loss

0 epochs

Linear models can't solve XOR or spirals — adding hidden units carves curved boundaries.
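To make "chained partial derivatives" concrete, here is a sketch of a one-hidden-layer network with hand-written gradients. The architecture (8 tanh units, sigmoid output, binary cross-entropy) and all names are assumptions chosen for brevity, not the demo's configuration.

```python
import numpy as np

def init_params(n_in=2, n_hidden=8, seed=0):
    rng = np.random.default_rng(seed)
    return {"W1": rng.normal(0, 1, (n_in, n_hidden)), "b1": np.zeros(n_hidden),
            "W2": rng.normal(0, 1, (n_hidden, 1)), "b2": np.zeros(1)}

def forward(p, X):
    H = np.tanh(X @ p["W1"] + p["b1"])                       # hidden activations
    out = 1.0 / (1.0 + np.exp(-(H @ p["W2"] + p["b2"])))     # sigmoid output
    return H, out.ravel()

def bce(out, y):
    eps = 1e-12                                              # guards against log(0)
    return -np.mean(y * np.log(out + eps) + (1 - y) * np.log(1 - out + eps))

def backward(p, X, y):
    """Gradients of the mean BCE with respect to every parameter, via the chain rule."""
    H, out = forward(p, X)
    d_logit = ((out - y) / len(y))[:, None]    # dL/d(output logit) for sigmoid + BCE
    d_H = d_logit @ p["W2"].T                  # back through the output weights
    d_pre = d_H * (1.0 - H ** 2)               # back through tanh: tanh'(a) = 1 - tanh(a)^2
    return {"W1": X.T @ d_pre, "b1": d_pre.sum(axis=0),
            "W2": H.T @ d_logit, "b2": d_logit.sum(axis=0)}

def train(p, X, y, lr=0.3, epochs=2000):
    for _ in range(epochs):
        g = backward(p, X, y)
        for name in p:
            p[name] -= lr * g[name]            # full-batch gradient step
    return p

X_xor = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_xor = np.array([0.0, 1.0, 1.0, 0.0])
```

A quick sanity check for any hand-written backward pass is to compare one entry of a gradient against a finite-difference estimate of the loss.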

Decision Trees

Recursively split feature space along axis-aligned cuts that minimise impurity (Gini / entropy). Each leaf predicts the majority class within its rectangle.

Tree structure — node labels show split feature & threshold.

Max depth = 4

Min leaf = 1

Criterion

Add as

Click to add labelled points. Deeper trees can fit the training data perfectly, noise included; that's over-fitting.
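Growing a tree repeats one step: find the axis-aligned cut with the lowest weighted impurity. A sketch of that single step with the Gini criterion follows; `gini` and `best_split` are hypothetical names, and candidate thresholds are taken as midpoints between sorted unique values, which is one common convention.

```python
import numpy as np

def gini(y):
    """Gini impurity 1 - sum_c p_c^2 of a label array."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive search; returns (feature index, threshold, weighted impurity after the split)."""
    best = (None, None, gini(y))                         # "no split" is the baseline to beat
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2:         # midpoints between neighbouring values
            left = X[:, j] <= t
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
            if score < best[2]:
                best = (j, t, score)
    return best

X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [6.0, 2.0], [7.0, 5.0], [8.0, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])
feature, threshold, impurity = best_split(X, y)
```

A full tree applies `best_split` recursively to each side until max depth or min leaf size stops it.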

Bagging — Bootstrap Aggregation

Train many high-variance learners on bootstrap resamples of the data, then average. Variance ↓, bias ≈ unchanged.

# trees = 15

Tree depth = 6

Show

A single deep tree is wiggly. Averaging many trees on different bootstraps gives a smooth, lower-variance boundary.

Random Forest

Bagging + random feature subset at each split — de-correlates the trees, reducing ensemble variance further.

# trees = 25

Tree depth = 6

Features / split

In 2-D there are only 2 features, so "feature bagging" is dramatic — boundaries get rounder.

Boosting — AdaBoost

Sequentially fit weak learners (decision stumps) on re-weighted data — misclassified samples get more attention each round.

Round T = 1

—train err

Each stump (vertical/horizontal line) is shown faintly. Sample dot size shows current weight — hard examples grow.
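The re-weighting loop, sketched for labels in {−1, +1} using the classic discrete AdaBoost update. The brute-force stump search and the function names are illustrative assumptions.

```python
import numpy as np

def fit_stump(X, y, w):
    """Best axis-aligned stump under sample weights w; returns (weighted error, feature, threshold, sign)."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.where(X[:, j] <= t, -1, 1)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost(X, y, rounds=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                        # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        err, j, t, s = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)       # keep the log finite
        alpha = 0.5 * np.log((1 - err) / err)      # stump's say in the final vote
        pred = s * np.where(X[:, j] <= t, -1, 1)
        w = w * np.exp(-alpha * y * pred)          # up-weight mistakes, down-weight hits
        w /= w.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * np.where(X[:, j] <= t, -1, 1) for a, j, t, s in ensemble)
    return np.sign(score)

# 1-D "interval" pattern that no single stump can classify correctly.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1, 1, -1, -1, 1, 1])
```

On this toy pattern one stump always misclassifies at least two points, while three boosted stumps already fit all six.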

The Kernel Trick

Data that is not linearly separable in low dimensions often becomes linearly separable when lifted to higher dimensions. A kernel computes inner products in the lifted space directly from the original inputs, so the feature map never has to be evaluated explicitly.

2-D view (input space)

3-D view: φ(x,y) = (x, y, x²+y²) — a paraboloid lift.

Rotate = 25°

Lift α = 1.0
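For the paraboloid lift φ(x, y) = (x, y, x² + y²), the matching kernel is k(a, b) = a·b + |a|²|b|². The sketch below checks that identity numerically and builds two concentric rings, which no line separates in 2-D but a horizontal plane separates after the lift. The function names and the ring radii are illustrative.

```python
import numpy as np

def lift(P):
    """Explicit feature map phi(x, y) = (x, y, x^2 + y^2), applied row-wise."""
    return np.column_stack([P, (P ** 2).sum(axis=1)])

def lift_kernel(a, b):
    """phi(a) . phi(b) computed without ever building phi."""
    return a @ b + (a @ a) * (b @ b)

# Two concentric rings: radius 1 (one class) and radius 3 (the other).
angles = np.linspace(0.0, 2.0 * np.pi, 20, endpoint=False)
ring = np.column_stack([np.cos(angles), np.sin(angles)])
inner, outer = 1.0 * ring, 3.0 * ring

# After the lift, the third coordinate is the squared radius: 1 for inner, 9 for outer.
inner_height = lift(inner)[:, 2]
outer_height = lift(outer)[:, 2]
```

Any plane z = c with 1 < c < 9 separates the lifted rings; projected back to 2-D that plane becomes a circle.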

Support Vector Machine

Maximum-margin linear classifier with slack (soft margin). Only the support vectors determine the boundary.

C = 1.0

Kernel

RBF γ = 1.0

Add as

Support vectors are circled. With the RBF kernel you get curved boundaries: high γ makes each point's influence very local (a wiggly boundary), while low γ gives a smoother one.
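Two pieces of the SVM are easy to write down without a full solver: the soft-margin objective that C controls, and the RBF kernel that γ controls. This sketch evaluates both for given parameters rather than optimising them; labels are assumed to be in {−1, +1} and all names are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

def soft_margin_objective(w, b, X, y, C=1.0):
    """0.5 ||w||^2 + C * sum of slacks, where slack_i = max(0, 1 - y_i (w . x_i + b))."""
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * w @ w + C * slack.sum()

def on_or_inside_margin(w, b, X, y, tol=1e-9):
    """Mask of points with y_i (w . x_i + b) <= 1; only these can be support vectors."""
    return y * (X @ w + b) <= 1.0 + tol

X = np.array([[-2.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b = np.array([1.0, 0.0]), 0.0           # the max-margin separator for this toy set
```

Moving the two outer points further out changes neither the objective nor the mask, which is the sense in which only support vectors determine the boundary.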

Principal Component Analysis

Find orthogonal axes that capture the most variance. Drag points — the principal components rotate to follow the data's spread.

Original 2-D data with PC1 (red) and PC2 (cyan).

Projection onto chosen number of components.

Keep components

Variance explained—

Click to add points or drag existing ones. With k=1 you see the lossy 1-D reconstruction.
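PCA through the singular value decomposition of the centred data, as a sketch. `pca` is a hypothetical helper that returns the components, the fraction of variance each explains, and the reconstruction from the first k components.

```python
import numpy as np

def pca(X, k):
    mean = X.mean(axis=0)
    Xc = X - mean                                         # PCA needs centred data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)     # rows of Vt are the principal axes
    components = Vt[:k]
    explained = S ** 2 / np.sum(S ** 2)                   # variance fraction per component
    recon = Xc @ components.T @ components + mean         # project onto k axes, then map back
    return components, explained, recon

# Points lying exactly on the line y = 2x: one component captures everything.
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
components, explained, recon = pca(X, k=1)
```

The sign of each component is arbitrary, so an axis may flip direction between runs without changing the projection's quality.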

Multidimensional Scaling

Embed objects into 2-D such that pairwise Euclidean distances best match a given dissimilarity matrix. Useful when you only have similarities, not coordinates.

Dataset

—stress

Stress is the sum of squared mismatches between the target dissimilarities and the current 2-D distances. SMACOF-style updates pull and push points until it settles.
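A sketch of raw stress and one SMACOF update (the Guttman transform with unit weights), which is guaranteed not to increase the stress. Whether the demo uses exactly this variant is an assumption; the function names are illustrative.

```python
import numpy as np

def pairwise(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def stress(X, D):
    """Sum over pairs i < j of (d_ij(X) - D_ij)^2."""
    iu = np.triu_indices(len(X), k=1)
    return np.sum((pairwise(X)[iu] - D[iu]) ** 2)

def smacof_step(X, D):
    """One Guttman transform: X_new = B(X) X / n."""
    d = pairwise(X)
    with np.errstate(divide="ignore", invalid="ignore"):
        B = np.where(d > 0, -D / d, 0.0)       # off-diagonal entries; coincident points get 0
    np.fill_diagonal(B, 0.0)
    np.fill_diagonal(B, -B.sum(axis=1))        # diagonal makes every row sum to zero
    return B @ X / len(X)

# Target dissimilarities: the distances between the corners of a unit square.
target = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
D = pairwise(target)
X = np.random.default_rng(0).normal(size=(4, 2))   # random starting layout
history = [stress(X, D)]
for _ in range(100):
    X = smacof_step(X, D)
    history.append(stress(X, D))
```

The recovered layout matches the target only up to rotation, reflection and translation, since distances are all that MDS sees.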

k-Means Clustering

Pick k centroids, assign each point to its nearest one, recompute the centroids, and repeat until stable. Neither step can increase the within-cluster sum of squares (WCSS), so the procedure converges, but only to a local minimum.

k = 3

Init

0 iter

—WCSS

Click to add a custom point. Bad initial centroids → bad local minimum; try k-means++.
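Lloyd's algorithm, as described above, in a short sketch. The starting centroids are passed in explicitly (`init`) so that the effect of initialisation is easy to experiment with; the function name and the empty-cluster rule (keep the old centroid) are assumptions.

```python
import numpy as np

def kmeans(X, init, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = np.array(init, dtype=float)
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                              # assign to the nearest centroid
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                        for j in range(len(centroids))])       # recompute each centroid
        if np.allclose(new, centroids):
            break
        centroids = new
    wcss = np.sum((X - centroids[labels]) ** 2)                # within-cluster sum of squares
    return centroids, labels, wcss

# Two well-separated synthetic blobs, one start taken from each.
rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.5, (50, 2))
blob_b = rng.normal(0.0, 0.5, (50, 2)) + 10.0
X = np.vstack([blob_a, blob_b])
centroids, labels, wcss = kmeans(X, init=X[[0, -1]])
```

Re-running with different `init` values and keeping the result with the lowest WCSS is the usual remedy for bad local minima.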

Logistic Regression

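Linear classifier with a sigmoid squash. Fit by maximising likelihood (= minimising binary cross-entropy). The output is a probability, which is useful when you need scores you can read as probabilities; in practice they are often, though not always, reasonably well calibrated.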

2-D classifier. White line is the 0.5 boundary; pink/blue gradient is P(y=1|x).

σ(z) curve with marker at the currently hovered point's z = w·x+b.

η = 0.10

L2 = 0.00

Add as

—BCE

—acc
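The two sliders (η and L2) map onto a plain gradient-descent loop like the one sketched here. Full-batch updates, zero initialisation and an unpenalised intercept are assumptions; `fit_logistic` is a hypothetical name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, l2=0.0, steps=2000):
    """Gradient descent on mean binary cross-entropy plus (l2 / 2) * ||w||^2; y in {0, 1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)                           # P(y = 1 | x) under the current weights
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)      # BCE gradient plus the ridge term
        b -= lr * np.mean(p - y)                         # intercept is left unpenalised
    return w, b
```

The 0.5 boundary is the line w·x + b = 0. On perfectly separable data the unregularised weights keep growing with more steps, so a non-zero L2 value also keeps the probabilities from saturating.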

k-Fold Cross-Validation

Split the data into k folds; train on k-1 and validate on the held-out one. Rotate. The averaged validation error is a less-noisy estimate of test error than a single holdout split.

k = 5

Current fold = 0

Polynomial deg = 3

—fold MSE

—CV MSE

Held-out fold shown in orange. Train MSE on blue, validation MSE on orange. Sweep degree to find the bias/variance sweet spot.
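The rotation over folds, sketched for the polynomial model used in this demo. `kfold_mse` is a hypothetical helper; it shuffles once, splits the indices into k nearly equal folds, and returns one validation MSE per fold.

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)    # k disjoint index sets
    scores = []
    for i in range(k):
        val = folds[i]                                                     # held-out fold
        train = np.concatenate([folds[j] for j in range(k) if j != i])     # the other k-1 folds
        coeffs = np.polyfit(x[train], y[train], degree)
        scores.append(np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2))
    return np.array(scores)

# Noise-free parabola: degree 2 should validate almost perfectly, degree 1 should not.
x = np.linspace(-1.0, 1.0, 30)
y = x ** 2
cv_deg1 = kfold_mse(x, y, degree=1).mean()
cv_deg2 = kfold_mse(x, y, degree=2).mean()
```

The CV MSE readout corresponds to the mean of the returned array; sweeping `degree` and picking the minimum is the model-selection step.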

ROC, PR & Confusion Matrix

A trained classifier outputs a score; sliding the decision threshold trades false positives against false negatives. ROC plots TPR vs FPR; PR plots precision vs recall.

Class score histograms — drag the τ slider to move the threshold.

ROC curve. AUC printed below. Current threshold marked.

Precision-Recall curve.

Threshold τ = 0.50

Class separation = 2.0

Confusion matrix at τ

Accuracy — Precision — Recall — F1 — AUC —
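The readouts above all derive from the four confusion-matrix counts at a given τ. The sketch below computes those counts and traces the ROC curve by sweeping τ over the observed scores; the function names are hypothetical, labels are assumed to be 0/1, and a point is predicted positive when its score is at least τ.

```python
import numpy as np

def confusion(scores, y, tau):
    """Return (TP, FP, FN, TN) for the rule 'predict positive when score >= tau'."""
    pred = scores >= tau
    tp = np.sum(pred & (y == 1))
    fp = np.sum(pred & (y == 0))
    fn = np.sum(~pred & (y == 1))
    tn = np.sum(~pred & (y == 0))
    return tp, fp, fn, tn

def roc_auc(scores, y):
    """ROC points from the strictest to the loosest threshold, plus trapezoid-rule AUC."""
    thresholds = np.concatenate([[np.inf], np.sort(np.unique(scores))[::-1]])
    tpr, fpr = [], []
    for tau in thresholds:
        tp, fp, fn, tn = confusion(scores, y, tau)
        tpr.append(tp / (tp + fn))
        fpr.append(fp / (fp + tn))
    tpr, fpr = np.array(tpr), np.array(fpr)
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)
    return fpr, tpr, auc

scores = np.array([0.1, 0.4, 0.35, 0.8])
y = np.array([0, 0, 1, 1])
tp, fp, fn, tn = confusion(scores, y, tau=0.5)
fpr, tpr, auc = roc_auc(scores, y)
```

Precision is TP / (TP + FP) and recall is TP / (TP + FN); on this four-point example τ = 0.5 gives precision 1.0 and recall 0.5.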

Hierarchical Clustering

Agglomerative: start with each point as its own cluster, repeatedly merge the closest pair. The merge order yields a dendrogram — cut at any height to get a flat clustering.

Click to add points. Lines show last merges.

Dendrogram — drag the horizontal cut line via the slider.

Linkage

Cut height = 50

—clusters
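A deliberately naive sketch of the merge loop with a cut height. It recomputes every cluster-to-cluster distance on each pass, which is fine for a handful of points but far slower than library implementations. The function name and the three linkage options are assumptions about what the Linkage menu offers.

```python
import numpy as np

def agglomerative(X, cut_height, linkage="single"):
    """Merge the closest pair of clusters until the next merge would exceed cut_height."""
    agg = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # point-to-point distances
    clusters = [[i] for i in range(len(X))]                     # every point starts alone
    while len(clusters) > 1:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = agg(D[np.ix_(clusters[a], clusters[b])])    # linkage distance between a and b
                if d < best[0]:
                    best = (d, a, b)
        if best[0] > cut_height:
            break                                               # cutting the dendrogram here
        _, a, b = best
        merged = clusters.pop(b)                                # b > a, so index a stays valid
        clusters[a].extend(merged)
    return clusters

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
two_clusters = agglomerative(X, cut_height=2.0)
```

Stopping at the first merge above the cut is equivalent to cutting the full dendrogram for these three linkages, because their merge heights never decrease.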

Anomaly Detection

Learn what "normal" looks like, then flag points the model finds surprising. Three classic flavours: Gaussian likelihood, Mahalanobis distance, and a one-class radius around the densest region.

Method

Contamination = 5%

k (knn) = 5

—flagged

Click to add a single point and see if it's flagged. Heat-map shows the anomaly score across input space; circled points are above the contamination cutoff.
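Of the three flavours, the Mahalanobis variant is the shortest to sketch. The cutoff is taken as the (1 − contamination) quantile of the training scores, which is an assumption about how the contamination slider is applied; both function names are hypothetical.

```python
import numpy as np

def mahalanobis_scores(X_train, X_query):
    """Distance of each query point from the training mean, measured in units of the training covariance."""
    mu = X_train.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X_train, rowvar=False))
    diff = X_query - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

def flag_anomalies(X_train, X_query, contamination=0.05):
    """Flag query points whose score exceeds the chosen quantile of the training scores."""
    cutoff = np.quantile(mahalanobis_scores(X_train, X_train), 1.0 - contamination)
    return mahalanobis_scores(X_train, X_query) > cutoff

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, (200, 2))              # synthetic "normal" data
queries = np.array([[0.0, 0.0], [8.0, 8.0]])
flags = flag_anomalies(X_train, queries)
```

By construction, roughly the chosen contamination fraction of the training points is flagged, whether or not they are genuinely unusual.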