Statistical Learning & Prediction

An interactive companion to the SLP course. Every section below is a live, click-and-drag visualisation of a core concept — from k-NN and bias/variance up to kernel SVMs, ensembles, and PCA. Drag points, move sliders, watch the model react.

How to use

  • Click on any canvas to add data points.
  • Drag sliders to change hyper-parameters; the model re-fits live.
  • Use Reset / Resample buttons in each demo.
  • Two-class problems use blue and pink.

Topics covered

k-NN · linear/ridge/lasso regression · bias–variance · gradient descent · softmax + SGD · activation functions · MLP + backprop · decision trees · bagging · random forest · AdaBoost · kernel trick · soft-margin SVM · PCA · MDS · k-means

k-Nearest Neighbours

A lazy, non-parametric classifier: to predict a point's class, look at its k nearest training points and take a (weighted) majority vote. No training — all the work happens at query time.

Click on the canvas to add training points. The shaded background is the kNN decision region for the current k.

Intuition

Small k ⇒ jagged, high-variance decision boundary that fits noise. Large k ⇒ smoother but biased boundary. To choose k, plot validation error against k. Because k-NN relies entirely on distances, the distance metric and feature scaling strongly affect the result.
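
The vote described above can be sketched in a few lines. This is a minimal illustration, not the demo's own code: the function name `knn_predict`, the unweighted vote, and the Euclidean metric are all assumptions.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training points nearest to `query`.

    `train` is a list of ((x, y), label) pairs (an illustrative layout).
    """
    # Sort training points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Count labels among the k closest; ties go to the label seen first.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "blue"), ((0.2, 0.1), "blue"), ((0.1, 0.3), "blue"),
         ((1.0, 1.0), "pink"), ((0.9, 1.2), "pink")]
print(knn_predict(train, (0.1, 0.1), k=3))
print(knn_predict(train, (1.0, 1.1), k=1))
```

Note that there is no fit step: the training list is simply stored and scanned at query time.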

Linear Regression

Fit ŷ = w·x + b by minimising mean-squared error. Drag points around — the OLS solution updates in closed form.

train MSE
Click to add points; drag existing points to move them. Increasing polynomial degree shows over-fitting — high degree memorises noise.
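
For a single input feature the closed-form solution is short enough to write out. The sketch below assumes 1-D inputs and a straight-line fit (degree 1); the helper names `fit_line` and `mse` are illustrative.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y ≈ w*x + b for 1-D inputs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the intercept makes the
    # line pass through the point of means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b

def mse(xs, ys, w, b):
    """Mean-squared error of the line on the given points."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # points lying exactly on y = 2x + 1
w, b = fit_line(xs, ys)
print(w, b, mse(xs, ys, w, b))
```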

Loss Functions

A loss measures how wrong a single prediction is. Different losses give very different sensitivities to outliers and large errors.

L2 (squared) · L1 (absolute) · Huber · 0/1 (classification) · hinge · log-loss

L2 punishes large errors quadratically (sensitive to outliers). L1 is robust but non-differentiable at 0. Huber blends both. Hinge and log-loss are the SVM and logistic regression losses.
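
The six losses can be written as one-argument functions. In this sketch the regression losses take a residual r = prediction - target, and the classification losses take a margin m = y·f(x) with y in {-1, +1}; that convention and the Huber threshold `delta` are assumptions made for illustration.

```python
import math

def l2(r):
    return r * r

def l1(r):
    return abs(r)

def huber(r, delta=1.0):
    # Quadratic near zero, linear in the tails.
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)

def zero_one(m):
    # 1 for a wrong (or zero-margin) prediction, 0 otherwise.
    return 0.0 if m > 0 else 1.0

def hinge(m):
    return max(0.0, 1.0 - m)

def log_loss(m):
    return math.log1p(math.exp(-m))

for r in (0.5, 3.0):
    print(r, l2(r), l1(r), huber(r))
```

At r = 3 the squared loss is 9 while Huber is 2.5, which is the outlier sensitivity the text refers to.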

Regularization — Ridge & Lasso

Adding a penalty on the coefficient norm shrinks weights, trading bias for reduced variance. L2 (ridge) shrinks smoothly; L1 (lasso) zeroes coefficients — it's a feature selector.

Coefficient bar chart

Watch coefficients shrink as λ grows. With lasso, many bars snap to exactly zero — that's sparsity.
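
Why lasso produces exact zeros while ridge does not is easiest to see for a single coefficient. The sketch below assumes the simplified one-dimensional objectives stated in the comments (equivalent to an orthonormal design); the function names are illustrative.

```python
def ridge_shrink(w_ols, lam):
    # Minimiser of 0.5*(w - w_ols)**2 + 0.5*lam*w**2:
    # a proportional shrink that never reaches exactly zero.
    return w_ols / (1.0 + lam)

def lasso_shrink(w_ols, lam):
    # Minimiser of 0.5*(w - w_ols)**2 + lam*abs(w):
    # soft-thresholding, which clips small coefficients to zero.
    if w_ols > lam:
        return w_ols - lam
    if w_ols < -lam:
        return w_ols + lam
    return 0.0

for w in (2.0, 0.3, -2.0):
    print(w, ridge_shrink(w, 0.5), lasso_shrink(w, 0.5))
```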

Bias–Variance Tradeoff

For squared-error loss, expected test error decomposes into bias² + variance + irreducible noise. Simple models have high bias; complex ones have high variance. The sweet spot minimises the sum of bias² and variance.

Each thin line is a polynomial fit on a different random sample. Wider spread ⇒ higher variance.
Train vs. test error as model complexity grows.

Gradient Descent

Walk downhill on a loss surface by stepping opposite to the gradient. The step size (learning rate, η) makes or breaks it: too small and progress crawls, too large and the iterates diverge.

Click on the surface to set the starting point. Too-large η explodes; momentum smooths out ravines.
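
The update rule can be tried on a simple elongated bowl. This sketch assumes the surface f(x, y) = 0.5·(x² + 10y²) and a heavy-ball form of momentum; neither is taken from the demo itself.

```python
def grad(p):
    x, y = p
    return (x, 10.0 * y)   # gradient of f(x, y) = 0.5 * (x**2 + 10 * y**2)

def descend(start, lr, momentum=0.0, steps=100):
    """Gradient descent with optional heavy-ball momentum."""
    p = list(start)
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(p)
        for i in range(2):
            # Velocity keeps a fraction of the previous step.
            v[i] = momentum * v[i] - lr * g[i]
            p[i] += v[i]
    return tuple(p)

print(descend((3.0, 2.0), lr=0.1))             # converges towards (0, 0)
print(descend((3.0, 2.0), lr=0.25, steps=50))  # y blows up: lr is too large
print(descend((3.0, 2.0), lr=0.02, momentum=0.9, steps=200))
```

On this surface plain descent is stable only for lr below 0.2, because the steep y direction has curvature 10.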

Softmax Regression with SGD

Multinomial logistic regression: outputs a probability over K classes via softmax, trained by minimising cross-entropy with mini-batch SGD.

cross-entropy
accuracy
Click to add a point of the currently selected class. The decision boundaries are linear: softmax regression on the raw inputs can only separate classes with straight lines.
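
A minimal sketch of the softmax, the cross-entropy, and one SGD update follows. It uses a batch of one example for brevity (the text describes mini-batches), and `sgd_step` and its argument layout are illustrative assumptions.

```python
import math

def softmax(scores):
    m = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    return -math.log(probs[label])

def sgd_step(W, b, x, label, lr=0.1):
    """One in-place SGD update on a single example; returns the loss
    measured before the update. W holds one weight row per class."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_k
              for row, b_k in zip(W, b)]
    probs = softmax(scores)
    for k in range(len(W)):
        # Gradient of cross-entropy w.r.t. score k is p_k - 1[k == label].
        err = probs[k] - (1.0 if k == label else 0.0)
        W[k] = [w - lr * err * x_i for w, x_i in zip(W[k], x)]
        b[k] -= lr * err
    return cross_entropy(probs, label)

W = [[0.0, 0.0] for _ in range(3)]
b = [0.0, 0.0, 0.0]
first = sgd_step(W, b, (1.0, 2.0), label=2)
second = sgd_step(W, b, (1.0, 2.0), label=2)
print(first, second)
```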

Activation Functions

The non-linearity inside each neuron. Choosing the right one controls saturation, dead units, and gradient flow.

f(x) · f'(x)

Sigmoid & tanh saturate ⇒ vanishing gradients. ReLU is cheap but can die. LeakyReLU/ELU/GELU/Swish are common modern choices.
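
Saturation and dead units show up directly in the derivatives. The sketch below covers four of the functions named above; the leaky slope of 0.01 is a common default chosen here as an assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def relu(x):
    return max(0.0, x)

def d_relu(x):
    return 1.0 if x > 0 else 0.0      # zero gradient for x <= 0: a "dead" unit

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def d_leaky_relu(x, slope=0.01):
    return 1.0 if x > 0 else slope    # small but non-zero gradient

for x in (0.0, 10.0, -10.0):
    print(x, d_sigmoid(x), d_tanh(x), d_relu(x), d_leaky_relu(x))
```

At x = 10 the sigmoid and tanh derivatives are both tiny, which is the vanishing-gradient effect.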

MLP & Backpropagation

A small fully-connected net trained live on a 2-D classification task. Backprop is just chained partial derivatives — watch the boundary curve as gradients flow.

loss
0 epochs
Linear models can't solve XOR or spirals — adding hidden units carves curved boundaries.
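
To see why hidden units are enough for XOR, here is a two-unit ReLU network with hand-picked weights. It only shows the forward pass; the weights are chosen by hand for illustration and are not what backprop in the demo would necessarily learn.

```python
def relu(z):
    return max(0.0, z)

def tiny_net(x1, x2):
    # Hidden layer: two ReLU units with hand-picked (not learned) weights.
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    # Output layer: a linear read-out of the hidden activations.
    return h1 - 2.0 * h2

for a in (0, 1):
    for b in (0, 1):
        print(a, b, tiny_net(a, b))
```

The output is 1 exactly when one input is on, which no single linear unit can produce.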

Decision Trees

Recursively split feature space along axis-aligned cuts that minimise impurity (Gini / entropy). Each leaf predicts the majority class within its rectangle.

Tree structure — node labels show split feature & threshold.
Click to add labelled points. A sufficiently deep tree can fit the training data perfectly, which is over-fitting.
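
A single split search can be sketched as follows. It scans one feature and scores each candidate threshold by weighted Gini impurity; the names `gini` and `best_split` and the midpoint thresholds are illustrative choices.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best cut on one feature."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                     # no threshold between equal values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (thr, score)
    return best

values = [1, 2, 3, 10, 11, 12]
labels = ["blue", "blue", "blue", "pink", "pink", "pink"]
print(best_split(values, labels))
```

A full tree repeats this search over every feature and then recurses into the two halves.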

Bagging — Bootstrap Aggregation

Train many high-variance learners on bootstrap resamples of the data, then average. Variance ↓, bias ≈ unchanged.

A single deep tree is wiggly. Averaging many trees on different bootstraps gives a smooth, lower-variance boundary.
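
The resample-then-vote loop looks roughly like this. To keep the sketch short it uses a 1-nearest-neighbour rule on 1-D data as a stand-in high-variance learner instead of a deep tree; that substitution, the seed, and all names are assumptions.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Sample len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def fit_1nn(sample):
    """Stand-in high-variance base learner: 1-nearest neighbour in 1-D."""
    def predict(q):
        return min(sample, key=lambda item: abs(item[0] - q))[1]
    return predict

def bagged_predict(data, query, fit, n_models=25, seed=0):
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        model = fit(bootstrap(data, rng))   # each model sees a different resample
        votes[model(query)] += 1
    return votes.most_common(1)[0][0]       # majority vote

data = [(0.0, "blue"), (0.1, "blue"), (0.2, "blue"),
        (1.0, "pink"), (1.1, "pink"), (1.2, "pink")]
print(bagged_predict(data, 0.05, fit_1nn))
```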

Random Forest

Bagging + random feature subset at each split — de-correlates the trees, reducing ensemble variance further.

In 2-D there are only 2 features, so "feature bagging" is dramatic — boundaries get rounder.

Boosting — AdaBoost

Sequentially fit weak learners (decision stumps) on re-weighted data — misclassified samples get more attention each round.

train err
Each stump (vertical/horizontal line) is shown faintly. Sample dot size shows current weight — hard examples grow.
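
One round of the re-weighting can be sketched as below. It assumes labels in {-1, +1}, weights that already sum to 1, and a weighted error strictly between 0 and 1; `adaboost_round` is an illustrative name.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """Return the learner weight alpha and the updated sample weights."""
    # Weighted error of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified samples (t * p == -1) are scaled up, correct ones down.
    new = [w * math.exp(-alpha * t * p)
           for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new)
    return alpha, [w / total for w in new]

weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]          # the last sample is misclassified
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the update the misclassified sample carries half of the total weight, so the next stump has to take it seriously.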

The Kernel Trick

Data that is not linearly separable in low dimensions often becomes linearly separable when lifted to higher dimensions. A kernel computes the inner product in that lifted space directly from the original inputs, so the feature map never has to be evaluated explicitly.

2-D view (input space)
3-D view: φ(x,y) = (x, y, x²+y²) — a paraboloid lift.
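
For the paraboloid lift shown here, the matching kernel can be written down by hand: φ(p)·φ(q) = p·q + |p|²|q|². The sketch below checks that identity numerically; the function names are illustrative.

```python
def lift(p):
    """Explicit feature map phi(x, y) = (x, y, x^2 + y^2)."""
    x, y = p
    return (x, y, x * x + y * y)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def kernel(p, q):
    """Same inner product, computed without building the 3-D vectors."""
    return dot(p, q) + dot(p, p) * dot(q, q)

p, q = (1, 2), (3, -1)
print(dot(lift(p), lift(q)), kernel(p, q))
```

Both expressions give the same number; the kernel version never leaves the 2-D input space.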

Support Vector Machine

Maximum-margin linear classifier with slack (soft margin). Only the support vectors determine the boundary.

Support vectors are circled. With RBF kernel you get curved boundaries — high γ means very local, low γ means smooth.
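
The soft-margin objective and the RBF kernel can be sketched as plain functions. This is not a solver: it evaluates the objective for a given (w, b), lists the points on or inside the margin (which are the support vectors once w and b are optimal), and shows how γ controls locality. The names and the penalty parameter `C` are assumptions.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def svm_objective(w, b, X, y, C=1.0):
    """0.5*|w|^2 + C * sum of hinge slacks, for labels in {-1, +1}."""
    slack = sum(max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y))
    return 0.5 * dot(w, w) + C * slack

def on_or_inside_margin(w, b, X, y, tol=1e-9):
    """Points with functional margin <= 1."""
    return [xi for xi, yi in zip(X, y) if yi * (dot(w, xi) + b) <= 1.0 + tol]

def rbf(a, b, gamma):
    return math.exp(-gamma * math.dist(a, b) ** 2)

X = [(2.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (-3.0, 0.0)]
y = [1, 1, -1, -1]
print(svm_objective((1.0, 0.0), 0.0, X, y))
print(on_or_inside_margin((1.0, 0.0), 0.0, X, y))
print(rbf((0.0, 0.0), (1.0, 0.0), gamma=0.1), rbf((0.0, 0.0), (1.0, 0.0), gamma=10.0))
```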

Principal Component Analysis

Find orthogonal axes that capture the most variance. Drag points — the principal components rotate to follow the data's spread.

Original 2-D data with PC1 (red) and PC2 (cyan).
Projection onto chosen number of components.
Click to add points or drag existing ones. With k=1 you see the lossy 1-D reconstruction.
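
In two dimensions the principal axes have a closed form, so no eigen-solver is needed. The sketch below assumes 2-D points and the sample covariance (dividing by n - 1); `pca_2d` and `reconstruct_1d` are illustrative names.

```python
import math

def pca_2d(points):
    """Return mean, PC1, PC2 and the two eigenvalues of the 2x2 covariance."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / (n - 1)           # var(x)
    c = sum((p[1] - my) ** 2 for p in points) / (n - 1)           # var(y)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)  # cov(x, y)
    # Angle of the major axis of the symmetric matrix [[a, b], [b, c]].
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    pc1 = (math.cos(theta), math.sin(theta))
    pc2 = (-math.sin(theta), math.cos(theta))
    mid, half = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
    return (mx, my), pc1, pc2, (mid + half, mid - half)

def reconstruct_1d(points):
    """Project onto PC1 only, then map back to 2-D (lossy in general)."""
    mean, pc1, _, _ = pca_2d(points)
    out = []
    for x, y in points:
        t = (x - mean[0]) * pc1[0] + (y - mean[1]) * pc1[1]
        out.append((mean[0] + t * pc1[0], mean[1] + t * pc1[1]))
    return out

points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
mean, pc1, pc2, eigenvalues = pca_2d(points)
print(pc1, eigenvalues)
```

For points lying on a line, the second eigenvalue is zero and the 1-D reconstruction loses nothing.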

Multidimensional Scaling

Embed objects into 2-D such that pairwise Euclidean distances best match a given dissimilarity matrix. Useful when you only have pairwise dissimilarities, not coordinates.

stress
Stress is the sum of squared mismatches between target dissimilarities and current 2-D distances. SMACOF-style updates pull and push points until it settles.
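
The stress value itself is simple to compute. This sketch uses the raw (unnormalised) form summed over each pair once; whether the demo normalises it is not stated, so treat that as an assumption.

```python
import math

def stress(coords, target):
    """Sum over pairs of (embedded distance - target dissimilarity)^2."""
    total = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            total += (d - target[i][j]) ** 2
    return total

coords = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
exact = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]     # matches a 3-4-5 triangle
off = [[0, 4, 4], [4, 0, 5], [4, 5, 0]]       # first pair should be 4 apart
print(stress(coords, exact), stress(coords, off))
```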

k-Means Clustering

Pick k centroids, assign each point to its nearest one, recompute centroids — repeat until stable. Greedy local-minimum descent on within-cluster sum-of-squares.

0 iter
WCSS
Click to add a custom point. Bad initial centroids → bad local minimum; try k-means++.
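
The assign-and-recompute loop can be sketched as follows. Initial centroids are passed in explicitly (no k-means++ here), and an empty cluster simply keeps its old centroid; both are simplifying assumptions.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's iterations; returns final centroids and the WCSS."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)),
                    key=lambda k: math.dist(p, centroids[k]))
            clusters[j].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:
            break                        # assignments are stable
        centroids = new
    wcss = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return centroids, wcss

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
centroids, wcss = kmeans(points, [(0.0, 0.0), (5.0, 5.0)])
print(centroids, wcss)
```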

Logistic Regression

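Linear classifier with a sigmoid squash. Fit by maximising likelihood (= minimising binary cross-entropy). The output is a probability, which is useful when you need scores to threshold or rank; these probabilities are often reasonably well calibrated, though that is not guaranteed.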

2-D classifier. White line is the 0.5 boundary; pink/blue gradient is P(y=1|x).
σ(z) curve with marker at the currently hovered point's z = w·x+b.
BCE
acc
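
A minimal sketch of the BCE loss and one full-batch gradient step follows. The learning rate, the small `eps` guard inside the logarithm, and the function names are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def bce(w, b, X, y, eps=1e-12):
    """Mean binary cross-entropy for labels in {0, 1}."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(w, b, xi)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1.0 - p + eps)
    return total / len(X)

def gradient_step(w, b, X, y, lr=0.5):
    """One full-batch step; the gradient w.r.t. z is simply (p - y)."""
    gw, gb = [0.0] * len(w), 0.0
    for xi, yi in zip(X, y):
        err = predict_proba(w, b, xi) - yi
        gw = [g + err * x for g, x in zip(gw, xi)]
        gb += err
    n = len(X)
    return [wi - lr * g / n for wi, g in zip(w, gw)], b - lr * gb / n

X = [(0.0, 0.0), (1.0, 1.0), (3.0, 3.0), (4.0, 4.0)]
y = [0, 0, 1, 1]
w, b = [0.0, 0.0], 0.0
before = bce(w, b, X, y)
w, b = gradient_step(w, b, X, y)
after = bce(w, b, X, y)
print(before, after)
```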

k-Fold Cross-Validation

Split the data into k folds; train on k-1 and validate on the held-out one. Rotate. The averaged validation error is a less-noisy estimate of test error than a single holdout split.

fold MSE
CV MSE
Held-out fold shown in orange. Train MSE is computed on the blue points, validation MSE on the orange ones. Sweep degree to find the bias/variance sweet spot.
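
The rotation over folds can be sketched as an index generator plus an averaging loop. This version uses contiguous, unshuffled folds and a deliberately trivial model (predict the training mean); in practice you would usually shuffle first. All names are illustrative.

```python
def kfold_indices(n, k):
    """Yield (train_idx, val_idx) for k contiguous folds over range(n)."""
    # Spread any remainder over the first n % k folds.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

def cross_val_mse(xs, ys, k, fit):
    """Average validation MSE; fit(xs, ys) must return a predict(x) callable."""
    fold_errors = []
    for train, val in kfold_indices(len(xs), k):
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        err = sum((predict(xs[i]) - ys[i]) ** 2 for i in val) / len(val)
        fold_errors.append(err)
    return sum(fold_errors) / k

def fit_mean(xs, ys):
    m = sum(ys) / len(ys)
    return lambda x: m               # ignores x: a constant baseline

folds = list(kfold_indices(7, 3))
print([val for _, val in folds])
print(cross_val_mse([0, 1, 2, 3, 4, 5], [2.0] * 6, 3, fit_mean))
```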

ROC, PR & Confusion Matrix

A trained classifier outputs a score; sliding the decision threshold trades false positives against false negatives. ROC plots TPR vs FPR; PR plots precision vs recall.

Class score histograms — drag the τ slider to move the threshold.
ROC curve. AUC printed below. Current threshold marked.
Precision-Recall curve.

Confusion matrix at τ

Accuracy · Precision · Recall · F1 · AUC
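
The threshold sweep can be sketched as follows. It assumes labels in {0, 1}, at least one example of each class, the convention "predict positive when score ≥ τ", and trapezoidal integration for AUC; the names are illustrative.

```python
def confusion(scores, labels, tau):
    """Return (tp, fp, tn, fn) when predicting 1 for score >= tau."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        pred = 1 if s >= tau else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def roc_points(scores, labels):
    """(FPR, TPR) pairs as the threshold sweeps from high to low."""
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    pts = []
    for tau in thresholds:
        tp, fp, tn, fn = confusion(scores, labels, tau)
        pts.append((fp / (fp + tn), tp / (tp + fn)))
    return pts

def auc(pts):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(confusion(scores, labels, 0.5))
print(auc(roc_points(scores, labels)))
```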

Hierarchical Clustering

Agglomerative: start with each point as its own cluster, repeatedly merge the closest pair. The merge order yields a dendrogram — cut at any height to get a flat clustering.

Click to add points. Lines show the most recent merges.
Dendrogram — drag the horizontal cut line via the slider.
clusters
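
The merge loop can be sketched with a brute-force search for the closest pair. "Closest" needs a linkage rule, which the text does not specify; this sketch assumes single linkage (minimum point-to-point distance), and stopping at a target cluster count plays the role of the dendrogram cut.

```python
import math

def agglomerate(points, n_clusters):
    """Single-linkage merging; returns the clusters and the merge distances."""
    clusters = [[p] for p in points]      # every point starts alone
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append(d)                  # heights of the dendrogram joins
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
clusters, merges = agglomerate(points, 2)
print(clusters, merges)
```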

Anomaly Detection

Learn what "normal" looks like, then flag points the model finds surprising. Three classic flavours: Gaussian likelihood, Mahalanobis distance, and a one-class radius around the densest region.

flagged
Click to add a single point and see if it's flagged. Heat-map shows the anomaly score across input space; circled points are above the contamination cutoff.
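
Of the three flavours, the Mahalanobis score is the easiest to sketch for 2-D data, since the 2×2 covariance can be inverted by hand. The cutoff value and the function names below are assumptions; the demo derives its cutoff from a contamination setting instead.

```python
import math

def fit_gaussian(points):
    """Return the mean and the covariance entries (var_x, cov_xy, var_y)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (a, b, c)

def mahalanobis(p, mean, cov):
    """Distance from the mean in units of the data's own spread."""
    a, b, c = cov
    det = a * c - b * b               # must be non-zero (non-degenerate data)
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    # d^2 = [dx dy] * inverse([[a, b], [b, c]]) * [dx dy]^T
    return math.sqrt((c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det)

def is_flagged(p, mean, cov, cutoff=2.5):
    return mahalanobis(p, mean, cov) > cutoff

normal = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 1.0)]
mean, cov = fit_gaussian(normal)
print(mahalanobis((4.0, 1.0), mean, cov), is_flagged((4.0, 1.0), mean, cov))
```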