Statistical Learning & Prediction
An interactive companion to the SLP course. Every section below is a live, click-and-drag visualisation of a core concept — from k-NN and bias/variance up to kernel SVMs, ensembles, and PCA. Drag points, move sliders, watch the model react.
How to use
- Click on any canvas to add data points.
- Drag sliders to change hyper-parameters; the model re-fits live.
- Use Reset / Resample buttons in each demo.
- Two-class problems use blue and pink.
Topics covered
k-NN · linear/ridge/lasso regression · bias–variance · gradient descent · softmax + SGD · activation functions · MLP + backprop · decision trees · bagging · random forest · AdaBoost · kernel trick · soft-margin SVM · PCA · MDS · k-means
k-Nearest Neighbours
A lazy, non-parametric classifier: to predict a point's class, look at its k nearest training points and take a (weighted) majority vote. No training — all the work happens at query time.
Intuition
Small k ⇒ jagged, high-variance decision boundary that fits noise. Large k ⇒ smoother but biased boundary. Plot a validation curve over k to choose. Distance metric and feature scaling matter a lot.
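The vote can be sketched in a few lines of plain Python. This is a minimal illustration, assuming Euclidean distance and an unweighted vote; the function name knn_predict and the toy data are made up for the example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of ((x, y), label) pairs; distance is Euclidean.
    """
    # Sort the training points by distance to the query (all work happens here).
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two blue points near the origin, three pink further out.
train = [((0.0, 0.0), "blue"), ((0.2, 0.1), "blue"),
         ((1.0, 1.0), "pink"), ((1.1, 0.9), "pink"), ((0.9, 1.2), "pink")]

print(knn_predict(train, (0.1, 0.1), k=1))   # follows the closest point
print(knn_predict(train, (0.1, 0.1), k=5))   # follows the overall majority
```

With k equal to the training-set size the prediction collapses to the majority class, which is the "large k is biased" end of the trade-off.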
Linear Regression
Fit ŷ = w·x + b by minimising mean-squared error. Drag points around — the OLS solution updates in closed form.
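For a single feature the closed form is short enough to write out. A sketch, assuming one input variable and at least two distinct x values; fit_ols is an illustrative name, not the demo's actual code.

```python
def fit_ols(xs, ys):
    """Closed-form least-squares fit of y ≈ w*x + b for one feature."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the intercept makes the line
    # pass through the point of means.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / sxx
    b = y_mean - w * x_mean
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # exactly y = 2x + 1
w, b = fit_ols(xs, ys)
print(w, b)
```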
Loss Functions
A loss measures how wrong a single prediction is. Different losses give very different sensitivities to outliers and large errors.
L2 punishes large errors quadratically, so it is sensitive to outliers. L1 is more robust but non-differentiable at 0. Huber blends both: quadratic near zero, linear in the tails. Hinge and log-loss are the classification losses behind SVMs and logistic regression, respectively.
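The five losses written as plain functions, as a rough sketch. The regression losses take the residual r = y - ŷ; the classification losses take the margin m = y·f(x) with labels in {-1, +1}. The Huber switch point delta = 1 is an assumed default.

```python
import math

def l2_loss(r):
    return r * r                      # grows quadratically with the residual

def l1_loss(r):
    return abs(r)                     # grows linearly; kink at r = 0

def huber_loss(r, delta=1.0):
    # Quadratic near zero, linear in the tails; delta is the switch point.
    if abs(r) <= delta:
        return 0.5 * r * r
    return delta * (abs(r) - 0.5 * delta)

def hinge_loss(m):
    return max(0.0, 1.0 - m)          # zero once the margin exceeds 1

def log_loss(m):
    # Logistic loss in terms of the margin: log(1 + exp(-m)).
    return math.log1p(math.exp(-m))

for r in (0.5, 3.0, 10.0):
    print(r, l2_loss(r), l1_loss(r), huber_loss(r))
```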
Regularization — Ridge & Lasso
Adding a penalty on the coefficient norm shrinks weights, trading a little extra bias for reduced variance. L2 (ridge) shrinks all coefficients smoothly towards zero; L1 (lasso) can drive some of them exactly to zero, so it doubles as a feature selector.
Coefficient bar chart
Watch coefficients shrink as λ grows. With lasso, many bars snap to exactly zero — that's sparsity.
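Why lasso bars snap to zero while ridge bars only shrink is easiest to see in a special case. The sketch below assumes orthonormal features (XᵀX = I), with ridge minimising ½‖y − Xw‖² + (λ/2)‖w‖² and lasso minimising ½‖y − Xw‖² + λ‖w‖₁; under that assumption both solutions are simple functions of the unpenalised coefficients. The coefficient values are invented for illustration.

```python
def ridge_shrink(w_ols, lam):
    """Ridge solution for orthonormal features: uniform rescaling."""
    return [w / (1.0 + lam) for w in w_ols]

def lasso_shrink(w_ols, lam):
    """Lasso solution for orthonormal features: soft-thresholding."""
    out = []
    for w in w_ols:
        if w > lam:
            out.append(w - lam)       # shrink towards zero from above
        elif w < -lam:
            out.append(w + lam)       # shrink towards zero from below
        else:
            out.append(0.0)           # small coefficients are set exactly to zero
    return out

w_ols = [3.0, -1.5, 0.4, -0.2]        # hypothetical unpenalised coefficients
for lam in (0.0, 0.5, 2.0):
    print(lam, ridge_shrink(w_ols, lam), lasso_shrink(w_ols, lam))
```

With general, correlated features neither solution has this closed form, but the qualitative picture (rescaling versus thresholding) carries over.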
Bias–Variance Tradeoff
Under squared-error loss, expected test error decomposes into bias² + variance + irreducible noise. Simple models tend to have high bias; complex ones tend to have high variance. The sweet spot minimises the sum of the two.
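Written out for squared-error loss at a fixed input x, the decomposition reads as follows. This assumes the usual setup, which the page does not spell out: y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², and expectations taken over both the training set used to fit f̂ and the noise.

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```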
Gradient Descent
Walk downhill on a loss surface by stepping opposite to the gradient. Step-size (learning rate) makes or breaks it.
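A toy run shows how much the step size matters. This sketch assumes the one-dimensional loss f(x) = x², chosen only because its behaviour is easy to predict; the two learning rates are illustrative.

```python
def gradient_descent(grad, x0, lr, steps):
    """Repeatedly step opposite to the gradient, recording every iterate."""
    path = [x0]
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
        path.append(x)
    return path

# Toy loss f(x) = x**2 with gradient 2x; its minimum is at x = 0.
grad = lambda x: 2.0 * x

small = gradient_descent(grad, x0=1.0, lr=0.1, steps=50)   # each step multiplies x by 0.8
large = gradient_descent(grad, x0=1.0, lr=1.1, steps=50)   # each step multiplies x by -1.2

print(abs(small[-1]), abs(large[-1]))
```

On this loss any learning rate above 1.0 makes the iterates oscillate with growing amplitude, while rates below 1.0 converge.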
Softmax Regression with SGD
Multinomial logistic regression: outputs a probability over K classes via softmax, trained by minimising cross-entropy with mini-batch SGD.
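The per-example pieces are small. A sketch in plain Python, assuming a single example with integer class index target; the function names are illustrative. In a mini-batch SGD step, the logit gradient below would be averaged over the batch and propagated to the weights of the linear layer.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    # Negative log-probability assigned to the true class index.
    return -math.log(probs[target])

def logit_gradient(probs, target):
    # Gradient of cross-entropy w.r.t. the logits: probs - one_hot(target).
    return [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]

probs = softmax([2.0, 1.0, 0.1])
loss = cross_entropy(probs, target=0)
grad = logit_gradient(probs, target=0)
print(probs, loss, grad)
```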
Activation Functions
The non-linearity inside each neuron. Choosing the right one controls saturation, dead units, and gradient flow.
Sigmoid & tanh saturate ⇒ vanishing gradients. ReLU is cheap but can die. LeakyReLU/ELU/GELU/Swish are common modern choices.
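Saturation and dead units are statements about derivatives, so it helps to print them. A minimal sketch covering sigmoid, tanh, ReLU and LeakyReLU; the leak slope of 0.01 is an assumed, commonly used default.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # at most 0.25, tiny for large |z|

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2    # at most 1, tiny for large |z|

def relu(z):
    return max(0.0, z)

def d_relu(z):
    return 1.0 if z > 0 else 0.0      # exactly zero for negative inputs

def leaky_relu(z, slope=0.01):
    return z if z > 0 else slope * z

def d_leaky_relu(z, slope=0.01):
    return 1.0 if z > 0 else slope    # small but non-zero for negative inputs

for z in (-10.0, 0.0, 10.0):
    print(z, d_sigmoid(z), d_tanh(z), d_relu(z), d_leaky_relu(z))
```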
MLP & Backpropagation
A small fully-connected net trained live on a 2-D classification task. Backprop is just chained partial derivatives — watch the boundary curve as gradients flow.
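"Chained partial derivatives" can be checked numerically. The sketch below assumes a particular small architecture (2 inputs, 4 tanh hidden units, one sigmoid output, binary cross-entropy), which is not necessarily what the demo uses, and compares one backprop gradient entry against a finite-difference estimate. It needs NumPy.

```python
import numpy as np

def forward(params, X):
    W1, b1, W2, b2 = params
    h = np.tanh(X @ W1 + b1)                    # hidden activations, shape (n, 4)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))    # sigmoid output, shape (n, 1)
    return h, p

def loss(params, X, y):
    _, p = forward(params, X)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def backprop(params, X, y):
    W1, b1, W2, b2 = params
    h, p = forward(params, X)
    n = X.shape[0]
    d_out = (p - y) / n                         # dLoss / d(output pre-activation)
    gW2 = h.T @ d_out
    gb2 = d_out.sum(axis=0)
    d_hid = (d_out @ W2.T) * (1.0 - h ** 2)     # chain rule through tanh
    gW1 = X.T @ d_hid
    gb1 = d_hid.sum(axis=0)
    return [gW1, gb1, gW2, gb2]

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
y = (X[:, :1] * X[:, 1:] > 0).astype(float)     # XOR-like labels, shape (8, 1)
params = [rng.normal(size=(2, 4)), np.zeros(4), rng.normal(size=(4, 1)), np.zeros(1)]

# Central finite difference on one weight of the first layer.
eps = 1e-6
analytic = backprop(params, X, y)[0][0, 0]
params[0][0, 0] += eps
up = loss(params, X, y)
params[0][0, 0] -= 2 * eps
down = loss(params, X, y)
params[0][0, 0] += eps                          # restore the original weight
numeric = (up - down) / (2 * eps)
print(analytic, numeric)
```

A training loop would subtract a learning rate times each of the four gradients from the matching parameter.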
Decision Trees
Recursively split feature space along axis-aligned cuts that minimise impurity (Gini / entropy). Each leaf predicts the majority class within its rectangle.
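One level of the recursion, sketched for a single feature: compute Gini impurity and pick the threshold whose two children have the lowest weighted impurity. Trying midpoints between consecutive values is an assumed, common choice of candidate cuts; a full tree would repeat this over every feature and recurse into each side.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(xs, labels):
    """Best threshold on one feature, by weighted Gini of the two children."""
    best = (None, float("inf"))
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [lab for x, lab in zip(xs, labels) if x <= t]
        right = [lab for x, lab in zip(xs, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(xs)
        if score < best[1]:
            best = (t, score)
    return best

xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
labels = ["blue", "blue", "blue", "pink", "pink", "pink"]
threshold, impurity = best_split(xs, labels)
print(threshold, impurity)
```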
Bagging — Bootstrap Aggregation
Train many high-variance learners on bootstrap resamples of the data, then average. Variance ↓, bias ≈ unchanged.
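The recipe fits in one function. A sketch for regression, assuming the caller supplies a fit function that returns a predictor; the 1-nearest-neighbour learner and the data are hypothetical stand-ins for "a high-variance learner".

```python
import random

def bagged_predict(data, fit, x, n_models=200, seed=0):
    """Average the predictions of models fit on bootstrap resamples of `data`."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        # Bootstrap: draw len(data) points with replacement.
        sample = [rng.choice(data) for _ in data]
        model = fit(sample)
        preds.append(model(x))
    return sum(preds) / len(preds)

# Hypothetical high-variance learner: 1-nearest-neighbour regression.
def fit_1nn(sample):
    return lambda x: min(sample, key=lambda p: abs(p[0] - x))[1]

data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.3), (3.0, 2.8), (4.0, 4.2)]
single = fit_1nn(data)(2.4)                  # copies the y of the closest x
bagged = bagged_predict(data, fit_1nn, 2.4)  # blends nearby y values
print(single, bagged)
```

For classification the average would be replaced by a majority vote over the models.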
Random Forest
Bagging + random feature subset at each split — de-correlates the trees, reducing ensemble variance further.
Boosting — AdaBoost
Sequentially fit weak learners (decision stumps) on re-weighted data — misclassified samples get more attention each round.
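One reweighting round, sketched for labels in {-1, +1}. It assumes the current weights sum to 1 and that the weak learner's weighted error is strictly between 0 and 1; the stump's predictions here are invented for illustration.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost reweighting step for labels in {-1, +1}.

    Returns the learner's vote weight alpha and the new sample weights.
    """
    # Weighted error of the weak learner under the current distribution.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Up-weight mistakes (t * p = -1), down-weight correct points (t * p = +1).
    new = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new)
    return alpha, [w / total for w in new]

y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, +1, -1, -1, -1]     # a hypothetical stump that misses the last point
weights = [0.2] * 5
alpha, weights = adaboost_round(weights, y_true, y_pred)
print(alpha, weights)
```

After the update the misclassified points carry half of the total weight, so the next weak learner cannot do well by repeating the same mistakes.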
The Kernel Trick
Data that is not linearly separable in low dimensions often becomes linearly separable when lifted to a higher-dimensional feature space. A kernel computes inner products in that lifted space directly from the original inputs, so the feature map never has to be built explicitly.
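A concrete instance, assuming the homogeneous quadratic kernel k(x, z) = (x·z)² on 2-D inputs: its explicit feature map is φ(x) = (x₁², √2·x₁x₂, x₂²), and the two routes give the same number.

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2-D point (x1, x2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(x, z):
    """Homogeneous quadratic kernel: works in 2-D, never builds phi."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))    # inner product in the lifted 3-D space
implicit = poly_kernel(x, z)      # same value from the original 2-D inputs
print(explicit, implicit)
```

For kernels such as the RBF the lifted space is infinite-dimensional, so the implicit route is the only practical one.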
Support Vector Machine
Maximum-margin linear classifier with slack (soft margin). Only the support vectors determine the boundary.
Principal Component Analysis
Find orthogonal axes that capture the most variance. Drag points — the principal components rotate to follow the data's spread.
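One common way to compute the axes is an eigendecomposition of the sample covariance matrix. A NumPy sketch under that assumption; the synthetic point cloud is made up so that the answer is predictable.

```python
import numpy as np

def pca(X, n_components=2):
    """Principal axes and their variances via the covariance eigendecomposition."""
    Xc = X - X.mean(axis=0)                  # centre the data
    cov = np.cov(Xc, rowvar=False)           # (d, d) sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Hypothetical cloud stretched along the direction (1, 1) with a little noise.
X = np.column_stack([t, t]) + 0.1 * rng.normal(size=(200, 2))

components, variances = pca(X)
print(components[:, 0], variances)
```

The sign of each component is arbitrary, so the first axis may come out as either (1, 1)/√2 or its negation, up to noise.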
Multidimensional Scaling
Embed objects into 2-D such that pairwise Euclidean distances best match a given dissimilarity matrix. Useful when you only have pairwise dissimilarities between objects, not coordinates.
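There are several MDS variants. The sketch below shows classical (Torgerson) MDS, which solves the problem with one eigendecomposition; this is an assumed choice and may differ from the stress-minimising variant a demo might use. It needs NumPy, and the ground-truth points exist only to build a test dissimilarity matrix.

```python
import numpy as np

def classical_mds(D, dim=2):
    """Classical (Torgerson) MDS from a matrix of pairwise distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:dim]
    # Clip tiny negative eigenvalues from round-off or non-Euclidean input.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

def pairwise(P):
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical ground truth, used only to build a dissimilarity matrix.
points = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0], [1.0, 2.0]])
D = pairwise(points)
Y = classical_mds(D)
print(np.round(pairwise(Y) - D, 6))
```

The embedding matches the original points only up to rotation, reflection and translation, which is why the check compares distances rather than coordinates.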
k-Means Clustering
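Pick k centroids, assign each point to its nearest one, recompute centroids, and repeat until the assignments stop changing. Neither step can increase the within-cluster sum-of-squares, so the procedure converges, but only to a local minimum that depends on the initial centroids.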
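The two alternating steps, sketched in plain Python. Initial centroids are passed in explicitly here so the run is reproducible; leaving an empty cluster's centroid where it was is one assumed convention among several.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)   # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break                           # nothing moved: converged
        centroids = new_centroids
    return centroids, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, labels = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids, labels)
```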
Logistic Regression
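Linear classifier with a sigmoid squash. Fit by maximising likelihood (= minimising binary cross-entropy). Output is a probability, which makes it a natural choice when you need scores you can interpret as probabilities; how well calibrated they are still depends on the model fitting the data.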
k-Fold Cross-Validation
Split the data into k folds; train on k-1 and validate on the held-out one. Rotate. The averaged validation error is a less-noisy estimate of test error than a single holdout split.
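The rotation is mostly index bookkeeping. A sketch assuming contiguous folds over already-shuffled data; fit_and_score is a hypothetical caller-supplied function, not part of any library.

```python
def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over n samples."""
    # Spread the remainder so fold sizes differ by at most one.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

def cross_val_score(n, k, fit_and_score):
    """Average a validation score over the k folds.

    `fit_and_score(train_idx, val_idx)` is a caller-supplied (hypothetical)
    function that trains on one index set and returns the error on the other.
    """
    scores = [fit_and_score(tr, va) for tr, va in kfold_indices(n, k)]
    return sum(scores) / len(scores)

folds = list(kfold_indices(10, 3))
for train_idx, val_idx in folds:
    print(train_idx, val_idx)
```

Every sample lands in exactly one validation fold, so each point is used for validation once and for training k-1 times.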
ROC, PR & Confusion Matrix
A trained classifier outputs a score; sliding the decision threshold τ trades false positives against false negatives. ROC plots TPR vs FPR; PR plots precision vs recall.
Confusion matrix at τ
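Each threshold gives one confusion matrix, and each confusion matrix gives one point on the ROC and PR curves. A sketch assuming 0/1 labels and the rule "predict positive when score ≥ τ"; the scores are invented.

```python
def confusion_at(scores, labels, tau):
    """Confusion counts when predicting positive iff score >= tau (labels are 0/1)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < tau and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < tau and y == 0)
    return tp, fp, fn, tn

def rates(tp, fp, fn, tn):
    tpr = tp / (tp + fn) if tp + fn else 0.0          # true-positive rate = recall
    fpr = fp / (fp + tn) if fp + tn else 0.0          # false-positive rate
    precision = tp / (tp + fp) if tp + fp else 1.0    # convention when nothing is flagged
    return tpr, fpr, precision

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]   # hypothetical classifier scores
labels = [1, 1, 0, 1, 0, 0]

for tau in (0.2, 0.5, 0.7):
    counts = confusion_at(scores, labels, tau)
    print(tau, counts, rates(*counts))
```

Sweeping τ over all distinct scores and plotting (FPR, TPR) traces the ROC curve; plotting (recall, precision) traces the PR curve.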
Hierarchical Clustering
Agglomerative: start with each point as its own cluster, repeatedly merge the closest pair. The merge order yields a dendrogram — cut at any height to get a flat clustering.
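"Closest pair" depends on how the distance between two clusters is defined. The sketch below assumes single linkage (distance between the nearest members), which is only one of several options such as complete, average or Ward linkage. It uses a simple brute-force search, fine for a handful of points.

```python
import math

def single_linkage(points):
    """Agglomerative clustering with single linkage; returns the merge history."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]   # merge b into a
        del clusters[b]
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (20.0, 0.0)]
merges = single_linkage(points)
heights = [d for _, _, d in merges]
for left, right, d in merges:
    print(left, right, d)
```

The merge distances are the dendrogram heights; cutting between two consecutive heights keeps every merge below the cut and undoes the rest.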
Anomaly Detection
Learn what "normal" looks like, then flag points the model finds surprising. Three classic flavours: Gaussian likelihood, Mahalanobis distance, and a one-class radius around the densest region.
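The Mahalanobis flavour, sketched with NumPy. It assumes a single Gaussian fitted to the "normal" data and a cut-off of 3, which is an illustrative choice rather than a rule. Under a Gaussian model the log-likelihood is a decreasing function of this distance, so the first two flavours rank points identically.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate the mean and covariance of the 'normal' data."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def mahalanobis(x, mean, cov):
    """Distance of x from the mean, measured in units of the data's spread."""
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

rng = np.random.default_rng(0)
# Hypothetical 'normal' data: wide along the first axis, narrow along the second.
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.3])
mean, cov = fit_gaussian(X)

along_spread = mahalanobis(np.array([3.0, 0.0]), mean, cov)    # roughly one std away
across_spread = mahalanobis(np.array([0.0, 3.0]), mean, cov)   # roughly ten stds away
threshold = 3.0                                                # assumed cut-off

print(along_spread > threshold, across_spread > threshold)
```

Both query points are the same Euclidean distance from the origin; only the one that leaves the data's narrow direction is flagged.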