ml-lab · the foundational machine-learning pipeline, visualised

1. Train/test split & overfitting

Fit a polynomial of degree $d$ to noisy training points, then measure the error on a held-out test set. A low degree under-fits, so both errors stay high. A high degree chases the noise, so the test error rises even as the training error keeps falling, which is the signature of overfitting.

polynomial degree $d$ 3 · noise $\sigma$ 0.25 · train fraction 0.60

train RMSE—

test RMSE—

verdict—
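
The experiment described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the page's own implementation: the sine-shaped ground truth, the 40 sample points, the random seed and the helper names `rmse` and `fit_and_score` are all assumptions; only the noise level 0.25 and the train fraction 0.60 come from the panel defaults.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def fit_and_score(degree, x_train, y_train, x_test, y_test):
    """Least-squares polynomial fit; returns (train RMSE, test RMSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return (rmse(y_train, np.polyval(coeffs, x_train)),
            rmse(y_test, np.polyval(coeffs, x_test)))

rng = np.random.default_rng(0)                          # assumed seed
x = rng.uniform(-1.0, 1.0, 40)                          # assumed sample size
y = np.sin(np.pi * x) + rng.normal(0.0, 0.25, x.size)   # assumed ground truth

# x is already in random order, so a plain slice is a random split.
n_train = int(round(0.60 * x.size))
x_train, y_train = x[:n_train], y[:n_train]
x_test, y_test = x[n_train:], y[n_train:]

scores = {d: fit_and_score(d, x_train, y_train, x_test, y_test)
          for d in (1, 3, 9)}
for d, (train_err, test_err) in scores.items():
    print(f"degree {d}: train RMSE {train_err:.3f}, test RMSE {test_err:.3f}")
```

The training RMSE can only fall or stay equal as the degree rises, because every lower-degree polynomial is a special case of a higher-degree one. The test RMSE has no such guarantee, which is what makes the held-out set informative.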

2. Gradient descent — linear regression

Minimise the mean-squared-error loss $J(w,b)=\frac1m\sum_i (w x_i+b-y_i)^2$ by stepping downhill: $w \leftarrow w-\eta\,\partial J/\partial w$ and $b \leftarrow b-\eta\,\partial J/\partial b$. Watch the line settle onto the data and the loss decay. If the learning rate $\eta$ is too large, each step overshoots the minimum and the loss diverges.

learning rate $\eta$ 0.05

iteration 0

$w,\,b$—

MSE loss—
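
The update rule can be written out directly from the gradients $\partial J/\partial w=\frac2m\sum_i (w x_i+b-y_i)\,x_i$ and $\partial J/\partial b=\frac2m\sum_i (w x_i+b-y_i)$. The sketch below is illustrative only: the five data points (which lie exactly on $y=2x+1$), the iteration counts and the function names are assumptions.

```python
def gradient_step(w, b, xs, ys, eta):
    """One gradient-descent step on J(w, b) = mean((w*x + b - y)^2)."""
    m = len(xs)
    residuals = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = (2.0 / m) * sum(r * x for r, x in zip(residuals, xs))
    grad_b = (2.0 / m) * sum(residuals)
    return w - eta * grad_w, b - eta * grad_b

def mse(w, b, xs, ys):
    """Mean-squared-error loss J(w, b)."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # assumed toy data on the line y = 2x + 1

# Learning rate 0.05 (the panel default): the iterates settle on the line.
w, b = 0.0, 0.0
for _ in range(2000):
    w, b = gradient_step(w, b, xs, ys, eta=0.05)
print(f"eta=0.05: w={w:.4f}, b={b:.4f}, loss={mse(w, b, xs, ys):.2e}")

# Learning rate 0.2: too large for this data, so the loss blows up.
w_bad, b_bad = 0.0, 0.0
for _ in range(50):
    w_bad, b_bad = gradient_step(w_bad, b_bad, xs, ys, eta=0.2)
print(f"eta=0.20: loss={mse(w_bad, b_bad, xs, ys):.2e}")
```

For this particular data set the largest curvature of $J$ is about 13.4, so steps are stable only for $\eta$ below roughly $2/13.4\approx0.15$. The threshold depends on the data, which is why rescaling features changes which learning rates work.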

3. Bias–variance tradeoff

For squared-error loss, the expected test error decomposes as $\text{bias}^2+\text{variance}+\text{irreducible noise}$. As model complexity grows, bias falls but variance climbs. The sweet spot is the complexity that minimises the total.

complexity 5 · noise floor 0.15

bias²—

variance—

total error—

optimum at—
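
The decomposition can be derived in two steps. As a sketch under stated assumptions (none of these symbols appear in the panel): the data follow $y=f(x)+\varepsilon$ with $\mathbb{E}[\varepsilon]=0$ and $\operatorname{Var}(\varepsilon)=\sigma^2$, the model $\hat f$ is fitted on a random training set that is independent of $\varepsilon$, and $\bar f(x)=\mathbb{E}[\hat f(x)]$ is the prediction averaged over training sets.

$$
\begin{aligned}
\mathbb{E}\big[(y-\hat f(x))^2\big]
&= \mathbb{E}\big[(f(x)-\hat f(x))^2\big] + \sigma^2 \\
&= \underbrace{\big(f(x)-\bar f(x)\big)^2}_{\text{bias}^2}
 + \underbrace{\mathbb{E}\big[(\hat f(x)-\bar f(x))^2\big]}_{\text{variance}}
 + \underbrace{\sigma^2}_{\text{irreducible noise}}
\end{aligned}
$$

Both cross terms vanish. In the first line, $\varepsilon$ has zero mean and is independent of $\hat f$. In the second, $\mathbb{E}[\hat f(x)-\bar f(x)]=0$ by the definition of $\bar f$.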

4. Feature scaling & standardisation

Many algorithms assume comparable feature ranges. Standardisation $z=(x-\mu)/\sigma$ gives each axis zero mean and unit standard deviation; min–max scaling $x'=(x-x_{\min})/(x_{\max}-x_{\min})$ maps each axis to $[0,1]$. Watch the raw cloud transform and read off the new statistics.

method

mean (x, y)—

std (x, y)—

range x—

range y—
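
Both transforms are one-liners per axis. The sketch below uses only the standard library; the sample values and function names are assumptions, and it uses the population standard deviation (dividing by $n$), which may differ from the convention the panel uses.

```python
import statistics

def standardise(values):
    """z = (x - mu) / sigma, using the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Map values linearly so the minimum becomes 0 and the maximum 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # assumed sample, one axis

z = standardise(raw)
u = min_max(raw)
print("standardised:", [round(v, 3) for v in z])
print("min-max:     ", [round(v, 3) for v in u])
```

Neither function guards against a constant feature, where the standard deviation or the range is zero; a real pipeline would need to handle that case.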

5. k-nearest neighbours

A non-parametric classifier: a point is labelled by majority vote of its $k$ closest neighbours. The shaded background is the decision region. Click to drop a query point; raise $k$ to smooth the boundary.

neighbours $k$ 5

query class—

votes—

train accuracy—

Click the canvas to classify a new query point.
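
The majority vote can be sketched as follows. The six training points, the labels `"A"` and `"B"`, and the function name `knn_predict` are assumptions for illustration; Euclidean distance is also an assumption, since the panel does not name its metric.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Label a query by majority vote among its k nearest training points.

    train is a list of ((x, y), label) pairs; distance is Euclidean.
    Returns (winning label, vote counts).
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], dict(votes)

train = [((0.0, 0.0), "A"), ((0.2, 0.1), "A"), ((0.1, 0.3), "A"),
         ((1.0, 1.0), "B"), ((0.9, 1.2), "B"), ((1.1, 0.8), "B")]

label_near_a, votes_near_a = knn_predict(train, (0.3, 0.2), k=3)
label_near_b, votes_near_b = knn_predict(train, (1.0, 0.9), k=3)
label_k5, votes_k5 = knn_predict(train, (0.3, 0.2), k=5)
print(label_near_a, votes_near_a)
print(label_near_b, votes_near_b)
print(label_k5, votes_k5)
```

With two classes, an odd $k$ avoids tied votes. With more classes or an even $k$, ties are possible, and this sketch resolves them by whichever tied label was counted first.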

6. Logistic regression — a linear classifier

A single-unit model with a sigmoid output predicts $\hat p=\sigma(w_1x+w_2y+b)$, where $\sigma(z)=1/(1+e^{-z})$, and learns by gradient descent on the cross-entropy loss. The line is the $\hat p=0.5$ decision boundary; the shading is the predicted probability.

learning rate $\eta$ 0.30

epoch 0

cross-entropy—

accuracy—
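
For cross-entropy loss with a sigmoid output, the gradient with respect to the pre-activation $z=w_1x+w_2y+b$ is simply $\hat p-t$, where $t\in\{0,1\}$ is the true label. The sketch below uses that fact for full-batch gradient descent. The six separable points, the epoch count and all function names are assumptions; only the learning rate 0.30 comes from the panel default.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(params, point):
    """Predicted probability of class 1 for a 2-D point."""
    w1, w2, b = params
    return sigmoid(w1 * point[0] + w2 * point[1] + b)

def train(points, labels, eta=0.30, epochs=2000):
    """Full-batch gradient descent on the mean cross-entropy loss."""
    w1 = w2 = b = 0.0
    m = len(points)
    for _ in range(epochs):
        g1 = g2 = gb = 0.0
        for (x, y), t in zip(points, labels):
            err = sigmoid(w1 * x + w2 * y + b) - t   # dLoss/dz = p_hat - t
            g1 += err * x
            g2 += err * y
            gb += err
        w1 -= eta * g1 / m
        w2 -= eta * g2 / m
        b -= eta * gb / m
    return w1, w2, b

def cross_entropy(params, points, labels):
    """Mean cross-entropy of the predictions against 0/1 labels."""
    total = 0.0
    for p, t in zip(points, labels):
        q = predict(params, p)
        total -= t * math.log(q) + (1 - t) * math.log(1.0 - q)
    return total / len(points)

# Assumed toy data: class 0 near the origin, class 1 further out.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (2.0, 2.0), (2.0, 3.0), (3.0, 2.0)]
labels = [0, 0, 0, 1, 1, 1]

params = train(points, labels)
accuracy = sum((predict(params, p) >= 0.5) == bool(t)
               for p, t in zip(points, labels)) / len(points)
print("w1, w2, b:", [round(v, 3) for v in params])
print("cross-entropy:", round(cross_entropy(params, points, labels), 4))
print("accuracy:", accuracy)
```

On separable data like this the weights keep growing slowly with more epochs, because the loss can always be reduced a little further by making the predictions more confident.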

7. Decision tree — recursive splitting

A tree greedily picks the axis-aligned split that most reduces Gini impurity, then recurses on each side until it reaches the depth limit. Deeper trees carve the plane into smaller, purer boxes and eventually start to overfit.

max depth 3

leaves—

root Gini—

train accuracy—
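
One level of the greedy search can be sketched as follows: compute the Gini impurity $1-\sum_c p_c^2$ of a node, then try every midpoint between adjacent feature values and keep the split with the lowest size-weighted impurity. The data, the midpoint candidate rule and the function names are assumptions; a full tree would call `best_split` again on each side until the depth limit.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum(p_c^2); 0 for a pure (or empty) node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(points, labels):
    """Return (feature index, threshold, weighted Gini) of the best split."""
    n = len(points)
    best = None
    for feature in range(len(points[0])):
        values = sorted({p[feature] for p in points})
        # Candidate thresholds: midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [t for p, t in zip(points, labels) if p[feature] <= threshold]
            right = [t for p, t in zip(points, labels) if p[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best

# Assumed toy data: the classes separate on the first axis only.
points = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (6.0, 2.0), (7.0, 5.0), (8.0, 1.0)]
labels = [0, 0, 0, 1, 1, 1]

root_gini = gini(labels)
split = best_split(points, labels)
print("root Gini:", root_gini)
print("best split (feature, threshold, weighted Gini):", split)
```

Here the root impurity is 0.5 (two balanced classes) and a single cut on the first axis produces two pure leaves, so the tree would stop at depth 1 regardless of the depth limit.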

8. k-means clustering

Unsupervised: assign each point to its nearest centroid, then move each centroid to the mean of its members. Repeat to convergence. Watch the inertia (within-cluster sum of squares) fall at each step.

clusters $k$ 3

iteration 0

inertia (WCSS)—

moved—
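
The two alternating steps can be sketched as follows. The six points, the deliberately poor starting centroids, the choice of $k=2$ and the function names are assumptions; leaving an empty cluster's centroid where it is is also just one possible convention.

```python
import math

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of its members."""
    moved = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            moved.append(old)   # assumption: an empty cluster stays put
    return moved

def inertia(points, labels, centroids):
    """Within-cluster sum of squared distances (WCSS)."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
centroids = [(0.0, 0.0), (1.0, 0.0)]   # assumed poor start, both in one blob

history = []
for _ in range(10):
    labels = assign(points, centroids)
    history.append(inertia(points, labels, centroids))
    new_centroids = update(points, labels, centroids)
    if new_centroids == centroids:     # nothing moved: converged
        break
    centroids = new_centroids

print("labels:", labels)
print("inertia per iteration:", [round(v, 4) for v in history])
```

Each step can only lower the inertia or leave it unchanged: reassignment picks the nearest centroid for every point, and the mean is the point that minimises the summed squared distance to its members. The result is still a local optimum that depends on the starting centroids.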

9. Principal component analysis

PCA finds the orthogonal directions of greatest variance. The first principal component (drawn solid) is the eigenvector of the covariance matrix with the largest eigenvalue. Projecting onto it gives the best 1-D linear summary of the cloud: it keeps the most variance and, equivalently, has the smallest squared reconstruction error.

correlation 0.80 · show projection onto PC1

$\lambda_1,\,\lambda_2$—

PC1 angle—

variance kept—
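
For a 2-D cloud the whole computation is an eigendecomposition of a $2\times2$ covariance matrix. The sketch below uses NumPy; the five points and the function name `pca_2d` are assumptions, and it uses the sample covariance (dividing by $n-1$), which may differ from the panel's convention.

```python
import numpy as np

def pca_2d(points):
    """Return (eigenvalues descending, PC1 angle in degrees,
    fraction of variance kept by PC1, 1-D projection onto PC1)."""
    X = np.asarray(points, dtype=float)
    centred = X - X.mean(axis=0)
    cov = np.cov(centred, rowvar=False)          # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    pc1 = eigvecs[:, 0]
    # An eigenvector's sign is arbitrary, so report the angle modulo 180.
    angle = float(np.degrees(np.arctan2(pc1[1], pc1[0])) % 180.0)
    kept = float(eigvals[0] / eigvals.sum())
    projection = centred @ pc1
    return eigvals, angle, kept, projection

points = [(-2.0, -1.0), (-1.0, -1.0), (0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]  # assumed

eigvals, angle, kept, projection = pca_2d(points)
print("eigenvalues:", np.round(eigvals, 4))
print("PC1 angle (deg):", round(angle, 2))
print("variance kept:", round(kept, 4))
```

The variance of the projected coordinates equals $\lambda_1$, so the "variance kept" readout is $\lambda_1/(\lambda_1+\lambda_2)$.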

10. Confusion matrix & the threshold

A binary classifier outputs scores; a threshold turns them into labels. Slide it to trade false positives against false negatives, and read accuracy, precision, recall and $F_1$ live.

threshold 0.50

TP · FP · FN · TN—

accuracy—

precision—

recall—

$F_1$ score—
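
The four counts and the metrics derived from them can be sketched as follows. The eight scores and labels and the function names are assumptions, as is the convention that a score equal to the threshold counts as positive.

```python
def confusion(scores, labels, threshold):
    """Return (TP, FP, FN, TN); a score >= threshold is predicted positive."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1; undefined ratios are reported as 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / (tp + fp + fn + tn),
            "precision": precision, "recall": recall, "f1": f1}

scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.10]   # assumed scores
labels = [1, 1, 0, 1, 1, 0, 0, 0]                           # assumed truth

for threshold in (0.50, 0.35):
    counts = confusion(scores, labels, threshold)
    print(threshold, counts, metrics(*counts))
```

Lowering the threshold from 0.50 to 0.35 on this data turns the one false negative into a true positive (recall rises to 1) at the cost of one more false positive (precision falls to 2/3).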

11. ROC curve & AUC

Sweeping the threshold traces the receiver-operating-characteristic curve: true-positive rate vs false-positive rate. The area under it (AUC) is the probability the model ranks a random positive above a random negative. Better separation pushes the curve toward the top-left.

class separation 1.5 · threshold 0.50

TPR (sensitivity)—

FPR (1−specificity)—

AUC—
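
The two readings of AUC described above, area under the curve and probability of a correct ranking, can be checked against each other. In this sketch the scores, labels and function names are assumptions, and tied scores count as half a win in the ranking version.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs from sweeping the threshold over every score, high to low."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]          # threshold above every score
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def auc_rank(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.10]   # assumed scores
labels = [1, 1, 0, 1, 1, 0, 0, 0]                           # assumed truth

curve = roc_points(scores, labels)
print("ROC points:", curve)
print("AUC (trapezoid):", auc_trapezoid(curve))
print("AUC (ranking):  ", auc_rank(scores, labels))
```

Both functions give 0.875 here: 14 of the 16 positive-negative pairs are ordered correctly. AUC depends only on the ordering of the scores, so moving the threshold slider changes the operating point on the curve but not the area.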