1. Train/test split & overfitting
Fit a polynomial of degree $d$ to noisy training points, then measure error on a held-out test set. Low degree under-fits; high degree chases the noise and the test error rises even as training error keeps falling — the signature of overfitting.
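The experiment described above can be sketched in plain Python. This is a minimal illustration under assumptions that are not part of the demo: the ground truth $\sin(3x)$, the noise level, the 20/20 split, the degrees tried, and the helper names `fit_poly` and `mse` are all hypothetical.

```python
import math
import random

def fit_poly(xs, ys, degree):
    """Least-squares coefficients (lowest power first) via the normal equations."""
    n = degree + 1
    # A = V^T V and rhs = V^T y, where V[i][j] = xs[i] ** j.
    A = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    rhs = [sum(x ** r * y for x, y in zip(xs, ys)) for r in range(n)]
    # Gaussian elimination with partial pivoting, then back-substitution.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            rhs[r] -= f * rhs[col]
    coeffs = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(A[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (rhs[r] - tail) / A[r][r]
    return coeffs

def mse(coeffs, xs, ys):
    """Mean squared error of the fitted polynomial on (xs, ys)."""
    preds = [sum(c * x ** j for j, c in enumerate(coeffs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [math.sin(3 * x) + rng.gauss(0, 0.2) for x in xs]  # assumed truth plus noise
# The points are drawn independently, so taking a prefix is a random split.
train_x, test_x = xs[:20], xs[20:]
train_y, test_y = ys[:20], ys[20:]

train_err, test_err = {}, {}
for degree in (1, 3, 8):
    coeffs = fit_poly(train_x, train_y, degree)
    train_err[degree] = mse(coeffs, train_x, train_y)
    test_err[degree] = mse(coeffs, test_x, test_y)
    print(degree, round(train_err[degree], 4), round(test_err[degree], 4))
```

Training error can only fall as the degree grows, because each lower-degree model is a special case of the higher-degree one. Whether the test error turns upward at a given degree depends on the sample and the noise.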
2. Gradient descent — linear regression
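Minimise the mean-squared-error loss $J(w,b)=\frac1m\sum_i (w x_i+b-y_i)^2$ by stepping downhill along the negative gradient: $w \leftarrow w-\eta\,\partial J/\partial w$ and $b \leftarrow b-\eta\,\partial J/\partial b$. Watch the line settle onto the data and the loss decay. If the learning rate $\eta$ is too large, each step overshoots the minimum and the loss diverges.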
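A minimal sketch of the update rule for the loss $J(w,b)$ above, using batch gradient descent. The toy data, the starting point $w=b=0$, the two learning rates, and the function name `gradient_descent` are assumptions chosen for illustration.

```python
def gradient_descent(xs, ys, eta, steps):
    """Fit y = w*x + b by batch gradient descent on the MSE loss J(w, b)."""
    w, b, m = 0.0, 0.0, len(xs)
    losses = []
    for _ in range(steps):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        losses.append(sum(r * r for r in residuals) / m)   # J before this step
        grad_w = 2 / m * sum(r * x for r, x in zip(residuals, xs))  # dJ/dw
        grad_b = 2 / m * sum(residuals)                             # dJ/db
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b, losses

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]   # noise-free line, so the optimum is w=2, b=1

w, b, losses = gradient_descent(xs, ys, eta=0.05, steps=2000)
_, _, diverging = gradient_descent(xs, ys, eta=0.2, steps=100)
print(round(w, 4), round(b, 4))
print(diverging[-1] > diverging[0])
```

For this particular data set the iteration is stable only for $\eta$ below roughly 0.15, which is why the second run with $\eta=0.2$ blows up.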
3. Bias–variance tradeoff
Expected squared test error decomposes as $\text{bias}^2+\text{variance}+\text{irreducible noise}$. As model complexity grows, bias typically falls while variance climbs. The sweet spot is the complexity that minimises their sum.
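Written out at a single input $x$ for squared-error loss, the decomposition reads as follows. The notation is assumed here for illustration: $f$ is the true function, $\hat f$ is the model fitted on a random training set, $y=f(x)+\varepsilon$ with zero-mean noise of variance $\sigma_\varepsilon^2$ independent of the training set, and expectations run over both training sets and noise.

$$\mathbb{E}\big[(y-\hat f(x))^2\big]=\underbrace{\big(\mathbb{E}[\hat f(x)]-f(x)\big)^2}_{\text{bias}^2}+\underbrace{\mathbb{E}\big[(\hat f(x)-\mathbb{E}[\hat f(x)])^2\big]}_{\text{variance}}+\underbrace{\sigma_\varepsilon^2}_{\text{irreducible noise}}$$

The cross terms vanish because $\varepsilon$ has zero mean and is independent of $\hat f$, and because $\hat f(x)-\mathbb{E}[\hat f(x)]$ has zero mean by construction.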
4. Feature scaling & standardisation
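Many algorithms, such as distance-based and gradient-based methods, assume comparable feature ranges. Standardisation $z=(x-\mu)/\sigma$ centres each axis at zero and scales it to unit variance; min–max scaling $x'=(x-x_{\min})/(x_{\max}-x_{\min})$ maps each axis to $[0,1]$. Watch the raw cloud transform and read off the new statistics.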
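Both transforms take only a few lines per feature. The sketch below assumes a non-constant feature (otherwise the denominators are zero) and uses the population standard deviation; the function names and sample values are hypothetical.

```python
import statistics

def standardise(values):
    """z = (x - mu) / sigma, using the population standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, population std 2
z = standardise(raw)
scaled = min_max(raw)
print(z)
print(min(scaled), max(scaled))
```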
5. k-nearest neighbours
A non-parametric classifier: a point is labelled by majority vote of its $k$ closest neighbours. The shaded background is the decision region. Click to drop a query point; raise $k$ to smooth the boundary.
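The voting rule can be sketched as below. Euclidean distance, the toy points, and the name `knn_predict` are assumptions; ties in the vote are broken by whichever label was seen first, so an odd $k$ is used for two classes.

```python
import math
from collections import Counter

def knn_predict(points, labels, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Class "A" sits near the origin, class "B" near (5, 5),
# plus one stray "B" point inside the "A" cluster.
points = [(0, 0), (1, 0), (0, 1), (0.5, 0.5), (5, 5), (6, 5), (5, 6)]
labels = ["A", "A", "A", "B", "B", "B", "B"]
query = (0.6, 0.6)

print(knn_predict(points, labels, query, k=1))  # follows the stray point
print(knn_predict(points, labels, query, k=3))  # outvoted by the surrounding cluster
```

With $k=1$ the single stray point decides the label; with $k=3$ its two "A" neighbours outvote it, which is the smoothing effect of raising $k$.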
6. Logistic regression — a linear classifier
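A perceptron-style model predicts $\hat p=\sigma(w_1x+w_2y+b)$, where $\sigma(z)=1/(1+e^{-z})$ is the logistic sigmoid (not the standard deviation of section 4), and learns by gradient descent on the cross-entropy loss. The line is the $\hat p=0.5$ decision boundary, where $w_1x+w_2y+b=0$; the shading is the predicted probability.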
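A minimal sketch of the training loop, assuming batch gradient descent, zero initial weights, and a small linearly separable toy set. The learning rate, step count, and function names are illustrative choices, not taken from the demo.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(points, targets, eta=0.1, steps=2000):
    """Gradient descent on the mean cross-entropy loss for p = sigmoid(w1*x + w2*y + b)."""
    w1 = w2 = b = 0.0
    m = len(points)
    for _ in range(steps):
        g1 = g2 = gb = 0.0
        for (x, y), t in zip(points, targets):
            err = sigmoid(w1 * x + w2 * y + b) - t  # d(loss)/d(logit) for cross-entropy
            g1 += err * x
            g2 += err * y
            gb += err
        w1 -= eta * g1 / m
        w2 -= eta * g2 / m
        b -= eta * gb / m
    return w1, w2, b

points = [(-2, -1), (-1, -2), (-1, -1), (-2, -2), (1, 2), (2, 1), (1, 1), (2, 2)]
targets = [0, 0, 0, 0, 1, 1, 1, 1]

w1, w2, b = train_logistic(points, targets)
probs = [sigmoid(w1 * x + w2 * y + b) for x, y in points]
print([round(p, 3) for p in probs])
```

The gradient has the simple form $(\hat p - t)\,x$ because the derivative of the sigmoid cancels against the derivative of the cross-entropy.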
7. Decision tree — recursive splitting
A tree greedily picks the axis-aligned split that most reduces Gini impurity, recursing until a depth limit. Deeper trees carve the plane into smaller, purer boxes — and start to overfit.
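One greedy step of the procedure above can be sketched as follows. Since the parent's impurity is fixed, the split that most reduces Gini impurity is the one that minimises the size-weighted impurity of the two children. Candidate thresholds at midpoints between adjacent feature values, and the names `gini` and `best_split`, are assumptions for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(points, labels):
    """Return (feature, threshold, weighted_gini) of the best axis-aligned split."""
    best = None
    for feature in range(len(points[0])):
        values = sorted({p[feature] for p in points})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2  # midpoint between adjacent distinct values
            left = [l for p, l in zip(points, labels) if p[feature] <= threshold]
            right = [l for p, l in zip(points, labels) if p[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best

points = [(1, 5), (2, 1), (3, 4), (6, 2), (7, 5), (8, 1)]
labels = [0, 0, 0, 1, 1, 1]
print(gini(labels))
print(best_split(points, labels))
```

A full tree would call `best_split` again on the left and right subsets until the depth limit is reached.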
8. k-means clustering
Unsupervised: assign each point to its nearest centroid, then move each centroid to the mean of its members. Repeat to convergence. Watch the inertia (within-cluster sum of squares) fall at each step.
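The two alternating steps can be sketched as below. The initial centroids, the stopping rule (stop once assignments no longer change), and the choice to leave an empty cluster's centroid in place are assumptions; the demo may initialise and terminate differently.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, max_iters=100):
    """Alternate assignment and update steps; record the inertia after each assignment."""
    centroids = [tuple(c) for c in centroids]
    history, assignment = [], None
    for _ in range(max_iters):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        history.append(sum(sq_dist(p, centroids[a]) for p, a in zip(points, new_assignment)))
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment, history

points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (9, 9)]
centroids, assignment, inertia = kmeans(points, centroids=[(0, 0), (1, 1)])
print(centroids)
print(inertia)
```

Neither step can increase the inertia, so the recorded sequence is non-increasing even from a poor start like this one, where both initial centroids lie in the same blob.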
9. Principal component analysis
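PCA finds the orthogonal directions of greatest variance. The first principal component (drawn solid) is the eigenvector of the covariance matrix with the largest eigenvalue. Projecting onto it gives the best 1-D linear summary of the cloud: among all lines through the mean, it retains the most variance and leaves the smallest squared reconstruction error.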
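For a 2-D cloud, the leading eigenpair of the covariance matrix has a closed form, so the first principal component can be sketched without a linear-algebra library. The sample covariance (dividing by $n-1$) and the function name are assumptions; the sign of the returned vector is arbitrary.

```python
import math

def first_principal_component(points):
    """Unit vector along the direction of greatest variance, and that variance."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Largest eigenvalue of the covariance matrix [[sxx, sxy], [sxy, syy]].
    lam = (sxx + syy + math.hypot(sxx - syy, 2 * sxy)) / 2
    if sxy != 0:
        vx, vy = sxy, lam - sxx  # solves (C - lam*I) v = 0
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)  # already axis-aligned
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), lam

points = [(-2, -4), (-1, -2), (0, 0), (1, 2), (2, 4)]  # all on the line y = 2x
direction, variance = first_principal_component(points)
print(direction, variance)
```

Because every point lies on $y=2x$, the component points along $(1,2)$ and carries all of the variance.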
10. Confusion matrix & the threshold
A binary classifier outputs scores; a threshold turns them into labels. Slide it to trade false positives against false negatives, and read accuracy, precision, recall and $F_1$ live.
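The metrics follow directly from the four confusion-matrix counts. The sketch below assumes the convention that a score at or above the threshold is predicted positive, and reports 0 for a metric whose denominator is zero; the function name and sample scores are hypothetical.

```python
def confusion_metrics(scores, labels, threshold):
    """Confusion-matrix counts and summary metrics for one threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "accuracy": (tp + tn) / len(labels),
        "precision": precision, "recall": recall, "f1": f1,
    }

scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
strict = confusion_metrics(scores, labels, threshold=0.5)
lenient = confusion_metrics(scores, labels, threshold=0.3)
print(strict)
print(lenient)
```

Lowering the threshold from 0.5 to 0.3 removes the false negative at the cost of a false positive, so recall rises while precision falls.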
11. ROC curve & AUC
Sweeping the threshold traces the receiver-operating-characteristic (ROC) curve: true-positive rate plotted against false-positive rate. The area under it (AUC) is the probability that the model scores a randomly chosen positive above a randomly chosen negative, so 0.5 corresponds to random ranking and 1 to perfect separation. Better separation pushes the curve toward the top-left.
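Both readings of AUC can be checked against each other on a small example: the trapezoidal area under the swept curve and the fraction of positive-negative pairs ranked correctly. The function names and scores are hypothetical, the "score at or above threshold is positive" convention is assumed, and tied pairs count as half.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by lowering the threshold through every distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(s >= thr and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= thr and y == 0 for s, y in zip(scores, labels))
        points.append((fp / neg, tp / pos))
    return points

def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

def auc_pairwise(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
curve = roc_points(scores, labels)
print(curve)
print(auc_trapezoid(curve), auc_pairwise(scores, labels))
```

Here three of the four positive-negative pairs are ordered correctly, and the area under the curve agrees.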