ml-lab pipeline · gradient descent · evaluation · core algorithms

1. Train/test split & overfitting

Fit a polynomial of degree $d$ to noisy training points, then measure error on a held-out test set. Low degree under-fits; high degree chases the noise and the test error rises even as training error keeps falling — the signature of overfitting.

Readouts: train RMSE, test RMSE, verdict.
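The experiment can be sketched in plain Python. This is a minimal illustration, not the page's own implementation: the target function $\sin(3x)$, the noise level, the sample sizes and the helper names `fit_polynomial`, `predict` and `rmse` are all assumptions made for the example.

```python
import math
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (A^T A) c = A^T y."""
    n = degree + 1
    # Augmented matrix [A^T A | A^T y] for the monomial basis 1, x, x^2, ...
    M = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = s / M[i][i]
    return coeffs  # coeffs[k] multiplies x**k

def predict(coeffs, x):
    return sum(c * x ** k for k, c in enumerate(coeffs))

def rmse(coeffs, xs, ys):
    return math.sqrt(sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

rng = random.Random(0)  # fixed seed so the run is repeatable

def noisy_target(x):
    # Assumed ground truth plus Gaussian noise.
    return math.sin(3 * x) + rng.gauss(0, 0.2)

x_train = [-1 + 2 * i / 14 for i in range(15)]        # 15 training inputs
y_train = [noisy_target(x) for x in x_train]
x_test = [-0.95 + 1.9 * i / 99 for i in range(100)]   # 100 held-out inputs
y_test = [noisy_target(x) for x in x_test]

results = {}
for degree in (1, 3, 9):
    coeffs = fit_polynomial(x_train, y_train, degree)
    results[degree] = (rmse(coeffs, x_train, y_train), rmse(coeffs, x_test, y_test))
    print(f"degree {degree}: train RMSE {results[degree][0]:.3f}, test RMSE {results[degree][1]:.3f}")
```

Training RMSE cannot increase with degree, because each higher-degree model contains the lower-degree ones. Whether test RMSE turns upward at a given degree depends on the noise draw, so the printed numbers are worth comparing across seeds.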

2. Gradient descent — linear regression

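Minimise the mean-squared-error loss $J(w,b)=\frac1m\sum_i (w x_i+b-y_i)^2$ by stepping downhill along the negative gradient: $w \leftarrow w-\eta\,\partial J/\partial w$ and $b \leftarrow b-\eta\,\partial J/\partial b$, where $\partial J/\partial w=\frac2m\sum_i (w x_i+b-y_i)\,x_i$ and $\partial J/\partial b=\frac2m\sum_i (w x_i+b-y_i)$. Watch the line settle onto the data and the loss decay. If the learning rate $\eta$ is too large, the updates overshoot and the loss diverges.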

Readouts: iteration, $w,\,b$, MSE loss.
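A minimal sketch of the update rule, assuming noise-free data on the line $y=2x+1$ and a start at $w=b=0$. The function names and the two learning rates are chosen for illustration only.

```python
def mse(w, b, xs, ys):
    """Mean-squared-error loss J(w, b)."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, eta, steps):
    """Batch gradient descent from w = b = 0; returns (w, b, loss history)."""
    w, b = 0.0, 0.0
    m = len(xs)
    history = [mse(w, b, xs, ys)]
    for _ in range(steps):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 / m * sum(r * x for r, x in zip(residuals, xs))  # dJ/dw
        grad_b = 2 / m * sum(residuals)                             # dJ/db
        w -= eta * grad_w
        b -= eta * grad_b
        history.append(mse(w, b, xs, ys))
    return w, b, history

# Assumed toy data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b, losses = gradient_descent(xs, ys, eta=0.1, steps=2000)
print(f"eta=0.1: w={w:.4f}, b={b:.4f}, final loss={losses[-1]:.2e}")

_, _, diverging = gradient_descent(xs, ys, eta=0.2, steps=50)
print(f"eta=0.2: loss went from {diverging[0]:.2f} to {diverging[-1]:.2e}")
```

For this data the largest curvature of $J$ is about 13.4, so steps are stable only for $\eta$ below roughly $2/13.4\approx0.15$. That is why $\eta=0.1$ converges while $\eta=0.2$ blows up.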

3. Bias–variance tradeoff

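Under squared-error loss, the expected test error decomposes as $\text{bias}^2+\text{variance}+\text{irreducible noise}$. As model complexity grows, bias falls but variance climbs. The sweet spot is the complexity that minimises their sum; the noise term is a floor that no model can remove.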

Readouts: bias², variance, total error, position of the optimum.
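The decomposition can be estimated by simulation. The sketch below swaps in $k$-nearest-neighbour regression as the model family, since its complexity is controlled by one number (small $k$ is flexible, large $k$ is rigid). That choice, the target $\sin(2\pi x)$, the noise level and the query point $x_0=0.25$ are all assumptions for illustration.

```python
import math
import random

def true_f(x):
    # Assumed ground-truth function for the illustration.
    return math.sin(2 * math.pi * x)

def knn_predict(xs, ys, x0, k):
    """Average the targets of the k training inputs closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

def bias_variance(k, x0=0.25, noise=0.3, n=21, trials=2000, seed=0):
    """Estimate bias^2 and variance of the prediction at x0 over many noisy training sets."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]  # fixed input grid on [0, 1]
    preds = []
    for _ in range(trials):
        ys = [true_f(x) + rng.gauss(0, noise) for x in xs]
        preds.append(knn_predict(xs, ys, x0, k))
    mean_pred = sum(preds) / trials
    bias_sq = (mean_pred - true_f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / trials
    return bias_sq, variance

noise = 0.3
for k in (1, 5, 21):
    b2, var = bias_variance(k, noise=noise)
    total = b2 + var + noise ** 2  # add the irreducible noise term
    print(f"k={k:2d}  bias^2={b2:.4f}  variance={var:.4f}  total={total:.4f}")
```

With these settings $k=1$ has almost no bias but the full noise variance, $k=21$ averages every point and is dominated by bias, and the intermediate $k=5$ gives the smallest sum of the three.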

4. Feature scaling & standardisation

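Many algorithms, especially distance-based and gradient-based ones, work best when features have comparable ranges. Standardisation $z=(x-\mu)/\sigma$ centres each axis and scales it to unit standard deviation; min–max scaling $x'=(x-x_{\min})/(x_{\max}-x_{\min})$ maps each axis to $[0,1]$. Watch the raw point cloud transform and read the new statistics.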

Readouts: mean (x, y), std (x, y), range x, range y.
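Both transforms fit in a few lines of standard-library Python. The data values below are made up, and the sketch assumes each axis is non-constant (otherwise $\sigma$ or the range is zero and the division fails).

```python
import statistics

def standardise(values):
    """z = (x - mu) / sigma, using the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical 2-D cloud whose axes have very different ranges.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
ys = [100.0, 300.0, 200.0, 400.0, 250.0, 350.0, 150.0, 500.0]

for name, axis in (("x", xs), ("y", ys)):
    z = standardise(axis)
    u = min_max(axis)
    print(name, "standardised mean/std:",
          round(statistics.fmean(z), 6), round(statistics.pstdev(z), 6))
    print(name, "min-max range:", min(u), max(u))
```

Each axis is transformed independently, so after either scaling the two features contribute on a comparable footing to distances and gradients.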

5. k-nearest neighbours

A non-parametric classifier: a point is labelled by majority vote of its $k$ closest neighbours. The shaded background is the decision region. Click to drop a query point; raise $k$ to smooth the boundary.

Readouts: query class, votes, train accuracy.
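The voting rule can be sketched as follows, assuming Euclidean distance and a small hand-made training set. The function name `knn_classify` and the points are illustrative only.

```python
import math
from collections import Counter

def knn_classify(points, labels, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0], votes

# Hypothetical training set: class "a" near the origin, class "b" further out.
points = [(1, 1), (1, 2), (2, 1), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

for query in [(2.0, 2.0), (4.0, 4.0)]:
    label, votes = knn_classify(points, labels, query, k=3)
    print(query, "->", label, dict(votes))
```

With two classes an odd $k$ avoids tied votes. For even $k$ this sketch resolves a tie in favour of the label that appears first among the nearest neighbours.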


6. Logistic regression — a linear classifier

A single-neuron (perceptron-style) model predicts $\hat p=\sigma(w_1x+w_2y+b)$, where $\sigma(t)=1/(1+e^{-t})$ is the logistic sigmoid, and learns by gradient descent on cross-entropy loss. The line is the $\hat p=0.5$ decision boundary, where $w_1x+w_2y+b=0$; the shading is the predicted probability.

Readouts: epoch, cross-entropy, accuracy.
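A minimal sketch of the training loop, assuming full-batch gradient descent from zero weights on a tiny linearly separable data set. The learning rate, epoch count, data and function names are illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(points, labels, eta=0.1, epochs=500):
    """Full-batch gradient descent on mean cross-entropy; returns (w1, w2, b)."""
    w1 = w2 = b = 0.0
    m = len(points)
    for _ in range(epochs):
        g1 = g2 = gb = 0.0
        for (x, y), t in zip(points, labels):
            err = sigmoid(w1 * x + w2 * y + b) - t  # gradient of the loss w.r.t. the logit
            g1 += err * x
            g2 += err * y
            gb += err
        w1 -= eta * g1 / m
        w2 -= eta * g2 / m
        b -= eta * gb / m
    return w1, w2, b

def cross_entropy(params, points, labels):
    w1, w2, b = params
    total = 0.0
    for (x, y), t in zip(points, labels):
        p = sigmoid(w1 * x + w2 * y + b)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(points)

def accuracy(params, points, labels):
    w1, w2, b = params
    hits = sum((sigmoid(w1 * x + w2 * y + b) >= 0.5) == (t == 1)
               for (x, y), t in zip(points, labels))
    return hits / len(points)

# Hypothetical separable data: class 0 lower-left, class 1 upper-right.
points = [(-2, -1), (-1, -2), (-1, -1), (1, 1), (1, 2), (2, 1)]
labels = [0, 0, 0, 1, 1, 1]

params = train_logistic(points, labels)
print("w1, w2, b =", tuple(round(v, 3) for v in params))
print("cross-entropy:", round(cross_entropy(params, points, labels), 4))
print("accuracy:", accuracy(params, points, labels))
```

On separable data the weights keep growing slowly and the loss approaches zero without reaching it, so in practice training stops after a fixed number of epochs or with a regulariser (not shown here).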

7. Decision tree — recursive splitting

A tree greedily picks the axis-aligned split that most reduces Gini impurity, recursing until a depth limit. Deeper trees carve the plane into smaller, purer boxes — and start to overfit.

Readouts: leaves, root Gini, train accuracy.
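The greedy recursion can be sketched as below. It assumes candidate thresholds midway between neighbouring coordinate values and a stopping rule of "pure node or depth limit". The tuple-based tree encoding and the data are illustrative choices.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_c p_c^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(points, labels):
    """Return (weighted_gini, axis, threshold) of the best axis-aligned split, or None."""
    best = None
    for axis in (0, 1):
        values = sorted({p[axis] for p in points})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2  # candidate threshold midway between neighbours
            left = [t for p, t in zip(points, labels) if p[axis] <= thr]
            right = [t for p, t in zip(points, labels) if p[axis] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, axis, thr)
    return best

def build_tree(points, labels, max_depth, depth=0):
    """Leaves are class labels; internal nodes are (axis, threshold, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or gini(labels) == 0.0:
        return majority
    split = best_split(points, labels)
    if split is None:  # all points coincide, nothing left to split
        return majority
    _, axis, thr = split
    left = [(p, t) for p, t in zip(points, labels) if p[axis] <= thr]
    right = [(p, t) for p, t in zip(points, labels) if p[axis] > thr]
    return (axis, thr,
            build_tree([p for p, _ in left], [t for _, t in left], max_depth, depth + 1),
            build_tree([p for p, _ in right], [t for _, t in right], max_depth, depth + 1))

def predict(tree, point):
    while isinstance(tree, tuple):
        axis, thr, left, right = tree
        tree = left if point[axis] <= thr else right
    return tree

def accuracy(tree, points, labels):
    return sum(predict(tree, p) == t for p, t in zip(points, labels)) / len(labels)

# Hypothetical data: class "a" occupies only the lower-left corner.
points = [(1, 1), (2, 1), (1, 2), (4, 1), (4, 2), (1, 4), (2, 4), (4, 4)]
labels = ["a", "a", "a", "b", "b", "b", "b", "b"]

print("root Gini:", gini(labels))
for depth in (1, 2):
    tree = build_tree(points, labels, max_depth=depth)
    print("depth", depth, "train accuracy", accuracy(tree, points, labels))
```

One split cannot isolate a corner, so the depth-1 tree misclassifies two points; a second split on the other axis separates the classes completely.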

8. k-means clustering

Unsupervised: assign each point to its nearest centroid, then move each centroid to the mean of its members. Repeat to convergence. Watch the inertia (within-cluster sum of squares) fall at each step.

Readouts: iteration, inertia (WCSS), moved.
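The two alternating steps can be sketched as follows, assuming the initial centroids are supplied by the caller and that an empty cluster simply keeps its old centroid. Both are simplifications made for this example, as are the data and function names.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm; returns (centroids, labels, inertia after each assignment)."""
    history = []
    for _ in range(max_iter):
        labels = assign(points, centroids)
        history.append(sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels)))
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # empty cluster: leave the centroid in place
        if updated == centroids:     # no centroid moved, so we have converged
            break
        centroids = updated
    return centroids, labels, history

# Hypothetical data with two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels, history = kmeans(points, [(0.0, 0.0), (5.0, 5.0)])

print("centroids:", centroids)
print("labels:", labels)
print("inertia per iteration:", history)
```

Neither step can raise the inertia, which is why the recorded sequence never increases. The result still depends on the starting centroids, so several restarts are commonly used in practice.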

9. Principal component analysis

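PCA finds the orthogonal directions of greatest variance. The first principal component (drawn solid) is the eigenvector of the covariance matrix with the largest eigenvalue $\lambda_1$. Projecting onto it gives the 1-D linear summary of the cloud that retains the most variance, a fraction $\lambda_1/(\lambda_1+\lambda_2)$ of the total in two dimensions.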

Readouts: $\lambda_1,\,\lambda_2$, PC1 angle, variance kept.
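For a 2-D cloud the eigen-decomposition has a closed form, so the three readouts can be computed without a linear-algebra library. The sketch assumes the sample covariance (divisor $n-1$); the function name `pca_2d` and the data are made up for illustration.

```python
import math

def pca_2d(points):
    """Return (lambda1, lambda2, PC1 angle in degrees, fraction of variance kept)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix: half-trace plus or minus a radius.
    half_trace = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    lam1, lam2 = half_trace + radius, half_trace - radius
    # Orientation of the leading eigenvector, measured from the x-axis.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return lam1, lam2, math.degrees(angle), lam1 / (lam1 + lam2)

# Hypothetical elongated cloud tilted away from the axes.
points = [(-2, -1), (-1, -1), (0, 0), (1, 1), (2, 1)]
lam1, lam2, angle_deg, kept = pca_2d(points)
print(f"lambda1={lam1:.4f}, lambda2={lam2:.4f}")
print(f"PC1 angle={angle_deg:.2f} degrees, variance kept={kept:.1%}")
```

The eigenvalues sum to the total variance $s_{xx}+s_{yy}$, which is why $\lambda_1/(\lambda_1+\lambda_2)$ reads as the share of variance the 1-D projection retains.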

10. Confusion matrix & the threshold

A binary classifier outputs scores; a threshold turns them into labels. Slide it to trade false positives against false negatives, and read accuracy, precision, recall and $F_1$ live.

Readouts: TP · FP · FN · TN, accuracy, precision, recall, $F_1$ score.
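A short sketch of how the four counts and the derived metrics follow from a threshold. It assumes a score at or above the threshold is labelled positive and that undefined ratios are reported as 0; the scores and labels are made up.

```python
def confusion(scores, labels, threshold):
    """Count (TP, FP, FN, TN) when scores >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for score, truth in zip(scores, labels):
        predicted = score >= threshold
        if predicted and truth == 1:
            tp += 1
        elif predicted and truth == 0:
            fp += 1
        elif not predicted and truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1; empty denominators give 0.0 by convention."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical classifier scores with their true labels (1 = positive).
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]

for threshold in (0.2, 0.5, 0.8):
    counts = confusion(scores, labels, threshold)
    acc, prec, rec, f1 = metrics(*counts)
    print(f"t={threshold}: TP,FP,FN,TN={counts}  "
          f"acc={acc:.2f} prec={prec:.2f} rec={rec:.2f} F1={f1:.2f}")
```

Lowering the threshold to 0.2 catches every positive (recall 1.0) at the cost of two false positives, while raising it to 0.8 removes the false positives but misses two of the three positives.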

11. ROC curve & AUC

Sweeping the threshold traces the receiver-operating-characteristic curve: true-positive rate vs false-positive rate. The area under it (AUC) is the probability the model ranks a random positive above a random negative. Better separation pushes the curve toward the top-left.

Readouts: TPR (sensitivity), FPR (1−specificity), AUC.
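The two readings of AUC, area under the curve and ranking probability, can be checked against each other on a toy example. The sketch assumes both classes are present and counts tied scores as half a win; the data and function names are illustrative.

```python
def roc_curve(scores, labels):
    """(FPR, TPR) points obtained by sweeping the threshold from high to low."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def auc_ranking(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores with their true labels (1 = positive).
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]

curve = roc_curve(scores, labels)
print("ROC points:", [(round(f, 3), round(t, 3)) for f, t in curve])
print("AUC (trapezoid):", round(auc_trapezoid(curve), 4))
print("AUC (ranking):  ", round(auc_ranking(scores, labels), 4))
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so both computations give $8/9\approx0.889$.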