AI: Statistical Learning & PredictionCourse structure & syllabus

Course Structure

A full, syllabus-driven map of AI: Statistical Learning & Prediction — every module and all 30 sessions, with the core idea, the key formula, and pointers to the matching interactive demo. Use this page to study ahead, review, and connect the live visualisations on the demos page to where they appear in the course.

CodeAISLP-CSAI.3.M.B
DegreeBCSAI (Year 3)
AreaComputer Science
Credits6.0 ECTS
Sessions30 (live in-person)
Semester2nd · AY 25–26
CategoryCompulsory
LanguageEnglish
Workload150 hours
ProfessorLuciano Dyballa

Overview

This course introduces some of the fundamental algorithms in machine learning, with an emphasis on their theoretical foundations and underlying mathematical principles. Examples using a variety of datasets are presented to build intuition for how the different methods work, and assignments give hands-on experience applying the methods to different types of data. Topics include nearest neighbours, neural networks, support vector machines, trees, clustering, dimensionality reduction, and generative models. Many lectures include a short coding demo in Python.

Tentative pacing. The syllabus notes that the topic schedule is tentative: the instructor aims to cover all listed topics (and may add advanced material), but the pace adapts to group performance, so some sessions may shift.

Format

30 live in-person sessions combining theory, worked examples, and short Python coding demos. Most new techniques are followed immediately by one or more examples.

Prerequisites

Calculus for CS · Matrices & Linear Transformations · Probability for CS · Machine Learning Foundations · Computer Programming II.

Instructor

Luciano Dyballa, Assistant Professor (School of Science & Technology). Ph.D. Computer Science, Yale — machine learning, vision, computational neuroscience. Contact: ldyballa@faculty.ie.edu (office hours on request).

Companion demos

The interactive demos page mirrors this syllabus: k-NN, regression, bias–variance, gradient descent, MLP/backprop, trees, bagging, random forest, AdaBoost, kernels, SVM, PCA, MDS, k-means, hierarchical clustering and more — all running client-side.

Prerequisites & how this course connects

Each prerequisite is load-bearing: the course is built on their machinery, and the table below makes the dependency explicit so you can refresh the right tool before the session that needs it.

Calculus for CSGradients, the chain rule and partial derivatives — the engine behind gradient descent (S4) and backpropagation (S5). If $\nabla$ and $\frac{\partial}{\partial w}$ feel rusty, review before Module 1.
Matrices & Linear TransformationsVectors, matrix products, projections, eigenvalues/eigenvectors and quadratic forms — needed for the MLP forward pass, the SVM $\lVert w\rVert$, and PCA's eigendecomposition (S22).
Probability for CSDistributions, expectation, variance and conditional probability — the language of cross-entropy, the bias–variance decomposition (S13), and generative models (S26–27).
Machine Learning FoundationsThe supervised-learning frame, train/test splits and basic evaluation — assumed from Session 1, then deepened with theory and mathematics.
Computer Programming IIPython proficiency (NumPy-style array thinking) — every coding demo and problem set implements algorithms in Python.
Weekly study load. The 6-ECTS / 150-hour budget works out to roughly 5 hours of independent work per session beyond class time: about 1.5 h on readings, 2.5 h on the problem set / coding, and 1 h reviewing the previous session and preparing the next. Front-load Module 1 (the neural-network mathematics compounds) and protect time before Sessions 8 and 30. Treat the interactive demos as active recall: predict what a slider will do, then check.

Learning Objectives

By the end of the course, students should be able to:

Teaching Methodology

Before each class, students prepare assignments and readings at home. Lectures combine theoretical explanation with practical examples and a short coding demonstration. Active in-class participation is essential to acquiring the skills to understand, implement, and apply each algorithm. Problem sets build intuition behind the theory and the coding skills to implement the algorithms; brief quizzes throughout the semester check understanding of previously taught material and let the instructor track the class's progress.

Learning-activity weighting & estimated time (150 h total)

Lectures
30.0%
Individual studying
30.0%
Exercises / async / field work
20.0%
Discussions
13.3%
Group work
6.7%

Estimated student time: Lectures 45 h · Individual studying 45 h · Exercises/async/field work 30 h · Discussions 20 h · Group work 10 h.

GenAI policy. Generative-AI tools may be used for some specific tasks/assignments with appropriate acknowledgment, but not in group projects, quizzes, or exams. Inappropriate use of AI-generated content is treated as academic misconduct and may result in failing the assignment or the course. The syllabus provides a suggested acknowledgment template; acknowledging AI does not affect your grade. All written work (including code) must be your own — credit any hints or solutions from peers or outside sources.

Assessment

Continuous evaluation across four components:

Final Exam
40%
Midterm Exam
25%
Group Project
25%
Class Participation
10%

Final Exam

40%

Comprehensive, in Session 30. You must score at least 3.5 on it to pass the course overall — even if your other assessments would otherwise be passing. Format: closed-book, on-campus; mixes short conceptual questions, derivations, and "choose-and-justify" model-selection problems. Tip: practise re-deriving the key formulas on this page from scratch rather than memorising them.

Midterm Exam

25%

Held in Session 8, covering the foundations and neural-network material from Sessions 1–7. Format: closed-book; expect a hand-worked gradient-descent / backprop step and reasoning about over/under-fitting. Tip: the MLP and gradient-descent demos let you sanity-check the mechanics you will be asked to reproduce.

Group Project

25%

A written report plus an in-class presentation (Sessions 28–29). Deliverables: a report (problem, data, methods, evaluation, discussion) and slides. GenAI may not be used. Tip: pick a dataset early, justify every modelling choice, and report honest baselines and error bars.

Class Participation

10%

Selected exercises turned in throughout the course (may be presented in class), plus active participation, questions/remarks, punctuality, and class conduct. Tip: turn in every announced exercise even when imperfect — consistency and engagement are what is rewarded.

What strong work looks like — rubric per component

ComponentWhat earns top marksCommon point-losers
Final / MidtermCorrect derivations with each step justified; precise definitions; model choices argued from bias–variance and the data, not by name-dropping.Memorised formulas applied to the wrong setting; skipped algebra; "it works well" with no reasoning.
Group reportClear problem framing, principled train/validation/test protocol, baselines, appropriate metrics with uncertainty, honest discussion of limitations.Test-set leakage, single metric with no baseline, unjustified hyperparameters, results without error analysis.
PresentationTight narrative, one idea per slide, a readable figure of the result, clean handling of questions, all members contributing.Wall-of-text slides, no visual of the result, running over time, one person presenting everything.
ParticipationExercises submitted on time, thoughtful questions and remarks, helping peers reason, punctuality.Missing submissions, passive attendance, lateness.

Pass, attendance & re-sit rules

Program — all 30 sessions

The 30 sessions are grouped below into nine thematic modules. Each session lists its objective, per-topic explanations, the core formula or definition where the material is technical, a "key idea" takeaway, suggested readings, and links to the matching interactive demo.

Module 1

Foundations & Neural Networks

Sessions 1–6 · from nearest neighbours to deep networks

Establishes the supervised-learning setting and quickly builds the neural-network toolkit: linear classifiers and the perceptron, the multi-layer perceptron, the loss/activation choices that shape training, and the optimisation machinery (gradient descent and backpropagation) that makes deep networks work.

Module learning outcomes
  • Explain the curse of dimensionality and when distance-based methods break down.
  • Derive the perceptron update and connect it to linear decision boundaries.
  • Build and train an MLP, choosing appropriate loss and activation functions.
  • Run gradient descent and backpropagation by hand on a small network.
1

Introduction · Nearest neighbours · Curse of dimensionality

Objective: frame the supervised-learning problem and meet the simplest non-parametric classifier.

Introduction

The learning setting: features $x$, targets $y$, a hypothesis class, and the goal of generalising to unseen data. Regression vs. classification; training vs. test error.

Nearest neighbours

A lazy, non-parametric rule: predict using the $k$ closest training points under a distance metric. No training phase — all computation happens at query time. Small $k$ gives a jagged, high-variance boundary; large $k$ smooths but biases it.

$\hat{y}(x)=\text{majority vote of }\{y_i : x_i \in N_k(x)\},\quad d(x,x_i)=\lVert x-x_i\rVert_2$k-NN prediction: vote among the $k$ nearest neighbours under Euclidean (or other) distance.
Curse of dimensionality

In high dimensions, data becomes sparse and almost all points are roughly equidistant, so "nearest" loses meaning and volume concentrates in thin shells — distance-based methods degrade and require exponentially more data. Concretely, to keep a fixed fraction $r$ of the data within a neighbourhood in $p$ dimensions, the neighbourhood's edge must span $r^{1/p}$ of each axis: capturing 10% of the data needs edge length $0.10^{1/10}\approx 0.79$ — 79% of the whole range, so the "local" neighbourhood is almost global.

Worked mini-exampleTraining points $x_1=(0,0)\!\to\!\text{A}$, $x_2=(2,0)\!\to\!\text{B}$, $x_3=(0,2)\!\to\!\text{B}$. Query $q=(0.5,0.5)$, $k=1$: distances are $\sqrt{0.5}\approx0.71$, $\sqrt{2.5}\approx1.58$, $\sqrt{2.5}\approx1.58$, so the nearest is $x_1$ → predict A. With $k=3$ the vote is {A,B,B} → predict B: increasing $k$ flips and smooths the decision.
Pitfall: k-NN is scale-sensitive — a feature measured in thousands dominates the Euclidean distance over one measured in fractions. Always standardise features (zero mean, unit variance) before computing distances.
Key idea: k-NN trades zero training cost for expensive, distance-sensitive prediction — and that distance metric collapses as dimensionality grows.

Connects to: $k$ is a bias–variance dial (S13) — small $k$ = low bias/high variance; large $k$ = the reverse — and the distance breakdown motivates the dimensionality reduction of Module 7.

k-NN demo
  • ESL ch. 2 (statistical learning, k-NN, curse of dimensionality) — the foundational chapter; read §2.3–2.5 closely.
  • ESL §13.3 — k-nearest-neighbour classifiers (prototype methods & metric choice).
2

Review of linear classifiers · The perceptron

Objective: recall linear decision boundaries and meet the perceptron learning rule.

Linear classifiers

A hyperplane $w^\top x + b = 0$ separates the input space; the sign of the score decides the class. Geometry of weight vectors, bias/offset, and margins.

$\hat{y}=\operatorname{sign}(w^\top x + b)$Linear classifier: predict by the sign of the signed distance to the hyperplane.
The perceptron

An online, mistake-driven algorithm: on each misclassified example, nudge the weights toward correctness. Guaranteed to converge if the data is linearly separable.

$w \leftarrow w + \eta\,(y_i - \hat{y}_i)\,x_i$Perceptron update — applied only on misclassified points.
Worked mini-exampleStart $w=(0,0),\,b=0$, learning rate $\eta=1$. Point $x_1=(2,1),\,y_1=+1$: score $=0$, $\operatorname{sign}(0)$ is wrong, so $w\leftarrow w+1\cdot 2\cdot(2,1)=(4,2)$, $b\leftarrow 2$. Next $x_2=(-1,-1),\,y_2=-1$: score $=4(-1)+2(-1)+2=-4<0$ → correct, no update. The boundary has tilted toward separating the two points after a single mistake.
Pitfall: if the data is not linearly separable the perceptron never converges — it cycles forever. This non-convergence (and the lack of a margin) is exactly what the SVM in Module 3 fixes.
Key idea: the perceptron is the atomic building block of neural nets — a single linear unit with a step nonlinearity, trained by correcting its own mistakes.

Connects to: swap the step for a sigmoid and the mistake-driven update for a gradient and you have logistic regression (S4); stack many such units and you get the MLP (S3).

Logistic regression demo
  • ESL §4.5 — separating hyperplanes & the perceptron (geometry & the convergence theorem).
  • Nielsen ch. 1 — perceptrons & sigmoid neurons (why we soften the step).
3

Intro to neural nets · Multi-layer perceptron

Objective: stack neurons into layers to represent non-linear functions.

Intro to neural nets

Neurons as weighted sums passed through a nonlinearity; layers as composed transformations. A single linear layer cannot solve XOR because the four points $(0,0),(1,1)\!\to\!0$ and $(0,1),(1,0)\!\to\!1$ are not linearly separable — no single line splits them. The universal-approximation theorem guarantees that one hidden layer with enough units can approximate any continuous function, though depth often reaches the same accuracy with far fewer units.

Multi-layer perceptron (MLP)

A feed-forward network of fully-connected layers. Hidden layers learn intermediate representations, giving the universal-approximation capacity to carve curved boundaries.

$h^{(l)} = \sigma\!\left(W^{(l)} h^{(l-1)} + b^{(l)}\right)$Forward pass: each layer is an affine map followed by an elementwise activation $\sigma$.
Worked mini-example (XOR in two layers)A hidden layer with $h_1=\text{ReLU}(x_1+x_2)$ and $h_2=\text{ReLU}(x_1+x_2-1)$, then output $h_1-2h_2$, computes XOR exactly: input $(1,1)$ gives $h_1=2,h_2=1\Rightarrow 2-2=0$; input $(1,0)$ gives $h_1=1,h_2=0\Rightarrow 1$. The hidden layer re-represents the inputs so the output layer's job becomes linearly separable.
Pitfall: if every activation were linear, stacking layers would collapse to a single linear map $W_2W_1x$ — no extra power. The nonlinearity $\sigma$ is what makes depth meaningful.
Key idea: depth + nonlinearity = expressive power. Hidden units let the network bend straight boundaries into the shapes the data needs.

Connects to: the affine-then-activation pattern here is exactly what backprop (S5) differentiates layer by layer, and the activation choice drives the gradient pathologies of S6.

MLP & backprop demo Activation functions demo
  • Nielsen ch. 1–2 — feed-forward nets (incl. the visual universal-approximation argument in ch. 4).
  • Deep Learning ch. 6 — deep feed-forward networks (architecture & output units).
4

Loss functions, activation functions · Gradient descent

Objective: define what to minimise and how to start minimising it.

Loss functions

How wrong a prediction is: squared error (L2) for regression, cross-entropy/log-loss for classification, hinge for SVMs. L2 is outlier-sensitive; L1 is robust but non-smooth; Huber blends the two.

$\mathcal{L}_{\text{CE}} = -\frac{1}{n}\sum_i \big[y_i\log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\big]$Binary cross-entropy — the loss behind logistic regression and classification nets.
Activation functions

Sigmoid and tanh saturate (vanishing gradients); ReLU is cheap but can "die"; LeakyReLU/ELU/GELU/Swish are modern fixes that keep gradients flowing.

Gradient descent

Move parameters opposite the gradient of the loss. Learning rate controls step size — too large diverges, too small crawls.

$\theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal{L}(\theta)$Gradient-descent update with learning rate $\eta$.
Worked mini-example (one GD step)Minimise $\mathcal{L}(\theta)=\theta^2$ with $\eta=0.1$ from $\theta_0=4$. Gradient $\nabla\mathcal{L}=2\theta=8$, so $\theta_1=4-0.1\cdot8=3.2$; next $\theta_2=3.2-0.1\cdot6.4=2.56$. Each step multiplies $\theta$ by $(1-2\eta)=0.8$, converging geometrically. With $\eta=0.6$ the factor is $-0.2$ and it oscillates; with $\eta>1$ the factor exceeds 1 in magnitude and it diverges.
Pitfall: pairing a sigmoid output with squared-error loss flattens the gradient when the neuron saturates (the $\sigma'$ term vanishes), so learning stalls on confident-but-wrong predictions. Cross-entropy cancels that $\sigma'$ and keeps the gradient strong — which is why it is the default for classification.
Key idea: the loss defines the objective; the activation shapes the gradient landscape; the learning rate decides whether you reach the bottom of it.

Connects to: the learning-rate behaviour above scales to every network here; saturating activations foreshadow vanishing gradients (S6), and hinge loss reappears as the SVM objective (S10–11).

Loss functions demo Activations demo Gradient descent demo
  • Deep Learning §4.3, ch. 6.2 — gradient-based optimisation & output units (why output unit and loss are chosen together).
  • Nielsen ch. 3 — the cross-entropy cost (the learning-slowdown argument).
5

Backpropagation

Objective: compute gradients through a deep network efficiently.

Backpropagation

The chain rule applied layer-by-layer: a forward pass caches activations, then errors are propagated backward to give every weight's gradient in one sweep. This is what makes training deep nets tractable.

$\delta^{(l)} = \big(W^{(l+1)\top}\delta^{(l+1)}\big)\odot\sigma'\!\big(z^{(l)}\big),\qquad \frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} \, h^{(l-1)\top}$Backprop: recursively propagate the error signal $\delta$, then read off weight gradients.
Worked mini-example (chain rule)Tiny net $z=wx$, $a=\sigma(z)$, loss $\mathcal{L}=\tfrac12(a-y)^2$. Then $\frac{\partial\mathcal{L}}{\partial w}=(a-y)\cdot\sigma'(z)\cdot x$. With $x=1,w=0.5,y=0$: $z=0.5$, $a=\sigma(0.5)\approx0.62$, $\sigma'(z)=a(1-a)\approx0.24$, so the gradient is $0.62\cdot0.24\cdot1\approx0.15$ — one step of $\eta=0.1$ moves $w$ to $\approx0.485$. The three factors are exactly "error × local slope × input."
Pitfall: backprop needs the activations from the forward pass cached; recomputing or forgetting them is a classic bug. And the $\sigma'$ factor is the seed of vanishing gradients — multiply many sub-1 slopes through deep layers and the signal dies (S6).
Key idea: backprop is just the chain rule organised so that each gradient is computed exactly once — reverse-mode automatic differentiation.

Connects to: this is the algorithm every framework's .backward() implements, and the per-layer $\sigma'$ product directly explains the training pathologies of S6.

MLP & backprop demo
  • Nielsen ch. 2 — how backpropagation works (the four backprop equations, derived from scratch).
  • Deep Learning §6.5 — back-propagation (general computational graphs).
6

Vanishing/exploding gradients · Improving MLPs · Deep networks

Objective: understand why deep nets are hard to train and how to fix it.

Vanishing/exploding gradients

Repeated multiplication of derivatives through many layers shrinks or blows up the gradient signal, stalling or destabilising training — especially with saturating activations.

Improving MLPs

Practical remedies: better activations (ReLU family), careful weight initialisation (Xavier/He), normalisation, regularization (L2, dropout), and adaptive optimisers (momentum, Adam).

Deep networks

Why depth helps representationally (each layer composes features of the previous one, so functions that need exponentially many units in one wide layer can be expressed with linearly many in a deep stack), and the engineering needed to make many layers trainable in practice.

Worked mini-example (why gradients vanish)With sigmoids, $\sigma'\le0.25$. A backprop signal crossing 10 layers is scaled by at most $0.25^{10}\approx10^{-6}$ — effectively zero, so early layers barely learn. He-initialised ReLU keeps the expected gradient scale near 1 per layer, which is why deep ReLU nets train where deep sigmoid nets stall.
Pitfall: too-large weights or learning rate cause the mirror failure — exploding gradients and NaN losses. Gradient clipping and careful initialisation are the standard fixes.
Key idea: depth is powerful but fragile — activation choice, initialisation, normalisation, and optimiser turn an untrainable deep net into a working one.

Connects to: the $\sigma'$ product from backprop (S5) is the root cause here, and L2/dropout regularization links forward to the losses-and-regularizers theory of S12.

Activations demo SGD & softmax demo
  • Nielsen ch. 5 — why deep nets are hard to train (the unstable-gradient problem, quantified).
  • Deep Learning ch. 8 — optimisation for training deep models (momentum, Adam, normalisation).
Module 2

Consolidation & Midterm

Sessions 7–8 · review and first checkpoint

A consolidation session with exercises pulling together the neural-network material, followed by the midterm exam.

Module learning outcomes
  • Integrate the foundations + neural-network material into one mental model.
  • Apply the methods to fresh exercises under exam conditions.
7

Neural networks review · Exercises

Objective: consolidate Sessions 1–6 through worked problems.

Review & exercises

End-to-end recap: from the supervised setting and k-NN through perceptrons, MLPs, losses/activations, gradient descent, backprop, and the training pathologies of deep nets — practised on exercises.

Key idea: a single coherent pipeline — represent, score, lose, differentiate, update — underlies everything seen so far.
MLP demo Gradient descent demo
  • Review Nielsen ch. 1–3 & Deep Learning ch. 6
8

Midterm Exam

Objective: assess mastery of foundations and neural networks. Worth 25% of the overall grade.

Coverage

Material from Sessions 1–7: supervised learning, k-NN, curse of dimensionality, linear classifiers/perceptron, MLPs, loss & activation functions, gradient descent, backpropagation, and deep-network training issues.

Key idea: the midterm rewards conceptual derivations and the ability to reason about model behaviour, not just recall.
  • Cumulative review of all Module 1 readings
Module 3

Trees, Losses & the Bias–Variance Decomposition

Sessions 9–13 · classical supervised learning & theory

Moves to interpretable tree models and the linear SVM, then formalises the theory tying them together: the choice of loss and regularizer, and the bias–variance decomposition that explains over- and under-fitting.

Module learning outcomes
  • Grow and prune a decision tree using impurity-based splits.
  • Formulate the linear SVM as a margin-maximisation problem with slack.
  • Choose losses and regularizers and explain their effect on the solution.
  • Decompose expected test error into bias, variance, and noise.
9

Decision trees

Objective: learn axis-aligned, interpretable partitions of feature space.

Decision trees

Recursively split on the feature/threshold that most reduces impurity; each leaf predicts the majority class (or mean) of its region. Deep trees overfit; pruning / depth limits control complexity.

$\text{Gini}(t)=1-\sum_{c} p_{c}^2,\qquad H(t)=-\sum_{c} p_c\log p_c$Node impurity measures — splits are chosen to maximise the weighted impurity decrease.
Worked mini-example (a split's impurity gain)A node of 10 samples (5 +, 5 −) has $\text{Gini}=1-(0.5^2+0.5^2)=0.5$. A split sends {4+,1−} left and {1+,4−} right; each child has $\text{Gini}=1-(0.8^2+0.2^2)=0.32$. Weighted child impurity $=\tfrac{5}{10}(0.32)+\tfrac{5}{10}(0.32)=0.32$, so the impurity decrease is $0.5-0.32=0.18$ — the tree picks the feature/threshold maximising this gain.
Pitfall: a fully grown tree can memorise the training set (zero training error, terrible test error) and is unstable — a small data change can reshape it entirely. Control depth / minimum-leaf size, or prune. That very instability is what bagging exploits.
Key idea: trees are greedy, high-variance, perfectly interpretable — and the ideal base learner for the ensembles to come.

Connects to: the high variance here is precisely what random forests (S14) average away, and trees are the weak learners boosting (S15) reweights.

Decision tree demo
  • ESL §9.2 — tree-based methods (CART) (splitting, pruning & cost-complexity).
10

Linear SVMs

Objective: find the maximum-margin separating hyperplane.

Linear SVMs

Among all separating hyperplanes, choose the one that maximises the margin — the distance to the closest points (support vectors). Maximising the margin is minimising $\lVert w\rVert$.

$\min_{w,b}\tfrac12\lVert w\rVert^2 \;\text{ s.t. }\; y_i(w^\top x_i + b)\ge 1\;\;\forall i$Hard-margin SVM: the margin equals $2/\lVert w\rVert$.
Worked mini-example (margin width)One-dimensional points $x=-1\,(y{=}{-}1)$ and $x=+1\,(y{=}{+}1)$. The maximum-margin boundary is $x=0$ with $w=1,b=0$: the constraints $y_i(wx_i+b)\ge1$ are tight ($1\cdot1\ge1$), both points are support vectors, and the margin is $2/\lVert w\rVert=2$. Push the positive point to $x=+2$ and the optimal $w$ shrinks to $2/3$, widening the street.
Pitfall: the hard-margin SVM has no solution if the classes overlap even slightly — a single mislabelled point makes the constraints infeasible. This is the motivation for the soft margin in S11.
Key idea: only the support vectors matter; the widest "street" between classes is the most robust boundary.

Connects to: maximising the margin is minimising $\lVert w\rVert^2$ — the same L2 penalty seen in ridge regression (S12); the dual sets up kernels (Module 5).

SVM demo
  • ESL §12.1–12.2 — support vector classifiers (the optimal separating hyperplane).
11

Linear SVMs part II

Objective: handle non-separable data with soft margins and the dual.

Soft-margin SVM

Introduce slack variables $\xi_i$ to allow violations, controlled by penalty $C$ — large $C$ ≈ hard margin (low bias, high variance), small $C$ tolerates more errors for a wider margin. Equivalent to hinge loss + L2.

$\min_{w,b}\tfrac12\lVert w\rVert^2 + C\sum_i \xi_i \;\text{ s.t. }\; y_i(w^\top x_i+b)\ge 1-\xi_i,\;\xi_i\ge0$Soft-margin (C-SVM): trade margin width against classification violations.
Dual formulation

The Lagrangian dual expresses the solution purely through inner products of data points — setting up the kernel trick in Module 5. Only support vectors get non-zero dual coefficients $\alpha_i$, so the final classifier $f(x)=\sum_i\alpha_i y_i\langle x_i,x\rangle+b$ depends on a sparse subset of the data.

$\ell_{\text{hinge}}(y,f)=\max(0,\,1-y\,f(x))$Hinge loss: zero once a point is correct with margin $\ge1$, linear in the violation otherwise.
Worked mini-example (hinge loss)True label $y=+1$. If $f(x)=1.5$ the point is correct beyond the margin → loss $\max(0,1-1.5)=0$. If $f(x)=0.3$ it is correct but inside the margin → loss $0.7$. If $f(x)=-0.4$ it is misclassified → loss $1.4$. Soft-margin SVM = minimise mean hinge loss $+\,\tfrac{1}{2C}\lVert w\rVert^2$.
Pitfall: $C$ is inverse-regularization — large $C$ punishes every violation and overfits noisy data (low bias/high variance); small $C$ over-smooths. Tune it by cross-validation, never by training error.
Key idea: $C$ is the SVM's bias–variance dial, and the dual's reliance on inner products is exactly the door kernels walk through.

Connects to: "hinge + L2" is one instance of the loss+regularizer template formalised in S12; the inner-product form is the literal entry point for kernels (S17–18).

SVM demo (try the C slider)
  • ESL §12.2–12.3 — SVMs & the optimisation problem (slack variables, the dual, the C parameter).
12

Losses and regularizers

Objective: unify models through their loss + penalty.

Losses

A common lens: squared, logistic, hinge, absolute. Each defines a different notion of "wrong" and a different optimum.

Regularizers

Penalise coefficient magnitude to trade bias for reduced variance. Ridge (L2) shrinks smoothly; Lasso (L1) drives coefficients to exactly zero, performing feature selection.

$\hat{\beta}=\arg\min_\beta \;\lVert y - X\beta\rVert_2^2 + \lambda\,\lVert\beta\rVert_2^2 \;(\text{ridge}),\qquad +\,\lambda\lVert\beta\rVert_1\;(\text{lasso})$Penalised regression: $\lambda$ controls the strength of shrinkage.
Worked mini-example (ridge shrinkage)For orthonormal features the ridge solution is the OLS estimate scaled down: $\hat\beta_j^{\text{ridge}}=\hat\beta_j^{\text{OLS}}/(1+\lambda)$. So an OLS coefficient of $4$ with $\lambda=1$ becomes $2$; with $\lambda=3$ it becomes $1$. Lasso instead soft-thresholds: $\hat\beta_j=\operatorname{sign}(\hat\beta_j^{\text{OLS}})\max(0,|\hat\beta_j^{\text{OLS}}|-\lambda)$, so a coefficient of $0.5$ with $\lambda=0.5$ is set to exactly $0$ — that is feature selection.
Pitfall: regularizers assume comparable feature scales — always standardise first, or the penalty falls unevenly. Note too that the intercept is conventionally left un-penalised.
Key idea: "model = loss + regularizer." L2 shrinks; L1 selects. Tuning $\lambda$ is tuning complexity.

Connects to: the same L2 penalty is the SVM's $\lVert w\rVert^2$ (S10) and weight decay in nets (S6); choosing $\lambda$ is choosing a point on the bias–variance curve (S13) via cross-validation.

Regularization demo Loss functions demo
  • ESL §3.4 — shrinkage methods (ridge, lasso) (closed forms & the geometry of the L1/L2 balls).
13

Bias–variance decomposition

Objective: explain generalisation error formally.

Bias–variance decomposition

Expected test error splits into (squared) bias, variance, and irreducible noise. Simple models: high bias, low variance; complex models: low bias, high variance. The best complexity minimises their sum.

$\mathbb{E}\big[(y-\hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)]-f(x)\big)^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}[\hat{f}(x)]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$The fundamental decomposition of expected squared prediction error.
Worked mini-example (k-NN bias–variance)For k-NN regression with noise variance $\sigma^2$, the variance of the prediction is $\sigma^2/k$ and the bias grows with $k$ (averaging over ever-more-distant, ever-less-relevant neighbours). At $k=1$: variance $\sigma^2$, near-zero bias. At $k=n$: variance $\sigma^2/n$ but the prediction is the global mean — huge bias. The test-error minimum sits at an intermediate $k$.
Pitfall: you can only estimate this decomposition on held-out data — training error is a downward-biased estimate of test error and always favours the most complex model. Use cross-validation or a separate test set.
Key idea: you can't drive both bias and variance to zero — cross-validation finds the sweet spot in between.

Connects to: this single equation explains every "dial" in the course — $k$ (S1), tree depth (S9), $C$ and $\gamma$ (S10–18), $\lambda$ (S12) — and motivates why bagging (S14) attacks variance while boosting (S15) attacks bias.

Bias–variance demo Cross-validation demo
  • ESL §7.2–7.3 — bias, variance & model complexity (the decomposition, derived and applied).
  • ESL §7.10 — cross-validation (how to estimate test error honestly).
Module 4

Ensemble Methods

Sessions 14–16 · combining many learners

Three ways to combine weak/high-variance learners into a strong one: bagging (parallel, variance-reducing), boosting (sequential, bias-reducing), and stacking (meta-learning over heterogeneous models).

Module learning outcomes
  • Explain how bootstrap aggregation reduces variance, and how random forests de-correlate trees.
  • Derive the AdaBoost reweighting and its bias-reduction effect.
  • Build a stacked ensemble with a meta-learner.
14

Ensemble methods I: Bagging

Objective: reduce variance by averaging over bootstrap resamples.

Bagging

Train many high-variance learners on bootstrap samples of the data and average their predictions. Variance falls while bias stays roughly fixed.

$\hat{f}_{\text{bag}}(x)=\frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x)$Bagging: average over $B$ models fit on bootstrap resamples.
Random forests

Bagging + a random feature subset at each split de-correlates the trees, cutting ensemble variance further than bagging alone.

Worked mini-example (variance of an average)Averaging $B$ predictors each with variance $\sigma^2$ and pairwise correlation $\rho$ gives variance $\rho\sigma^2+\frac{1-\rho}{B}\sigma^2$. As $B\!\to\!\infty$ the second term vanishes but the floor $\rho\sigma^2$ remains — so reducing correlation $\rho$ (random forests' feature subsampling) matters as much as adding trees. Also note a bootstrap sample omits $\approx(1-1/n)^n\to 1/e\approx37\%$ of points, the out-of-bag set used for free validation.
Pitfall: bagging barely helps a low-variance, high-bias learner (e.g. a stump or linear model) — there is little variance to average away. It pays off for unstable learners like deep trees.
Key idea: averaging independent-ish predictors shrinks variance ∝ 1/B; the trick is keeping them as uncorrelated as possible.

Connects to: this is the bias–variance theory of S13 in action — bagging targets the variance term and leaves bias unchanged; contrast with boosting (S15), which targets bias.

Bagging demo Random forest demo
  • ESL §8.7 (bagging), ch. 15 (random forests) (de-correlation, OOB error, variable importance).
15

Ensemble methods II: Boosting

Objective: reduce bias by sequentially focusing on hard examples.

Boosting / AdaBoost

Fit weak learners (e.g. stumps) in sequence, re-weighting misclassified points so later learners concentrate on them; combine with weights tied to each learner's accuracy.

$\alpha_t=\tfrac12\ln\!\frac{1-\varepsilon_t}{\varepsilon_t},\qquad w_i\leftarrow w_i\,e^{-\alpha_t y_i h_t(x_i)}$AdaBoost: learner weight $\alpha_t$ from its error $\varepsilon_t$; sample weights up-weight mistakes.
Worked mini-example (an AdaBoost weight)A weak learner with weighted error $\varepsilon_t=0.3$ earns $\alpha_t=\tfrac12\ln\frac{0.7}{0.3}\approx0.42$. Misclassified points are multiplied by $e^{+0.42}\approx1.52$ and correct ones by $e^{-0.42}\approx0.66$ (then renormalised), so the next learner concentrates on what this one missed. A learner at chance ($\varepsilon=0.5$) gets $\alpha=0$ — no vote; one below chance gets a negative $\alpha$ (its predictions are flipped).
Pitfall: because boosting chases hard points, it can fit label noise and outliers if run too long — early stopping / shrinkage (a learning rate on $\alpha_t$) is the usual guard. Modern gradient boosting (XGBoost) generalises this to arbitrary differentiable losses.
Key idea: where bagging averages independent learners, boosting builds a dependent committee that fixes its own errors — driving down bias.

Connects to: AdaBoost is forward stagewise additive modelling under exponential loss (S12 loss view); it complements bagging's variance reduction (S14) on the bias side of S13.

AdaBoost demo
  • ESL ch. 10 — boosting & additive trees (AdaBoost as additive modelling, gradient boosting).
16

Ensemble methods III: Stacking

Objective: combine heterogeneous models with a meta-learner.

Stacking

Train several different base models, then train a meta-model on their (cross-validated) predictions to learn the best combination. Captures complementary strengths that a single family misses.

$\hat{y}=g\big(\hat{f}_1(x),\hat{f}_2(x),\dots,\hat{f}_M(x)\big)$Stacked generalisation: a meta-learner $g$ over base-model outputs.
Pitfall: the meta-learner must be trained on out-of-fold base predictions. If you feed it base models' in-sample predictions, it sees leaked information, over-trusts overfit bases, and collapses on new data.
Key idea: let a second-level model learn how to trust each base learner, rather than averaging them blindly.

Connects to: stacking pays off most when base models are diverse (a tree, an SVM, a net make different errors) — the same de-correlation principle that powers random forests (S14), now across model families.

Random forest demo Boosting demo
  • ESL §8.8 — model averaging & stacking (stacked generalisation & Bayesian model averaging).
Module 5

Kernel Methods

Sessions 17–18 · non-linearity without explicit features

Generalise linear methods to non-linear ones by implicitly mapping data to high-dimensional spaces. Kernels compute inner products in that space without ever forming the feature map.

Module learning outcomes
  • State the kernel trick and why it follows from the SVM dual.
  • Compare polynomial and RBF kernels and the role of their hyperparameters.
17

Kernel methods I

Objective: introduce the kernel trick.

The kernel trick

If an algorithm depends on data only through inner products, replace $x^\top x'$ with a kernel $K(x,x')=\langle\phi(x),\phi(x')\rangle$ — computing the high-dimensional inner product implicitly, no explicit $\phi$ needed.

$K(x,x')=\exp\!\big(-\gamma\lVert x-x'\rVert^2\big)\;(\text{RBF}),\qquad (x^\top x'+c)^d\;(\text{polynomial})$Two workhorse kernels; RBF $\gamma$ sets locality, polynomial $d$ sets the degree.
Worked mini-example (an explicit feature map)The polynomial kernel $K(x,z)=(x^\top z)^2$ in 2-D equals $\langle\phi(x),\phi(z)\rangle$ with $\phi(x)=(x_1^2,\sqrt2\,x_1x_2,x_2^2)$: expanding $(x_1z_1+x_2z_2)^2=x_1^2z_1^2+2x_1x_2z_1z_2+x_2^2z_2^2$ matches the dot product term by term. The kernel computes this 3-D inner product with one 2-D multiply-and-square — never building $\phi$. The RBF kernel corresponds to an infinite-dimensional $\phi$, which you could never form explicitly.
Pitfall: kernel methods scale with the number of samples, not features — the $n\times n$ Gram matrix costs $O(n^2)$ memory and the solve is up to $O(n^3)$, so plain kernel SVMs struggle past tens of thousands of points.
Key idea: non-linearly separable data often becomes linearly separable in a higher-dimensional lift — and kernels reach that space for free.

Connects to: this only works because the SVM dual (S11) touches data solely through inner products; the same trick kernelises ridge regression (S12), PCA (S22), and clustering.

Kernel trick demo
  • ESL §12.3 — SVMs & kernels (the kernelised SVM).
  • ESL §6.3 — kernel methods (kernels as local weighting).
18

Kernel methods II

Objective: apply kernels in the SVM and beyond.

Kernel SVMs

The kernelised dual SVM produces curved decision boundaries. RBF $\gamma$ trades locality (high $\gamma$ = wiggly, overfit) against smoothness; with $C$ it forms the model's two-knob bias–variance control.

Mercer's condition

Valid kernels correspond to positive-semidefinite Gram matrices, guaranteeing a well-posed convex problem. Kernels also extend to ridge regression, PCA, and clustering.

Worked mini-example (the $\gamma$ knob)For RBF, $K(x,x')=e^{-\gamma\lVert x-x'\rVert^2}$. At $\gamma=0.01$ two points 5 units apart still have similarity $e^{-0.25}\approx0.78$ — the kernel is broad and the boundary smooth. At $\gamma=5$ that similarity drops to $e^{-125}\approx0$ — each point influences only its immediate neighbourhood, giving a wiggly, high-variance boundary that can memorise the training set.
Pitfall: $C$ and $\gamma$ interact, so tune them jointly on a 2-D grid by cross-validation; optimising one then the other can miss the good region. A non-PSD "kernel" breaks the convexity guarantee entirely.
Key idea: the kernel is a similarity measure; choosing it is choosing the geometry in which your data is "simple."

Connects to: $(\gamma,C)$ is the two-knob bias–variance control of S13; Mercer's PSD condition is the same property a covariance matrix has, linking kernels to PCA (S22).

SVM demo (RBF / poly kernels)
  • ESL §12.3.3 — the SVM as a penalisation method (hinge loss + RKHS norm).
Module 6

Clustering

Sessions 19–21 · unsupervised structure discovery

Unsupervised learning: discover groups in unlabelled data. Covers centroid-based (k-means), hierarchical (agglomerative + dendrograms), and density/model-based approaches, plus how to choose the number of clusters.

Module learning outcomes
  • Run k-means and explain its objective and initialisation sensitivity.
  • Build and cut a dendrogram under different linkages.
  • Discuss density-based clustering and cluster-count selection.
19

Clustering I — k-means

Objective: partition data into k compact clusters.

k-means

Alternate between assigning points to the nearest centroid and recomputing centroids as cluster means — greedy descent on within-cluster sum-of-squares. Sensitive to initialisation; k-means++ seeds smartly.

$\min_{\{C_k\}}\sum_{k=1}^{K}\sum_{x\in C_k}\lVert x-\mu_k\rVert^2,\qquad \mu_k=\frac{1}{|C_k|}\sum_{x\in C_k}x$k-means objective (WCSS) and the centroid update.
Worked mini-example (one Lloyd iteration)Points $1,2,9,10$ on a line, $K=2$, init centroids $\mu_A=1,\mu_B=2$. Assign: $1\!\to\!A$; $2,9,10\!\to\!B$. Update: $\mu_A=1,\ \mu_B=(2{+}9{+}10)/3=7$. Re-assign: $1,2\!\to\!A$, $9,10\!\to\!B$; update $\mu_A=1.5,\mu_B=9.5$ — converged, WCSS $=0.5+0.5=1$. A bad init ($\mu_A=1,\mu_B=2$) needed two passes; k-means++ would have spread the seeds and converged faster.
Pitfall: k-means assumes roughly spherical, equal-size clusters and minimises Euclidean WCSS — it fails on elongated or nested shapes and is pulled by outliers. Standardise features, run several inits, and keep the lowest-WCSS solution.
Key idea: k-means is Lloyd's algorithm — fast and intuitive, but it only finds a local minimum, so initialisation matters.

Connects to: k-means is the hard-assignment limit of a Gaussian mixture fit by EM (S21); its spherical-cluster assumption is exactly what hierarchical (S20) and density methods relax.

k-means demo
  • ESL §14.3.6 — k-means clustering (the WCSS objective & Lloyd's algorithm).
20

Clustering II — hierarchical

Objective: build a nested clustering without fixing k.

Hierarchical (agglomerative) clustering

Start with each point as its own cluster and repeatedly merge the closest pair. The merge order forms a dendrogram; cut it at any height for a flat clustering.

$d_{\text{single}}(A,B)=\min_{a\in A,b\in B}\lVert a-b\rVert,\quad d_{\text{complete}}=\max,\quad d_{\text{Ward}}=\Delta\text{WCSS}$Linkage criteria define the inter-cluster distance and shape the merges.
Worked mini-example (a merge)Points $0,1,5$ on a line under single linkage. Closest pair is $\{0,1\}$ (distance 1) → merge. Now distance from cluster $\{0,1\}$ to $5$ is $\min(5,4)=4$, so $5$ merges last at height 4. The dendrogram has a join at height 1 and another at height 4; cutting between them yields two clusters $\{0,1\}$ and $\{5\}$.
Pitfall: linkage choice changes everything — single linkage "chains" and can string distinct clusters together through bridging points, while complete linkage forces compact, equal-diameter clusters. Ward's tends to give the most balanced result. Agglomerative clustering is also $O(n^2)$–$O(n^3)$, so it does not scale to large $n$.
Key idea: the dendrogram encodes a whole family of clusterings at once — you choose granularity by where you cut.

Connects to: unlike k-means (S19) you need not fix $K$ in advance — but you still need a rule (gap statistic, silhouette in S21) to choose the cut height.

Hierarchical clustering demo
  • ESL §14.3.12 — hierarchical clustering (linkages & dendrograms).
21

Clustering III — density & model-based, choosing k

Objective: cluster non-spherical data and pick the number of clusters.

Density-based & mixture models

Methods such as DBSCAN find arbitrarily-shaped clusters and label outliers; Gaussian mixture models give a soft, probabilistic clustering via EM.

Choosing k

Elbow on WCSS, silhouette score, and gap statistic help select the cluster count; validation guards against over-segmentation.

$s(i)=\dfrac{b(i)-a(i)}{\max\{a(i),b(i)\}}\in[-1,1]$Silhouette: $a(i)$ = mean intra-cluster distance, $b(i)$ = mean distance to the nearest other cluster.
Worked mini-example (reading a silhouette)A point sits at mean distance $a(i)=1$ from its own cluster and $b(i)=4$ from the next-nearest cluster: $s(i)=(4-1)/4=0.75$ — well clustered. If instead $a(i)=3,b(i)=3.2$, then $s(i)\approx0.06$ — the point is on a boundary and the chosen $K$ is suspect. Average $s$ over all points; pick the $K$ that maximises it.
Pitfall: DBSCAN's $\varepsilon$ and minPts are unintuitive and one global $\varepsilon$ fails when clusters have very different densities. GMM/EM, like k-means, only finds a local optimum and can collapse a component onto a single point (degenerate zero-variance).
Key idea: there is no single "correct" clustering — the right method and k depend on cluster shape, density, and your downstream goal.

Connects to: GMM's soft, probabilistic assignments are the generative-model view (S26); density estimation underlies anomaly detection.

Anomaly / density demo k-means demo
  • ESL §14.3 — cluster analysis (validity, gap statistic, mixture models).
Module 7

Dimensionality Reduction

Sessions 22–23 · compressing & visualising data

Project high-dimensional data into a few informative dimensions — linearly via PCA, and through distance-preserving embeddings (MDS) and non-linear manifold methods for visualisation.

Module learning outcomes
  • Derive PCA as variance maximisation / reconstruction-error minimisation.
  • Explain MDS and contrast it with non-linear embeddings (t-SNE / UMAP).
22

Dimensionality reduction I — PCA

Objective: find the directions of greatest variance.

Principal Component Analysis

Find orthogonal axes (eigenvectors of the covariance matrix) capturing maximal variance; project onto the top few for a low-dimensional, decorrelated representation. Equivalently minimises reconstruction error.

$\Sigma=\tfrac1n X^\top X,\qquad \Sigma v_k=\lambda_k v_k,\qquad \text{var. explained}=\frac{\lambda_k}{\sum_j\lambda_j}$PCA: eigendecomposition of the covariance; eigenvalues give variance explained.
Worked mini-example (variance explained)Suppose the covariance eigenvalues are $\lambda=(6,3,1)$. PC1 captures $6/10=60\%$ of the variance, PC1+PC2 capture $90\%$. Keeping two of three dimensions loses only 10% of the spread — a justified compression. For data stretched along the $45°$ line, the top eigenvector is $\tfrac{1}{\sqrt2}(1,1)$: PCA finds the diagonal automatically.
Pitfall: PCA is variance-driven, so it is scale-sensitive — standardise (or use the correlation matrix) or a single large-unit feature hijacks PC1. PCA is also unsupervised: the top components need not be the directions most useful for predicting $y$.
Key idea: PCA rotates to the axes where the data spreads most — the first few components keep the signal, the rest is mostly noise.

Connects to: PCA is the linear, global-variance counterpart of the manifold methods in S23, and a principled antidote to the curse of dimensionality (S1); the eigendecomposition reuses the linear-algebra prerequisite.

PCA demo
  • ESL §14.5 — principal components (variance maximisation = reconstruction-error minimisation).
23

Dimensionality reduction II — MDS & non-linear embeddings

Objective: preserve distances / structure in a 2-D map.

Multidimensional scaling (MDS)

Embed objects in low dimensions so pairwise distances match a given dissimilarity matrix — useful when only similarities are known. Minimises a stress objective.

$\text{Stress}=\sum_{iMDS stress: mismatch between embedded and target distances.
Non-linear manifold methods

t-SNE and UMAP preserve local neighbourhood structure for visualisation, revealing clusters that linear PCA may miss.

Worked mini-example (classical MDS = PCA)Given only a pairwise distance matrix, double-centre it ($B=-\tfrac12 J D^{(2)} J$) and take the top eigenvectors of $B$: the resulting embedding is exactly the PCA projection of the original points. So when distances are Euclidean, classical MDS and PCA coincide — they differ only when the input is a non-Euclidean dissimilarity.
Pitfall: t-SNE distorts global geometry — between-cluster distances and cluster sizes in a t-SNE plot are not meaningful, and results swing with the perplexity setting. Use it to spot structure, never to read off quantities.
Key idea: PCA preserves global variance; MDS preserves distances; t-SNE/UMAP preserve local neighbourhoods — match the method to what you need to keep.

Connects to: these embeddings are how you see the clusters of Module 6 and sanity-check high-dimensional data before modelling.

MDS demo PCA demo
  • ESL §14.8 — multidimensional scaling (classical vs. non-metric MDS).
  • ESL §14.9 — nonlinear dimension reduction (local methods, manifolds).
Module 8

Topics in Deep Learning

Sessions 24–27 · modern & generative methods

Four sessions on contemporary deep learning — specialised architectures (CNNs, sequence/attention models), representation learning, and generative models (autoencoders, GANs, diffusion). Exact topics are at the instructor's discretion and may include other advanced material.

Module learning outcomes
  • Recognise key modern architectures and what inductive bias each encodes.
  • Explain the idea behind generative models and representation learning.
24

Topics in deep learning I — convolutional networks

Objective: exploit spatial structure with weight sharing.

CNNs

Convolutional layers share weights across space (translation equivariance) and pool for invariance, drastically cutting parameters for image-like data.

$(f * g)(i,j)=\sum_{m}\sum_{n} f(m,n)\,g(i-m,\,j-n)$Discrete 2-D convolution — the core operation of a conv layer.
Worked mini-example (parameter saving)A fully-connected layer on a $100\times100$ image to 100 hidden units needs $10{,}000\times100=10^6$ weights. A conv layer with one hundred $3\times3$ filters needs only $100\times9=900$ — a thousandfold reduction — yet detects the same edge in every location thanks to weight sharing.
Pitfall: convolution gives translation equivariance, not rotation/scale invariance — a sideways or much larger object can be missed unless you augment the data (rotations, crops, flips) or pool appropriately.
Key idea: the right architectural prior (locality + weight sharing) beats brute-force fully-connected layers on structured data.

Connects to: a CNN is still trained by the gradient descent (S4) and backprop (S5) of Module 1 — only the layer's connectivity changes, encoding a spatial prior.

MLP demo (contrast)
  • Deep Learning ch. 9 — convolutional networks (convolution, pooling, the structured prior).
25

Topics in deep learning II — sequences & attention

Objective: model sequential data and long-range dependencies.

RNNs & attention

Recurrent nets process sequences with shared weights through time (and suffer vanishing gradients); attention and the Transformer let every position attend to every other directly.

$\text{Attention}(Q,K,V)=\operatorname{softmax}\!\Big(\tfrac{QK^\top}{\sqrt{d_k}}\Big)V$Scaled dot-product attention — the heart of the Transformer.
Worked mini-example (why the $\sqrt{d_k}$)If query and key entries are independent with unit variance, the dot product $q^\top k$ over $d_k$ dimensions has variance $d_k$. Without scaling, large $d_k$ pushes the softmax into a near one-hot regime where gradients vanish; dividing by $\sqrt{d_k}$ restores unit-scale logits and keeps the softmax in its responsive range.
Pitfall: self-attention costs $O(n^2)$ in sequence length $n$ (every position attends to every other), so long sequences are expensive — the motivation for the many "efficient attention" variants. RNNs instead suffer the vanishing gradients of S6 over long horizons.
Key idea: attention replaces fixed recurrence with learned, content-based routing of information.

Connects to: the softmax here is the same one from S4; attention is the architecture behind the large language models students use elsewhere.

  • Deep Learning ch. 10 — sequence modeling (RNNs, LSTMs, the vanishing-gradient-over-time problem).
26

Topics in deep learning III — generative models & autoencoders

Objective: learn to represent and generate data.

Autoencoders & VAEs

Learn a compressed latent code by reconstructing the input; variational autoencoders impose a probabilistic latent space enabling sampling.

$\mathcal{L}_{\text{VAE}}=\mathbb{E}_{q(z|x)}[\log p(x|z)] - \mathrm{KL}\big(q(z|x)\,\Vert\,p(z)\big)$The VAE evidence lower bound: reconstruction minus a latent-regularising KL term.
Worked mini-example (linear AE ≈ PCA)A single-hidden-layer linear autoencoder with $k$ units, trained on squared reconstruction loss, learns the same subspace as the top $k$ PCA components — the encoder spans the principal eigenvectors. The nonlinearity and the VAE's KL term are what take you beyond PCA to a smooth, sampleable latent space.
Pitfall: VAEs trade sharpness for a well-behaved latent space — samples are often blurry, and an over-weighted KL term causes "posterior collapse," where the decoder ignores the latent code entirely.
Key idea: generative models learn the data distribution itself, not just a decision boundary — enabling sampling, denoising, and representation learning.

Connects to: the autoencoder bottleneck is non-linear dimensionality reduction (Module 7); the probabilistic latent is the deep cousin of the Gaussian mixture (S21).

PCA demo (linear analogue)
  • Deep Learning ch. 14 (autoencoders), ch. 20 (generative models) (the ELBO & latent-variable models).
27

Topics in deep learning IV — GANs / diffusion & outlook

Objective: survey state-of-the-art generative modelling.

GANs & diffusion

GANs pit a generator against a discriminator in a minimax game; diffusion models learn to reverse a gradual noising process to synthesise samples.

$\min_G \max_D \;\mathbb{E}_{x}[\log D(x)] + \mathbb{E}_{z}[\log(1-D(G(z)))]$The GAN minimax objective.
Worked mini-example (the generator's signal)At the optimum the discriminator outputs $D^\*(x)=\frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_g(x)}$, and the generator's objective then equals minimising the Jensen–Shannon divergence between real and generated distributions — i.e. making $p_g$ match $p_{\text{data}}$. In practice the generator is trained to maximise $\log D(G(z))$ (the "non-saturating" form) to keep early gradients strong.
Pitfall: GAN training is a delicate minimax equilibrium — prone to instability and mode collapse, where the generator produces only a few sample types. Diffusion models trade slower sampling for far more stable training, which is why they now dominate image synthesis.
Key idea: modern generation is an adversarial or denoising game — the model learns to produce samples indistinguishable from real data.

Connects to: the discriminator is just a classifier (Modules 1–5); diffusion's score-matching reuses the gradient-based optimisation thread that runs through the whole course.

  • Deep Learning §20.10 — deep generative models (GANs & the adversarial game).
Module 9

Group Projects & Final Exam

Sessions 28–30 · synthesis & assessment

Students present their group projects, then sit the comprehensive final exam.

Module learning outcomes
  • Apply the full course toolkit to a real dataset end-to-end.
  • Communicate methods and results clearly in a report and presentation.
28

Group presentations (I)

Objective: present the group project (report + in-class talk). Part of the 25% Group Project grade.

Presentations

Teams present their problem, chosen methods, evaluation, and findings. GenAI tools may not be used in the group project.

Key idea: choosing and justifying the right algorithm — and assessing it honestly — is the course's central skill.
29

Group presentations (II)

Objective: remaining group presentations and discussion.

Presentations

Continued team presentations with peer and instructor discussion.

Key idea: critically comparing approaches across teams reinforces model-selection judgement.
30

Final Exam

Objective: comprehensive assessment of the whole course. Worth 40%; minimum 3.5 required to pass.

Coverage

All material: foundations, neural networks, trees, SVMs, kernels, ensembles, clustering, dimensionality reduction, and deep-learning topics.

Key idea: integrate theory, mathematics, and practical judgement across every method covered.
  • Comprehensive review of all course readings

Key Concepts — glossary

A quick-reference glossary of the course's central terms.

Supervised learning
Learning a map from inputs to known targets (labels), so as to predict targets for new inputs.
Unsupervised learning
Finding structure (clusters, low-dimensional representations) in unlabelled data.
k-Nearest Neighbours
Lazy classifier that votes among the $k$ closest training points under a distance metric.
Curse of dimensionality
As dimensions grow, data sparsifies and distances concentrate, breaking distance-based methods.
Perceptron
A single linear unit with a step nonlinearity, trained by correcting misclassifications.
Multi-layer perceptron
A feed-forward network of fully-connected layers with nonlinear activations.
Activation function
The per-neuron nonlinearity (sigmoid, tanh, ReLU, GELU…) controlling gradient flow.
Loss function
A scalar measuring prediction error (squared, cross-entropy, hinge) that training minimises.
Gradient descent
Iteratively step parameters opposite the loss gradient, scaled by the learning rate $\eta$.
Backpropagation
Reverse-mode automatic differentiation — the chain rule computing all weight gradients in one backward pass.
Vanishing/exploding gradients
Gradient signal that shrinks or blows up through many layers, stalling deep-net training.
Overfitting
Fitting noise in the training set, hurting generalisation to new data.
Regularization
Penalising model complexity (L2 ridge, L1 lasso, dropout) to trade bias for lower variance.
Bias–variance tradeoff
Expected error = bias² + variance + noise; complexity must balance the two.
Cross-validation
Rotating train/validation folds to estimate test error with less noise than a single holdout.
Decision tree
Recursive axis-aligned splits chosen to reduce impurity (Gini/entropy); interpretable but high-variance.
Bagging
Averaging learners fit on bootstrap resamples to reduce variance.
Random forest
Bagged trees with random feature subsets per split, de-correlating the ensemble.
Boosting
Sequentially fitting weak learners on reweighted data to reduce bias (e.g. AdaBoost).
Stacking
A meta-learner trained on the predictions of heterogeneous base models.
Support Vector Machine
Maximum-margin classifier; only support vectors determine the boundary; $C$ controls slack.
Margin
The gap $2/\lVert w\rVert$ between the SVM hyperplane and the nearest points.
Kernel trick
Replacing inner products with a kernel $K(x,x')$ to operate in a high-dimensional space implicitly.
RBF kernel
Gaussian similarity $e^{-\gamma\lVert x-x'\rVert^2}$; $\gamma$ sets locality.
k-Means
Centroid clustering minimising within-cluster sum-of-squares via Lloyd's algorithm.
Hierarchical clustering
Agglomerative merging into a dendrogram, cut at any height for flat clusters.
PCA
Orthogonal projection onto top-variance directions (covariance eigenvectors).
MDS
Embedding that preserves pairwise distances by minimising a stress objective.
Softmax / cross-entropy
Multiclass probability output and its matching log-loss for classification.
Generative model
A model of the data distribution itself (VAE, GAN, diffusion) enabling sampling.
Hypothesis class
The set of candidate functions a learner can express; its richness sets the bias–variance starting point.
Generalisation
How well a model trained on a sample performs on unseen data — the real goal of learning.
Training vs. test error
Error on data the model fit vs. on held-out data; the gap between them measures overfitting.
Underfitting
A model too simple to capture the signal — high bias, poor on both train and test sets.
Hyperparameter
A setting fixed before training ($k$, $\lambda$, $C$, $\gamma$, depth, learning rate), tuned on validation data.
Feature scaling
Standardising features to comparable ranges; essential for distance-, gradient-, and penalty-based methods.
Learning rate
The step size $\eta$ in gradient descent — too large diverges, too small crawls.
Stochastic gradient descent
Gradient descent on mini-batches; noisier but far cheaper steps, with a useful regularising effect.
Epoch
One full pass of SGD over the training set.
Dropout
Randomly zeroing units during training to prevent co-adaptation — a stochastic regularizer.
Weight initialisation
Choosing starting weights (Xavier/He) to keep signal and gradient scales stable through depth.
Universal approximation
One hidden layer with enough units can approximate any continuous function; depth makes this efficient.
ReLU
The activation $\max(0,z)$ — cheap, non-saturating for positive inputs, but units can "die" at zero.
Ridge (L2)
Penalty $\lambda\lVert\beta\rVert_2^2$ that shrinks coefficients smoothly toward zero.
Lasso (L1)
Penalty $\lambda\lVert\beta\rVert_1$ that drives some coefficients to exactly zero — automatic feature selection.
Hinge loss
$\max(0,1-y\,f(x))$ — the SVM's loss; zero once correct with margin, linear in the violation.
Impurity (Gini / entropy)
Node-level class-mixing measures a tree minimises when choosing splits.
Out-of-bag error
Free validation estimate from the ~37% of points each bootstrap sample omits.
Bootstrap
Sampling the data with replacement to create resampled training sets (the basis of bagging).
Mercer / PSD kernel
A kernel whose Gram matrix is positive semi-definite, guaranteeing a valid feature space and convex problem.
Gram matrix
The $n\times n$ matrix of pairwise kernel values $K(x_i,x_j)$ on which kernel methods operate.
Silhouette score
Per-point cluster quality in $[-1,1]$ comparing intra- to nearest-cluster distance; used to choose $K$.
Linkage
The inter-cluster distance rule (single, complete, average, Ward) shaping a hierarchical clustering.
EM algorithm
Alternating expectation/maximisation steps that fit latent-variable models such as Gaussian mixtures.
Variance explained
The fraction $\lambda_k/\sum_j\lambda_j$ of total variance captured by principal component $k$.
Convolution / weight sharing
Sliding a small shared filter across an input for translation-equivariant, parameter-efficient layers.
Attention
Content-based weighting that lets each position draw on every other — the core of the Transformer.
ELBO
The evidence lower bound a VAE maximises: reconstruction quality minus a latent-regularising KL term.
Mode collapse
A GAN failure where the generator produces only a few sample types, ignoring data diversity.

Bibliography

Recommended texts. All three are freely available online.

Hastie, Tibshirani & Friedman — The Elements of Statistical Learning (2nd ed., 2017)
Springer · ISBN 0387848576 · free PDF

The course's statistical backbone. Definitive treatment of supervised/unsupervised learning: linear methods, trees, SVMs & kernels, ensembles, clustering, and PCA/MDS. The primary reference for Modules 3–7.

Michael A. Nielsen — Neural Networks and Deep Learning (2015)
Determination Press · neuralnetworksanddeeplearning.com

An accessible, intuition-first introduction to neural nets and backpropagation. Best companion to Module 1 (perceptrons, MLPs, the cross-entropy cost, and why deep nets are hard to train).

Goodfellow, Bengio & Courville — Deep Learning (2016)
MIT Press · ISBN 0262035618 · deeplearningbook.org

The comprehensive deep-learning reference. Covers optimisation, regularisation, convolutional and sequence models, and generative models — the main support for Module 8 and the deep-learning portions of Module 1.

Behaviour, attendance, and ethics policies follow the University's Code of Conduct, Attendance Policy, and Ethics Code; the Program Director may provide further indications. See the official SYLLABUS.pdf for full policy text and links.