A full, syllabus-driven map of AI: Statistical Learning & Prediction — every module and all 30 sessions, with the core idea, the key formula, and pointers to the matching interactive demo. Use this page to study ahead, review, and connect the live visualisations on the demos page to where they appear in the course.
This course introduces some of the fundamental algorithms in machine learning, with an emphasis on their theoretical foundations and underlying mathematical principles. Examples using a variety of datasets are presented to build intuition for how the different methods work, and assignments give hands-on experience applying the methods to different types of data. Topics include nearest neighbours, neural networks, support vector machines, trees, clustering, dimensionality reduction, and generative models. Many lectures include a short coding demo in Python.
30 live in-person sessions combining theory, worked examples, and short Python coding demos. Most new techniques are followed immediately by one or more examples.
Calculus for CS · Matrices & Linear Transformations · Probability for CS · Machine Learning Foundations · Computer Programming II.
Luciano Dyballa, Assistant Professor (School of Science & Technology). Ph.D. Computer Science, Yale — machine learning, vision, computational neuroscience. Contact: ldyballa@faculty.ie.edu (office hours on request).
The interactive demos page mirrors this syllabus: k-NN, regression, bias–variance, gradient descent, MLP/backprop, trees, bagging, random forest, AdaBoost, kernels, SVM, PCA, MDS, k-means, hierarchical clustering and more — all running client-side.
Each prerequisite is load-bearing: the course is built on their machinery, and the table below makes the dependency explicit so you can refresh the right tool before the session that needs it.
By the end of the course, students should be able to:
Before each class, students prepare assignments and readings at home. Lectures combine theoretical explanation with practical examples and a short coding demonstration. Active in-class participation is essential to acquiring the skills to understand, implement, and apply each algorithm. Problem sets build intuition behind the theory and the coding skills to implement the algorithms; brief quizzes throughout the semester check understanding of previously taught material and let the instructor track the class's progress.
Estimated student time: Lectures 45 h · Individual studying 45 h · Exercises/async/field work 30 h · Discussions 20 h · Group work 10 h.
Continuous evaluation across four components:
Comprehensive, in Session 30. You must score at least 3.5 on it to pass the course overall — even if your other assessments would otherwise be passing. Format: closed-book, on-campus; mixes short conceptual questions, derivations, and "choose-and-justify" model-selection problems. Tip: practise re-deriving the key formulas on this page from scratch rather than memorising them.
Held in Session 8, covering the foundations and neural-network material from Sessions 1–7. Format: closed-book; expect a hand-worked gradient-descent / backprop step and reasoning about over/under-fitting. Tip: the MLP and gradient-descent demos let you sanity-check the mechanics you will be asked to reproduce.
A written report plus an in-class presentation (Sessions 28–29). Deliverables: a report (problem, data, methods, evaluation, discussion) and slides. GenAI may not be used. Tip: pick a dataset early, justify every modelling choice, and report honest baselines and error bars.
Selected exercises turned in throughout the course (may be presented in class), plus active participation, questions/remarks, punctuality, and class conduct. Tip: turn in every announced exercise even when imperfect — consistency and engagement are what is rewarded.
| Component | What earns top marks | Common point-losers |
|---|---|---|
| Final / Midterm | Correct derivations with each step justified; precise definitions; model choices argued from bias–variance and the data, not by name-dropping. | Memorised formulas applied to the wrong setting; skipped algebra; "it works well" with no reasoning. |
| Group report | Clear problem framing, principled train/validation/test protocol, baselines, appropriate metrics with uncertainty, honest discussion of limitations. | Test-set leakage, single metric with no baseline, unjustified hyperparameters, results without error analysis. |
| Presentation | Tight narrative, one idea per slide, a readable figure of the result, clean handling of questions, all members contributing. | Wall-of-text slides, no visual of the result, running over time, one person presenting everything. |
| Participation | Exercises submitted on time, thoughtful questions and remarks, helping peers reason, punctuality. | Missing submissions, passive attendance, lateness. |
The 30 sessions are grouped below into nine thematic modules. Each session lists its objective, per-topic explanations, the core formula or definition where the material is technical, a "key idea" takeaway, suggested readings, and links to the matching interactive demo.
Establishes the supervised-learning setting and quickly builds the neural-network toolkit: linear classifiers and the perceptron, the multi-layer perceptron, the loss/activation choices that shape training, and the optimisation machinery (gradient descent and backpropagation) that makes deep networks work.
Objective: frame the supervised-learning problem and meet the simplest non-parametric classifier.
The learning setting: features $x$, targets $y$, a hypothesis class, and the goal of generalising to unseen data. Regression vs. classification; training vs. test error.
A lazy, non-parametric rule: predict using the $k$ closest training points under a distance metric. No training phase — all computation happens at query time. Small $k$ gives a jagged, high-variance boundary; large $k$ smooths but biases it.
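The rule above fits in a few lines. The following is a minimal sketch, assuming Euclidean distance and a plain majority vote; the function name `knn_predict` and the toy data are illustrative, not taken from the course materials.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # There is no training phase: every distance is computed at query time.
    dists = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    nearest = sorted(dists, key=lambda d: d[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small groups of 2-D points.
train_X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (1.0, 1.0), (0.9, 1.2), (1.1, 0.8)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (0.1, 0.1), k=3))
print(knn_predict(train_X, train_y, (1.0, 0.9), k=3))
```

Ties are broken here by whichever label was counted first, which is one arbitrary choice among several.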
In high dimensions, data becomes sparse and almost all points are roughly equidistant, so "nearest" loses meaning and volume concentrates in thin shells — distance-based methods degrade and require exponentially more data. Concretely, to keep a fixed fraction $r$ of the data within a neighbourhood in $p$ dimensions, the neighbourhood's edge must span $r^{1/p}$ of each axis: capturing 10% of the data needs edge length $0.10^{1/10}\approx 0.79$ — 79% of the whole range, so the "local" neighbourhood is almost global.
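The $r^{1/p}$ arithmetic can be checked directly. This sketch assumes data spread uniformly over a unit hypercube; `edge_fraction` is a hypothetical helper name.

```python
def edge_fraction(r, p):
    """Edge length, as a fraction of each axis, of a sub-cube that
    captures a fraction r of uniformly distributed data in p dimensions."""
    return r ** (1.0 / p)

# The neighbourhood needed to hold 10% of the data grows quickly with p.
for p in (1, 2, 10, 100):
    print(p, round(edge_fraction(0.10, p), 3))
```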
Connects to: $k$ is a bias–variance dial (S13) — small $k$ = low bias/high variance; large $k$ = the reverse — and the distance breakdown motivates the dimensionality reduction of Module 7.
k-NN demo
Objective: recall linear decision boundaries and meet the perceptron learning rule.
A hyperplane $w^\top x + b = 0$ separates the input space; the sign of the score decides the class. Geometry of weight vectors, bias/offset, and margins.
An online, mistake-driven algorithm: on each misclassified example, nudge the weights toward correctness. Guaranteed to converge if the data is linearly separable.
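A minimal sketch of the mistake-driven update, assuming labels in $\{-1,+1\}$ and a unit learning rate; the function names and the AND dataset are illustrative choices.

```python
def train_perceptron(X, y, epochs=100, lr=1.0):
    """Perceptron rule for labels in {-1, +1}; returns (w, b)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score <= 0:          # misclassified, or exactly on the boundary
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
                mistakes += 1
        if mistakes == 0:                # one full clean pass: stop
            break
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1

# Linearly separable toy data: logical AND with -1/+1 labels.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, -1, -1, 1]
w, b = train_perceptron(X, y)
print(w, b, [predict(w, b, xi) for xi in X])
```

On data that is not linearly separable the loop would simply run out of epochs without a clean pass.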
Connects to: swap the step for a sigmoid and the mistake-driven update for a gradient and you have logistic regression (S4); stack many such units and you get the MLP (S3).
Logistic regression demo
Objective: stack neurons into layers to represent non-linear functions.
Neurons as weighted sums passed through a nonlinearity; layers as composed transformations. A single linear layer cannot solve XOR because the four points $(0,0),(1,1)\!\to\!0$ and $(0,1),(1,0)\!\to\!1$ are not linearly separable: no single line splits them. The universal-approximation theorem guarantees that one hidden layer with enough units (and a suitable non-linear activation) can approximate any continuous function on a compact domain to arbitrary accuracy, though depth often reaches the same accuracy with far fewer units.
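To make the XOR point concrete, here is a sketch of a one-hidden-layer network that computes XOR exactly. The weights are hand-picked for illustration (an assumption, not learned values), and ReLU is one activation choice among many.

```python
def relu(z):
    return max(0.0, z)

def xor_net(x1, x2):
    """Two ReLU hidden units followed by a linear output unit."""
    h1 = relu(x1 + x2)          # counts how many inputs are on
    h2 = relu(x1 + x2 - 1.0)    # fires only when both inputs are on
    return h1 - 2.0 * h2        # subtract the "both on" case twice

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))
```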
A feed-forward network of fully-connected layers. Hidden layers learn intermediate representations, which gives the network the capacity to carve curved decision boundaries.
Connects to: the affine-then-activation pattern here is exactly what backprop (S5) differentiates layer by layer, and the activation choice drives the gradient pathologies of S6.
MLP & backprop demo · Activation functions demo
Objective: define what to minimise and how to start minimising it.
How wrong a prediction is: squared error (L2) for regression, cross-entropy/log-loss for classification, hinge for SVMs. L2 is outlier-sensitive; L1 is robust but non-smooth; Huber blends the two.
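The losses named above can be written out per example. This is a sketch using common textbook conventions that are assumptions here: labels in $\{0,1\}$ for log-loss, labels in $\{-1,+1\}$ for hinge, and a Huber threshold `delta` of 1.

```python
import math

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    return abs(y - y_hat)

def huber_loss(y, y_hat, delta=1.0):
    """Quadratic near zero, linear in the tails."""
    r = abs(y - y_hat)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)

def log_loss(y, p):
    """Binary cross-entropy for y in {0, 1} and predicted probability p in (0, 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def hinge_loss(y, score):
    """Hinge loss for y in {-1, +1} and a real-valued score."""
    return max(0.0, 1.0 - y * score)

# One large residual: the squared loss reacts far more strongly than L1 or Huber.
print(squared_loss(0.0, 10.0), absolute_loss(0.0, 10.0), huber_loss(0.0, 10.0))
```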
Sigmoid and tanh saturate (vanishing gradients); ReLU is cheap but can "die"; LeakyReLU/ELU/GELU/Swish are modern fixes that keep gradients flowing.
Move parameters opposite the gradient of the loss. Learning rate controls step size — too large diverges, too small crawls.
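The effect of the learning rate shows up even on a one-dimensional quadratic. A minimal sketch, assuming the illustrative objective $f(x)=x^2$ and three hypothetical step sizes:

```python
def gradient_descent(grad, x0, lr, steps):
    """Plain gradient descent on a scalar parameter; returns the whole trajectory."""
    x = x0
    path = [x]
    for _ in range(steps):
        x = x - lr * grad(x)    # step against the gradient
        path.append(x)
    return path

def grad_f(x):
    return 2.0 * x              # gradient of f(x) = x**2, minimised at x = 0

good = gradient_descent(grad_f, x0=1.0, lr=0.1, steps=50)    # x shrinks by a factor 0.8 per step
slow = gradient_descent(grad_f, x0=1.0, lr=0.001, steps=50)  # x shrinks by a factor 0.998 per step
bad = gradient_descent(grad_f, x0=1.0, lr=1.1, steps=50)     # x is multiplied by -1.2 per step

print(good[-1], slow[-1], bad[-1])
```

For this objective each update multiplies $x$ by $1-2\,\mathrm{lr}$, so any rate above 1 makes the iterates grow in magnitude.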
Connects to: the learning-rate behaviour above scales to every network here; saturating activations foreshadow vanishing gradients (S6), and hinge loss reappears as the SVM objective (S10–11).
Loss functions demo · Activations demo · Gradient descent demo
Objective: compute gradients through a deep network efficiently.
The chain rule applied layer-by-layer: a forward pass caches activations, then errors are propagated backward to give every weight's gradient in one sweep. This is what makes training deep nets tractable.
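A sketch of the forward-cache-then-backward pattern on the smallest possible network: one tanh hidden unit and a linear output with squared loss. The architecture and parameter values are assumptions chosen for illustration, and the analytic gradients are compared against central finite differences.

```python
import math

def forward(params, x, y):
    """Forward pass; returns the loss and the cached values backprop needs."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    y_hat = w2 * h + b2
    loss = 0.5 * (y_hat - y) ** 2
    return loss, (x, h, y_hat)

def backward(params, cache, y):
    """Chain rule from the loss back to every parameter, in one sweep."""
    w1, b1, w2, b2 = params
    x, h, y_hat = cache
    d_yhat = y_hat - y              # dL/dy_hat
    d_w2, d_b2 = d_yhat * h, d_yhat
    d_h = d_yhat * w2
    d_z = d_h * (1.0 - h * h)       # tanh'(z) = 1 - tanh(z)**2
    d_w1, d_b1 = d_z * x, d_z
    return [d_w1, d_b1, d_w2, d_b2]

def numeric_grad(params, x, y, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((forward(up, x, y)[0] - forward(down, x, y)[0]) / (2 * eps))
    return grads

params = [0.5, -0.3, 1.2, 0.1]      # hypothetical w1, b1, w2, b2
loss, cache = forward(params, x=0.8, y=1.0)
analytic = backward(params, cache, y=1.0)
numeric = numeric_grad(params, x=0.8, y=1.0)
print(analytic)
print(numeric)
```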
Connects to: this is the algorithm every framework's .backward() implements, and the per-layer $\sigma'$ product directly explains the training pathologies of S6.
Objective: understand why deep nets are hard to train and how to fix it.
Repeated multiplication of derivatives through many layers shrinks or blows up the gradient signal, stalling or destabilising training — especially with saturating activations.
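The shrink-or-blow-up behaviour can be illustrated with a deliberately simplified scalar chain, assuming every layer contributes the same factor `weight * sigmoid'(z)`. Real networks have different factors per layer, so this is a caricature rather than a model of any specific network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)    # never exceeds 0.25, its value at z = 0

def gradient_scale(depth, z=0.0, weight=1.0):
    """Product of identical per-layer factors weight * sigmoid'(z) over `depth` layers."""
    return (weight * sigmoid_prime(z)) ** depth

print(gradient_scale(10))               # vanishing: 0.25 ** 10
print(gradient_scale(10, weight=8.0))   # exploding: 2.0 ** 10
```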
Practical remedies: better activations (ReLU family), careful weight initialisation (Xavier/He), normalisation, regularization (L2, dropout), and better optimisers (momentum, adaptive methods such as Adam).
Why depth helps representationally (each layer composes features of the previous one, so some functions that need exponentially many units in one wide layer can be expressed with linearly many in a deep stack), and the engineering needed to make many layers trainable in practice.
Exploding gradients typically show up as NaN losses. Gradient clipping and careful initialisation are the standard fixes.
Connects to: the $\sigma'$ product from backprop (S5) is the root cause here, and L2/dropout regularization links forward to the losses-and-regularizers theory of S12.
Activations demo · SGD & softmax demo
A consolidation session with exercises pulling together the neural-network material, followed by the midterm exam.
Objective: consolidate Sessions 1–6 through worked problems.
End-to-end recap: from the supervised setting and k-NN through perceptrons, MLPs, losses/activations, gradient descent, backprop, and the training pathologies of deep nets — practised on exercises.
Objective: assess mastery of foundations and neural networks. Worth 25% of the overall grade.
Material from Sessions 1–7: supervised learning, k-NN, curse of dimensionality, linear classifiers/perceptron, MLPs, loss & activation functions, gradient descent, backpropagation, and deep-network training issues.
Moves to interpretable tree models and the linear SVM, then formalises the theory tying them together: the choice of loss and regularizer, and the bias–variance decomposition that explains over- and under-fitting.
Objective: learn axis-aligned, interpretable partitions of feature space.
Recursively split on the feature/threshold that most reduces impurity; each leaf predicts the majority class (or mean) of its region. Deep trees overfit; pruning / depth limits control complexity.
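One step of the recursion, choosing the best threshold on a single feature, can be sketched as follows. Gini impurity is assumed as the impurity measure (entropy would work the same way), and the helper names and data are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(xs, ys):
    """Best threshold on one feature; returns (threshold, weighted_impurity)."""
    best = (None, gini(ys))                 # baseline: no split at all
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0                 # midpoint between consecutive distinct values
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
ys = ["no", "no", "no", "yes", "yes", "yes"]
print(best_split(xs, ys))
```

A full tree would apply the same search to every feature and then recurse on the two resulting subsets until a depth or purity limit is reached.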
Connects to: the high variance here is precisely what random forests (S14) average away, and trees are the weak learners boosting (S15) reweights.
Decision tree demo
Objective: find the maximum-margin separating hyperplane.
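Among all separating hyperplanes, choose the one that maximises the margin, the distance to the closest points (the support vectors). With the constraints scaled so that $y_i(w^\top x_i+b)\ge 1$, the margin equals $1/\lVert w\rVert$, so maximising it is equivalent to minimising $\lVert w\rVert$ (in practice $\tfrac12\lVert w\rVert^2$).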
Connects to: maximising the margin is minimising $\lVert w\rVert^2$ — the same L2 penalty seen in ridge regression (S12); the dual sets up kernels (Module 5).
SVM demo
Objective: handle non-separable data with soft margins and the dual.
Introduce slack variables $\xi_i$ to allow violations, controlled by penalty $C$ — large $C$ ≈ hard margin (low bias, high variance), small $C$ tolerates more errors for a wider margin. Equivalent to hinge loss + L2.
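For reference, the soft-margin problem can be written in its standard textbook form, assuming labels $y_i\in\{-1,+1\}$ and $n$ training points (notation introduced here, not fixed by the text above):

$$
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad y_i\,(w^\top x_i + b)\ \ge\ 1-\xi_i,\quad \xi_i\ \ge\ 0 .
$$

At the optimum each slack takes its smallest feasible value, $\xi_i=\max\bigl(0,\ 1-y_i(w^\top x_i+b)\bigr)$, which turns the constrained problem into the hinge-plus-L2 form:

$$
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\max\bigl(0,\ 1-y_i\,(w^\top x_i+b)\bigr).
$$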
The Lagrangian dual expresses the solution purely through inner products of data points — setting up the kernel trick in Module 5. Only support vectors get non-zero dual coefficients $\alpha_i$, so the final classifier $f(x)=\sum_i\alpha_i y_i\langle x_i,x\rangle+b$ depends on a sparse subset of the data.
Connects to: "hinge + L2" is one instance of the loss+regularizer template formalised in S12; the inner-product form is the literal entry point for kernels (S17–18).
SVM demo (try the C slider)
Objective: unify models through their loss + penalty.
A common lens: squared, logistic, hinge, absolute. Each defines a different notion of "wrong" and a different optimum.
Penalise coefficient magnitude to trade bias for reduced variance. Ridge (L2) shrinks smoothly; Lasso (L1) drives coefficients to exactly zero, performing feature selection.
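The difference between smooth shrinkage and exact zeros is easiest to see with a single feature and no intercept, where both estimators have closed forms. This sketch assumes the objectives stated in the comments; other scalings of $\lambda$ are common and shift the numbers but not the behaviour.

```python
def soft_threshold(a, lam):
    """Shrink a towards zero by lam, clipping to exactly zero inside [-lam, lam]."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0

def fit_1d(xs, ys, lam, penalty):
    """Closed-form one-feature fit without intercept."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    if penalty == "ridge":    # minimises 0.5 * sum((y - w*x)**2) + 0.5 * lam * w**2
        return sxy / (sxx + lam)
    if penalty == "lasso":    # minimises 0.5 * sum((y - w*x)**2) + lam * abs(w)
        return soft_threshold(sxy, lam) / sxx
    raise ValueError(f"unknown penalty: {penalty}")

xs = [1.0, 2.0, 3.0]
ys = [0.5, 1.0, 1.5]          # exactly y = 0.5 * x, so the unpenalised fit is 0.5

for lam in (0.0, 3.5, 10.0, 1000.0):
    print(lam, fit_1d(xs, ys, lam, "ridge"), fit_1d(xs, ys, lam, "lasso"))
```

Ridge never reaches zero for finite $\lambda$, while Lasso is exactly zero once $\lambda$ exceeds $|\sum_i x_i y_i|$.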
Connects to: the same L2 penalty is the SVM's $\lVert w\rVert^2$ (S10) and weight decay in nets (S6); choosing $\lambda$ is choosing a point on the bias–variance curve (S13) via cross-validation.
Regularization demo · Loss functions demo
Objective: explain generalisation error formally.
Expected test error splits into (squared) bias, variance, and irreducible noise. Simple models: high bias, low variance; complex models: low bias, high variance. The best complexity minimises their sum.
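Written out for squared error at a fixed input $x$, the decomposition takes the following standard form. The notation is an assumption introduced here: $y=f(x)+\varepsilon$ with $\mathbb{E}[\varepsilon]=0$ and $\operatorname{Var}(\varepsilon)=\sigma^2$, and the expectation runs over training sets and noise.

$$
\mathbb{E}\bigl[(y-\hat f(x))^2\bigr]
= \underbrace{\bigl(\mathbb{E}[\hat f(x)]-f(x)\bigr)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\bigl[(\hat f(x)-\mathbb{E}[\hat f(x)])^2\bigr]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}} .
$$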
Connects to: this single equation explains every "dial" in the course ($k$ in S1, tree depth in S9, $C$ in S11, $\lambda$ in S12, $\gamma$ in S18) and motivates why bagging (S14) attacks variance while boosting (S15) attacks bias.
Bias–variance demo · Cross-validation demo
Three ways to combine weak/high-variance learners into a strong one: bagging (parallel, variance-reducing), boosting (sequential, bias-reducing), and stacking (meta-learning over heterogeneous models).
Objective: reduce variance by averaging over bootstrap resamples.
Train many high-variance learners on bootstrap samples of the data and average their predictions. Variance falls while bias stays roughly fixed.
Bagging + a random feature subset at each split de-correlates the trees, cutting ensemble variance further than bagging alone.
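Why de-correlation matters can be seen from the variance of an average. Assuming $B$ identically distributed predictors, each with variance $\sigma^2$ and pairwise correlation $\rho$ (symbols introduced here for illustration):

$$
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B}\hat f_b(x)\right)
= \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 .
$$

Adding more trees shrinks only the second term; the first term is a floor set by $\rho$, which is the quantity the random feature subsets are meant to reduce.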
Connects to: this is the bias–variance theory of S13 in action — bagging targets the variance term and leaves bias unchanged; contrast with boosting (S15), which targets bias.
Bagging demo · Random forest demo
Objective: reduce bias by sequentially focusing on hard examples.
Fit weak learners (e.g. stumps) in sequence, re-weighting misclassified points so later learners concentrate on them; combine with weights tied to each learner's accuracy.
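A single reweighting round can be sketched as follows, assuming labels in $\{-1,+1\}$ and the classic discrete AdaBoost formulas; the weak learner's predictions are supplied by hand rather than fitted.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost reweighting step. Returns (alpha, new_weights)."""
    # Weighted error of the current weak learner (needs 0 < err < 1).
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - err) / err)      # the learner's vote in the ensemble
    # Up-weight mistakes, down-weight correct points, then renormalise.
    raw = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [r / total for r in raw]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]       # hypothetical weak learner that misses the last point
weights = [0.2] * 5               # uniform starting weights

alpha, new_w = adaboost_round(weights, y_true, y_pred)
print(alpha, new_w)
```

After the update the misclassified points carry half of the total weight, so the learner just fitted is no better than chance on the new distribution and the next learner is pushed towards different examples.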
Connects to: AdaBoost is forward stagewise additive modelling under exponential loss (S12 loss view); it complements bagging's variance reduction (S14) on the bias side of S13.
AdaBoost demo
Objective: combine heterogeneous models with a meta-learner.
Train several different base models, then train a meta-model on their (cross-validated) predictions to learn the best combination. Captures complementary strengths that a single family misses.
Connects to: stacking pays off most when base models are diverse (a tree, an SVM, a net make different errors) — the same de-correlation principle that powers random forests (S14), now across model families.
Random forest demo · Boosting demo
Generalise linear methods to non-linear ones by implicitly mapping data to high-dimensional spaces. Kernels compute inner products in that space without ever forming the feature map.
Objective: introduce the kernel trick.
If an algorithm depends on data only through inner products, replace $x^\top x'$ with a kernel $K(x,x')=\langle\phi(x),\phi(x')\rangle$ — computing the high-dimensional inner product implicitly, no explicit $\phi$ needed.
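The identity $K(x,x')=\langle\phi(x),\phi(x')\rangle$ can be verified numerically for a small case. This sketch assumes the homogeneous degree-2 polynomial kernel on 2-D inputs, for which an explicit feature map is easy to write down; the function names are illustrative.

```python
import math

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel K(x, z) = (x . z)**2."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def phi(x):
    """Explicit feature map for 2-D inputs matching poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))   # inner product in feature space
implicit = poly2_kernel(x, z)                            # same number, no phi needed
print(explicit, implicit)
```

For higher degrees or the RBF kernel the explicit map grows very large or infinite-dimensional, while the kernel evaluation stays a single inner product in the input space.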
Connects to: this only works because the SVM dual (S11) touches data solely through inner products; the same trick kernelises ridge regression (S12), PCA (S22), and clustering.
Kernel trick demo
Objective: apply kernels in the SVM and beyond.
The kernelised dual SVM produces curved decision boundaries. RBF $\gamma$ trades locality (high $\gamma$ = wiggly, overfit) against smoothness; with $C$ it forms the model's two-knob bias–variance control.
Valid kernels correspond to positive-semidefinite Gram matrices (Mercer's condition), guaranteeing a well-posed convex problem. Kernels also extend to ridge regression, PCA, and clustering.
Connects to: $(\gamma,C)$ is the two-knob bias–variance control of S13; Mercer's PSD condition is the same property a covariance matrix has, linking kernels to PCA (S22).
SVM demo (RBF / poly kernels)
Unsupervised learning: discover groups in unlabelled data. Covers centroid-based (k-means), hierarchical (agglomerative + dendrograms), and density/model-based approaches, plus how to choose the number of clusters.
Objective: partition data into k compact clusters.
Alternate between assigning points to the nearest centroid and recomputing centroids as cluster means — greedy descent on within-cluster sum-of-squares. Sensitive to initialisation; k-means++ seeds smartly.
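The two alternating steps can be sketched in plain Python. To keep the example deterministic, the initial centroids are passed in by the caller rather than chosen by k-means++; that choice, the function name and the toy points are assumptions for illustration.

```python
import math

def kmeans(points, centroids, iters=100):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for each point.
        labels = [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)   # leave an empty cluster's centroid in place
        if new_centroids == centroids:      # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(centroids, labels)
```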
Connects to: k-means is the hard-assignment limit of a Gaussian mixture fit by EM (S21); its spherical-cluster assumption is exactly what hierarchical (S20) and density methods relax.
k-means demo
Objective: build a nested clustering without fixing k.
Start with each point as its own cluster and repeatedly merge the closest pair. The merge order forms a dendrogram; cut it at any height for a flat clustering.
Connects to: unlike k-means (S19) you need not fix $K$ in advance — but you still need a rule (gap statistic, silhouette in S21) to choose the cut height.
Hierarchical clustering demo
Objective: cluster non-spherical data and pick the number of clusters.
Methods such as DBSCAN find arbitrarily-shaped clusters and label outliers; Gaussian mixture models give a soft, probabilistic clustering via EM.
Elbow on WCSS, silhouette score, and gap statistic help select the cluster count; validation guards against over-segmentation.
DBSCAN's $\varepsilon$ and minPts are unintuitive to set, and one global $\varepsilon$ fails when clusters have very different densities. GMM/EM, like k-means, only finds a local optimum and can collapse a component onto a single point (a degenerate zero-variance solution).
Connects to: GMM's soft, probabilistic assignments are the generative-model view (S26); density estimation underlies anomaly detection.
Anomaly / density demo · k-means demo
Project high-dimensional data into a few informative dimensions: linearly via PCA, and through distance-preserving embeddings (MDS) and non-linear manifold methods for visualisation.
Objective: find the directions of greatest variance.
Find orthogonal axes (eigenvectors of the covariance matrix) capturing maximal variance; project onto the top few for a low-dimensional, decorrelated representation. Equivalently minimises reconstruction error.
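A dependency-free sketch of the first principal component: centre the data, form the covariance matrix, and extract its leading eigenvector. Power iteration is used here as a stand-in for a full eigendecomposition, and it assumes the starting vector is not orthogonal to the leading eigenvector. The data are made up for illustration.

```python
import math

def covariance_matrix(data):
    """Sample covariance matrix and the mean-centred data."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centred = [[row[j] - mean[j] for j in range(d)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centred

def top_component(cov, iters=200):
    """Leading eigenvector of a symmetric PSD matrix via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
cov, centred = covariance_matrix(data)
pc1 = top_component(cov)
scores = [sum(a * b for a, b in zip(row, pc1)) for row in centred]   # 1-D projection
var_scores = sum(s * s for s in scores) / (len(scores) - 1)
print(pc1, var_scores)
```

The variance of the projected scores exceeds the variance of either original coordinate, which is the sense in which the first component captures maximal variance.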
Connects to: PCA is the linear, global-variance counterpart of the manifold methods in S23, and a principled antidote to the curse of dimensionality (S1); the eigendecomposition reuses the linear-algebra prerequisite.
PCA demo
Objective: preserve distances / structure in a 2-D map.
Embed objects in low dimensions so pairwise distances match a given dissimilarity matrix — useful when only similarities are known. Minimises a stress objective.
t-SNE and UMAP preserve local neighbourhood structure for visualisation, revealing clusters that linear PCA may miss.
Connects to: these embeddings are how you see the clusters of Module 6 and sanity-check high-dimensional data before modelling.
MDS demo · PCA demo
Four sessions on contemporary deep learning: specialised architectures (CNNs, sequence/attention models), representation learning, and generative models (autoencoders, GANs, diffusion). Exact topics are at the instructor's discretion and may include other advanced material.
Objective: exploit spatial structure with weight sharing.
Convolutional layers share weights across space (translation equivariance) and pool for invariance, drastically cutting parameters for image-like data.
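Weight sharing and translation equivariance can be demonstrated in one dimension. This is a minimal sketch assuming a "valid" cross-correlation (no padding) and a hand-picked three-tap kernel; images work the same way with two spatial axes.

```python
def conv1d(signal, kernel):
    """'Valid' 1-D cross-correlation: the same kernel weights slide over every position."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

kernel = [1.0, 0.0, -1.0]                              # 3 shared weights, whatever the input length
signal = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
shifted = [0.0] + signal[:-1]                          # the same pattern moved one step right

out = conv1d(signal, kernel)
out_shifted = conv1d(shifted, kernel)
print(out)
print(out_shifted)                                     # the response moves one step right too
```

A fully-connected layer mapping the same 8 inputs to 6 outputs would need 48 weights; the convolution uses 3.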
Connects to: a CNN is still trained by the gradient descent (S4) and backprop (S5) of Module 1 — only the layer's connectivity changes, encoding a spatial prior.
MLP demo (contrast)
Objective: model sequential data and long-range dependencies.
Recurrent nets process sequences with shared weights through time (and suffer vanishing gradients); attention and the Transformer let every position attend to every other directly.
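The "every position attends to every other" idea can be sketched with scaled dot-product attention on plain lists. Learned query/key/value projections and multiple heads are deliberately omitted, so this illustrates the core operation only, not a full Transformer layer.

```python
import math

def softmax(scores):
    m = max(scores)                                   # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of vectors; returns (outputs, weights)."""
    d = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return outputs, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # three positions, two features each
outputs, weights = attention(X, X, X)        # self-attention without learned projections
for row in weights:
    print([round(w, 3) for w in row])
```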
Connects to: the softmax here is the same one from S4; attention is the architecture behind the large language models students use elsewhere.
Objective: learn to represent and generate data.
Learn a compressed latent code by reconstructing the input; variational autoencoders impose a probabilistic latent space enabling sampling.
Connects to: the autoencoder bottleneck is non-linear dimensionality reduction (Module 7); the probabilistic latent is the deep cousin of the Gaussian mixture (S21).
PCA demo (linear analogue)
Objective: survey state-of-the-art generative modelling.
GANs pit a generator against a discriminator in a minimax game; diffusion models learn to reverse a gradual noising process to synthesise samples.
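The minimax game is usually written in the following form (the original GAN objective). The notation is introduced here: $G$ is the generator, $D$ the discriminator, $p_{\text{data}}$ the data distribution and $p_z$ the noise prior.

$$
\min_{G}\ \max_{D}\ \
\mathbb{E}_{x\sim p_{\text{data}}}\bigl[\log D(x)\bigr]
+ \mathbb{E}_{z\sim p_z}\bigl[\log\bigl(1-D(G(z))\bigr)\bigr] .
$$

The discriminator's side of this objective is the negative of the binary cross-entropy for separating real from generated samples.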
Connects to: the discriminator is just a classifier (Modules 1–5); diffusion's score-matching reuses the gradient-based optimisation thread that runs through the whole course.
Students present their group projects, then sit the comprehensive final exam.
Objective: present the group project (report + in-class talk). Part of the 25% Group Project grade.
Teams present their problem, chosen methods, evaluation, and findings. GenAI tools may not be used in the group project.
Objective: remaining group presentations and discussion.
Continued team presentations with peer and instructor discussion.
Objective: comprehensive assessment of the whole course. Worth 40%; minimum 3.5 required to pass.
All material: foundations, neural networks, trees, SVMs, kernels, ensembles, clustering, dimensionality reduction, and deep-learning topics.
A quick-reference glossary of the course's central terms.
Recommended texts. All three are freely available online.
The course's statistical backbone. Definitive treatment of supervised/unsupervised learning: linear methods, trees, SVMs & kernels, ensembles, clustering, and PCA/MDS. The primary reference for Modules 3–7.
An accessible, intuition-first introduction to neural nets and backpropagation. Best companion to Module 1 (perceptrons, MLPs, the cross-entropy cost, and why deep nets are hard to train).
The comprehensive deep-learning reference. Covers optimisation, regularisation, convolutional and sequence models, and generative models — the main support for Module 8 and the deep-learning portions of Module 1.