AI: Machine Learning Foundations
Bachelor in Computer Science & Artificial Intelligence (BCSAI) · IE University · a 30-session, syllabus-driven map of the course — every module and every session, with the interactive ml-lab demos cross-linked where they bring a concept to life.
Artificial Intelligence has moved into the mainstream, driven by advances in cloud computing, big data, open-source software, and improved algorithms, and it is fundamentally altering how we work, live, and manage businesses. Machine Learning is the cornerstone of that shift: systems that are not directly programmed to solve a problem, but instead build their own program from examples or from trial-and-error experience.
This course introduces the field and sets up a framework of knowledge for making informed analysis of the opportunities and challenges of applying ML in business. It blends a theoretical/conceptual approach with a hands-on technical understanding of every stage of an ML project, implemented in Python (pandas, matplotlib, scikit-learn, TensorFlow, PyTorch) at a basic-to-intermediate level — always keeping the business perspective in view.
Formally, supervised learning fits a function $f_\theta : \mathcal{X} \to \mathcal{Y}$ by choosing parameters $\theta$ that minimise an empirical risk $\hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n} L\!\left(y_i, f_\theta(x_i)\right)$ over a training set of $n$ examples — a single idea that recurs, in different guises, through almost every session below.
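To make the empirical-risk formula concrete, here is a minimal sketch in plain Python. The linear model, the squared loss and the four data points are illustrative assumptions, not part of the syllabus; the function names are hypothetical.

```python
# Minimal sketch: empirical risk of a linear model under squared loss.
# Model, loss and toy data are illustrative assumptions.

def f_theta(theta, x):
    """Linear model f_theta(x) = w * x + b, with theta = (w, b)."""
    w, b = theta
    return w * x + b

def squared_loss(y, y_hat):
    """L(y, y_hat) = (y - y_hat)^2."""
    return (y - y_hat) ** 2

def empirical_risk(theta, xs, ys, loss=squared_loss):
    """Average loss over the n training examples."""
    n = len(xs)
    return sum(loss(y, f_theta(theta, x)) for x, y in zip(xs, ys)) / n

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]                        # generated by y = 2x + 1

risk_good = empirical_risk((2.0, 1.0), xs, ys)   # the generating parameters
risk_bad = empirical_risk((1.0, 0.0), xs, ys)    # a worse guess
print(risk_good, risk_bad)
```

Training amounts to searching for the `theta` that makes this number as small as possible; swapping `squared_loss` for another loss changes what "small" means.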
Learning objectives
The main objective is to introduce students to ML and build a framework for analysing the opportunities and challenges of applying it in business. Specifically, the course aims to:
- Contextual understanding. Acquire a contextual understanding of ML, its history and evolution, to make relevant predictions about its future trajectory.
- Strategic impact. Understand the profound, strategic changes ML introduces in technological and business environments, and appreciate ML as a key source of competitive advantage for firms.
- Solution design. Analyse the features and components of ML solutions, understand the approaches for designing and implementing them, and assess the challenges, difficulties and risks in their successful deployment.
- Application fit. Evaluate the appropriateness of a business application for prediction, optimization, natural language processing, robotics, computer vision and other emergent areas.
- Staying current. Know how to stay continuously updated on new trends and advances in the ML field.
Methodology & assessment
The course progresses from the most basic ML concepts to increasingly difficult problems, contrasting methods to explain when each is most appropriate. It is lecture- and example-based with group in-class discussions, following a tailored path for future AI Managers: from theory, to the tools and techniques that make implementation possible, to deployable business solutions.
Six teaching elements run through the term: lectures (theory, with on-time comprehension checks), examples / tutorials / cases (preparatory for assignments, with what-if interactive analysis), discussions (some announced in advance, requiring preparation), assignments (implement an algorithm in Python / Jupyter and analyse results), exams (one formal test plus practice "test-exams"), and group work (a final project presentation). Individual assignments are delivered in dynamically formed small groups; the final group project uses preference-based grouping. GenAI tools are allowed — with acknowledgement — for research, coding and exam practice, but never during the final exam.
Learning-activity weighting
Assessment weighting
Assessment components in detail
Test / exam. Deliverable: a concept-summary review examination written in class without GenAI. Evaluation: most questions are drawn from the practice "test-exam" pool. It is formally "not a final exam" in the continuous track: its score is added to the other items and carries no minimum passing grade.
Assignments. Deliverable: Python Jupyter notebooks implementing an algorithm end-to-end, with a written summary and analysis of results; review criteria are published per task. Evaluation: done jointly in dynamically formed small groups (no pair repeats); collaboration is enforced inside the group only. Some may run as a "bake-off" scored on held-out test data.
Participation. Deliverable: prepared, professional-level engagement in every session: questions, discussion and group interaction. Evaluation: a fixed 4 points per session; "come prepared as if it were a meeting in your company."
Group assignment. Deliverable: the final group project (MLF_22 onward), with preference-based grouping. Evaluation: applies the full pipeline to a business problem with explainability and interpretability in focus.
Group presentation. Deliverable: a presentation of the group project with discussion and Q&A. Evaluation: communication quality and depth of the work shown.
Program — 30 sessions
The schedule below mirrors the syllabus exactly: knowledge blocks integrate theory first, then practice/tutorials, then assignments of assorted complexity. Filled timeline dots mark hands-on practice/assignment sessions. Where a concept maps to a live demo, a demo ↗ tag links straight to it. Each session below carries the core formula or definition it rests on, plus a short key-idea takeaway. (The schedule is tentative; pace adapts to the group and to recent advances in the field.)
The opening block frames what intelligence and learning mean computationally, then walks the lifecycle of a real ML project from scoping through data preparation to feature engineering. The message: most of the work — and most of the risk — lives before any model is trained.
- Explain how learning systems differ from explicitly programmed ones, and why some problems are intractable.
- Lay out the stages of an ML project and the roles a cross-functional team needs.
- Diagnose and repair common data issues — missing values, outliers, leakage — and encode variables correctly.
- Construct, transform and select features, and reason about dimensionality.
Introduces the basic concepts of AI and how they apply to ML.
- Intelligence & knowledge representation: what it means for a system to "know" and act on knowledge.
- Basic algorithms & intractable problems: why brute-force search fails as problems scale.
- Heuristics: approximate strategies that trade guaranteed optimality for tractable runtime.
Key idea: a learner replaces an explicit program with an optimisation problem over examples.
Reading: Burkov Ch.1 sets terminology and the "what is ML engineering" framing for the whole course.
Exploration of the general pipeline for ML projects.
- Scope & goals: translating a business question into a measurable prediction target.
- Domain knowledge & teams: assembling the multidisciplinary skills a project needs.
Key idea: define success and feasibility before touching data — most ML projects fail in the framing, not the math.
Reading: Burkov Ch.2 — prioritising projects, estimating complexity, and team/role planning.
Analysis of typical data issues and how to fix them.
- Missing values & imputation: replacing gaps (e.g. with the mean/median) without distorting signal.
- Outliers & leakage: spotting anomalous points and stopping target information from leaking into features.
- Transformation: one-hot encoding and related methods to make data model-ready.
Key idea: data leakage — letting the model peek at the answer — is the most common cause of "too good" results that collapse in production.
Reading: Burkov Ch.3 on data quality, leakage and partitioning. Demo: see scaling change a model's decision boundary live.
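The imputation, encoding and leakage points above can be sketched together. This is a minimal plain-Python illustration under assumed column names (`age`, `city`) and toy rows; in the course notebooks the same pattern would normally be expressed with pandas and scikit-learn. The key assumption being demonstrated is that every statistic is learned from the training rows only and then reused on the test rows.

```python
from statistics import median

def fit_preprocessor(train_rows):
    """Learn the imputation value and category list from TRAINING rows only."""
    ages = [r["age"] for r in train_rows if r["age"] is not None]
    cities = sorted({r["city"] for r in train_rows})
    return {"age_median": median(ages), "cities": cities}

def transform(rows, prep):
    """Impute missing ages and one-hot encode city using the fitted statistics."""
    out = []
    for r in rows:
        age = r["age"] if r["age"] is not None else prep["age_median"]
        one_hot = [1 if r["city"] == c else 0 for c in prep["cities"]]
        out.append([age] + one_hot)
    return out

# Hypothetical toy data.
train = [
    {"age": 20, "city": "Madrid"},
    {"age": None, "city": "Segovia"},
    {"age": 40, "city": "Madrid"},
]
test = [{"age": None, "city": "Segovia"}, {"age": 90, "city": "Lisbon"}]

prep = fit_preprocessor(train)     # statistics come from the training split alone
X_train = transform(train, prep)
X_test = transform(test, prep)     # a category unseen in training maps to all zeros
```

Computing the median over train and test together would be a mild form of leakage: the test rows would influence the features the model is trained on.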
The concept of a feature, and how to build and select features.
- Extraction: deriving informative variables from raw data.
- Selection: keeping the features that matter (filter, wrapper, embedded methods).
- Dimensionality: complexity analysis and reduction to fight the curse of dimensionality.
Key idea: good features often beat fancy models; reducing dimensions can improve both speed and generalization.
Assignment (Data Preparation): clean, encode and feature-engineer a real dataset in a notebook. Reading: Burkov Ch.4. Demo: watch variance concentrate along principal axes.
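As a rough illustration of variance concentrating along a principal axis, the sketch below finds the first principal component of a small 2-D dataset with power iteration on the covariance matrix. The data points and function names are assumptions made for the example; a library routine such as scikit-learn's PCA would typically be used in practice.

```python
import math

def center(points):
    """Subtract the per-feature mean so the data is centred at the origin."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    return [[p[j] - means[j] for j in range(d)] for p in points]

def covariance(centered):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def matvec(m, v):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_component(cov, iters=200):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    v = [1.0] * len(cov)                      # arbitrary non-zero start
    for _ in range(iters):
        w = matvec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(cov, v)))  # Rayleigh quotient
    return v, eigenvalue

# Hypothetical points lying close to the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
cov = covariance(center(points))
direction, variance_along = top_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_ratio = variance_along / total_variance
print(direction, round(explained_ratio, 4))
```

Because the points are nearly collinear, almost all of the variance falls on one direction, so one derived feature can stand in for the two original ones with little loss.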
With clean data in hand, the course turns to the algorithm zoo: how to organise ML methods into families, how supervised models are actually trained by minimising a loss, and how to run a first end-to-end case from exploration to validation.
- Classify algorithms along the main axes (supervised/unsupervised, parametric/non-parametric, instance/model-based).
- Describe how loss functions and gradient descent drive training.
- Distinguish regression from classification and choose appropriate performance metrics.
- Recognise overfitting and reason about generalization on held-out data.
A taxonomy of the main families of ML algorithms and the principles behind them.
- Supervised vs unsupervised; parametric vs non-parametric.
- Instance-based vs model-based; manual feature extraction vs representational methods.
- Reflex vs state/variable-based models.
- General principles: loss functions and gradient descent.
Key idea: nearly every algorithm is "a model class + a loss + an optimiser"; the taxonomy just varies those three choices.
Demos: gradient descent shows the update rule converging; k-NN illustrates a non-parametric, instance-based learner with no training step.
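The "model class + loss + optimiser" framing can be sketched in a few lines: a linear model, mean squared error, and plain gradient descent. The learning rate, step count and toy data are assumptions chosen so the example converges; none of them come from the syllabus.

```python
def gradient(theta, xs, ys):
    """Gradient of the mean squared error of w*x + b with respect to (w, b)."""
    w, b = theta
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db

def gradient_descent(xs, ys, eta=0.05, steps=2000):
    """Repeat theta <- theta - eta * gradient, starting from zero."""
    theta = (0.0, 0.0)
    for _ in range(steps):
        dw, db = gradient(theta, xs, ys)
        theta = (theta[0] - eta * dw, theta[1] - eta * db)
    return theta

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # generated by y = 2x + 1
w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))
```

With a much larger `eta` the same loop would overshoot and diverge, and with a much smaller one it would need far more steps, which is the usual learning-rate tradeoff.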
General architecture and training methods for supervised models.
- Task analysis: regression (continuous target) vs classification (discrete label).
- Performance metrics appropriate to each task.
- Overfitting & generalization: fitting signal, not noise.
Key idea: the loss you choose is your definition of "good" — pick it to match the task and the cost of errors.
Reading: Burkov Ch.5–6 (Supervised Model Training, Parts 1–2). Demos: logistic regression fits a probability boundary; train/test split exposes overfitting.
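For the classification side, a minimal sketch of one-feature logistic regression trained by gradient descent on the mean cross-entropy loss might look as follows. The data, learning rate and step count are illustrative assumptions; the labels are deliberately not perfectly separable so that the weights stay bounded.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(theta, xs, ys):
    """Mean binary cross-entropy of the model sigmoid(w*x + b)."""
    w, b = theta
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)

def fit_logistic(xs, ys, eta=0.1, steps=500):
    """Gradient descent; the gradient of the mean loss averages (p - y) * x and (p - y)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        dw = sum(e * x for e, x in zip(errors, xs)) / n
        db = sum(errors) / n
        w, b = w - eta * dw, b - eta * db
    return w, b

xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]      # two points sit on the "wrong" side

loss_before = cross_entropy((0.0, 0.0), xs, ys)
w, b = fit_logistic(xs, ys)
loss_after = cross_entropy((w, b), xs, ys)
print(round(loss_before, 3), round(loss_after, 3))
```

The fitted model outputs a probability, so the decision boundary is wherever `sigmoid(w * x + b)` crosses the chosen threshold, 0.5 by default.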
An end-to-end practical case, from EDA to model training and validation.
- Exploratory data analysis through to validation.
- Comparing classical algorithms (k-NN, trees, linear models) on the same task.
Key idea: a clean, reproducible pipeline (EDA → split → fit → validate) matters more than any single algorithm.
Assignment (70 pts): deliver a notebook that runs the full classical-ML pipeline and compares models; may be run as a bake-off on held-out data. Demos: trees and k-NN as contrasting model families.
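As one of the classical baselines in such a comparison, k-NN can be sketched in a few lines of plain Python. The points, labels and the choice `k=3` are assumptions for illustration; `math.dist` requires Python 3.8 or later.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbours = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-cluster dataset.
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
train_y = ["low", "low", "low", "high", "high", "high"]

pred_a = knn_predict(train_X, train_y, (1.5, 1.5))
pred_b = knn_predict(train_X, train_y, (6.5, 6.5))
print(pred_a, pred_b)
```

There is no fitting step: the "model" is the stored training set, and all the work happens at prediction time.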
This block adds learning without labels (clustering, dimensionality reduction), confronts the messy realities of real data (imbalance, sampling bias, uncertainty), and shows how to compose and tune models into production-grade pipelines.
- Apply PCA and clustering, and measure unsupervised results with suitable metrics.
- Handle imbalanced classes and sampling bias, and distinguish epistemic from aleatoric uncertainty.
- Build pipelines, combine models with ensembles, and tune hyper-parameters systematically.
- Carry a business application end-to-end and explain the strategy behind it.
General architecture and algorithmic approach to learning without labels.
- PCA & clustering: the two main unsupervised workhorses.
- Metrics & uses: how to judge structure with no ground-truth labels.
- Similarity & entropy: functions that quantify distance and disorder.
Key idea: without labels, "good" means compact, well-separated structure — defined entirely by a chosen similarity measure.
Demos: k-means iterates centroid assignment live; PCA reduces dimensions while preserving variance.
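The centroid-assignment loop can be written out as a short sketch of Lloyd's algorithm. The points, the initial centroids and the fixed iteration count are assumptions; real implementations usually add smarter initialisation and a convergence check.

```python
import math

def kmeans(points, centroids, iters=10):
    """Lloyd's algorithm: alternate an assignment step and a centroid update."""
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical data with two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
```

Each iteration can only lower the within-cluster squared distance, which is why the procedure settles, though not necessarily at the global optimum.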
Tackling the messier realities of real-world data.
- Imbalanced classes: resampling, class weights and threshold tuning.
- Sampling biases: when the training sample misrepresents the population.
- Uncertainty: epistemic (reducible, model) vs aleatoric (irreducible, data noise).
Key idea: on imbalanced data, accuracy lies; choose metrics and sampling that reflect the cost of each error type.
Demo: the confusion matrix shows how precision and recall trade off as the decision threshold moves.
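The threshold tradeoff can be reproduced with a handful of scored examples. The labels and scores below are made up for illustration, and the helper names are hypothetical.

```python
def confusion(y_true, scores, threshold):
    """Return (tp, fp, fn, tn) when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall(y_true, scores, threshold):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    tp, fp, fn, _ = confusion(y_true, scores, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]

for t in (0.35, 0.5, 0.7):
    print(t, precision_recall(y_true, scores, t))
```

Raising the threshold from 0.35 to 0.7 trades recall for precision on this data; which end is preferable depends on the relative cost of false positives and false negatives.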
Dissection and discussion of sample solutions.
- Sample-solution walk-through: what a strong submission looks like.
- Common difficulties: the recurring traps and how to avoid them.
Key idea: reviewing graded work against a reference solution is one of the fastest ways to internalise good practice.
Composing and optimising models for production.
- Ensembles & pipelines: chaining transforms and combining many models.
- Hyper-parameter tuning: grid/random search and cross-validated selection.
- AutoML & Bayesian optimization of parameters.
Key idea: combining weak, diverse learners — and tuning them as one pipeline — usually beats hand-optimising a single model.
Demo: bias–variance shows why ensembles help — averaging cuts variance without adding much bias.
An end-to-end practical case of a business application.
- Main concepts in play for a realistic deployment.
- Implementation strategy for a successful rollout.
Key idea: a working business application is a chain of decisions — data, model, metric, deployment — each of which can sink the whole.
How do we know a model is actually good, and how do we keep it good once it leaves the notebook? This block covers rigorous validation, the metric landscape, and the realities of deployment and drift.
- Choose and run cross-validation strategies and read the bias–variance tradeoff.
- Select metrics across supervised/unsupervised tasks and multi-class/multi-label settings.
- Describe the model life-cycle and detect data and concept drift after deployment.
An extended analysis of validation methods.
- Cross-validation & leave-one-out: reusing data to estimate generalization.
- Consistency: stable performance across folds.
- Bias–variance tradeoff: underfit vs overfit.
Key idea: k-fold CV averages the score over $k$ train/validation splits, giving a lower-variance estimate of out-of-sample performance than a single hold-out split.
Reading: Burkov Ch.7 (Model Evaluation). Demos: bias–variance and train/test split make the decomposition tangible.
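A minimal sketch of k-fold cross-validation follows. The "model" is just the training mean and the score is mean squared error, both chosen as assumptions to keep the arithmetic visible; the data is left unshuffled on purpose.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs; every index is validated exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

def cross_val_scores(ys, fit, score, k):
    """Fit on k-1 folds, score on the held-out fold, and collect the k scores."""
    scores = []
    for train, val in k_fold_indices(len(ys), k):
        model = fit([ys[i] for i in train])
        scores.append(score(model, [ys[i] for i in val]))
    return scores

def fit_mean(train_ys):
    """Toy model: always predict the training mean."""
    return sum(train_ys) / len(train_ys)

def mse(prediction, val_ys):
    return sum((y - prediction) ** 2 for y in val_ys) / len(val_ys)

ys = [1, 2, 3, 4, 5, 6]            # ordered, unshuffled toy targets
scores = cross_val_scores(ys, fit_mean, mse, k=3)
cv_estimate = sum(scores) / len(scores)
print(scores, cv_estimate)
```

The spread between the fold scores is itself informative: here the outer folds score much worse than the middle one because the data is ordered, which is one reason real pipelines usually shuffle or stratify before splitting.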
Review and comparison of performance metrics across paradigms.
- Supervised vs unsupervised metrics.
- Accuracy, $F_\beta$, sensitivity & specificity.
- Multi-class & multi-label generalization.
Key idea: there is no universal metric — the right one encodes the relative cost of false positives vs false negatives for your problem.
Demos: confusion matrix for threshold-level metrics; ROC/AUC for threshold-independent ranking quality.
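Two of these metrics are short enough to write out directly. The sketch below computes AUC as the fraction of positive/negative pairs ranked correctly (ties count one half) and the $F_\beta$ score from a precision/recall pair; the scores and labels are invented for illustration.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean; beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]

auc = roc_auc(y_true, scores)
print(auc, f_beta(1.0, 0.5, beta=2.0), f_beta(1.0, 0.5, beta=0.5))
```

AUC never looks at a threshold, only at the ordering of scores, whereas $F_\beta$ is computed at one specific operating point; the pairwise loop is quadratic and is meant for explanation rather than large datasets.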
Taking a model beyond the notebook and keeping it healthy.
- Model life-cycle: from serving to retraining and retirement.
- Data drift & concept drift: when the world shifts under a trained model.
Key idea: a deployed model is a living system — performance decays as distributions drift, so monitoring is part of the design.
Reading: Burkov Ch.8 (Model Deployment) & Ch.9 (Serving, Monitoring & Maintenance).
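One simple way to monitor a single feature for data drift is to compare its live distribution with a reference sample from training time. The sketch below uses the two-sample Kolmogorov-Smirnov statistic as one possible choice; the samples and the alert level `DRIFT_THRESHOLD` are hypothetical, and this is not presented as the method the course or the textbook prescribes.

```python
from bisect import bisect_right

DRIFT_THRESHOLD = 0.3   # hypothetical alert level, to be tuned per feature

def ks_statistic(reference, live):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, liv = sorted(reference), sorted(live)

    def ecdf(sample, v):
        # Fraction of the sample that is <= v.
        return bisect_right(sample, v) / len(sample)

    values = sorted(set(ref) | set(liv))
    return max(abs(ecdf(ref, v) - ecdf(liv, v)) for v in values)

reference = list(range(10))                 # feature values seen at training time
live_same = list(range(10))                 # production data, unchanged
live_shifted = [x + 5 for x in reference]   # production data after a shift

stat_same = ks_statistic(reference, live_same)
stat_shifted = ks_statistic(reference, live_shifted)
print(stat_same, stat_shifted, stat_shifted > DRIFT_THRESHOLD)
```

A check like this only sees the inputs, so it can flag data drift but not concept drift; detecting the latter needs fresh labels and a tracked performance metric.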
The deep-learning block builds the neural network from the perceptron up: forward and backward passes, optimisation and regularization, representation learning, and a first look at agents that learn from reward.
- Explain forward/backward propagation and how automatic differentiation computes gradients.
- Apply activation functions, optimizers and regularizers like dropout.
- Reason about representation learning and well-known architectures.
- State the RL problem and the Bellman optimality equation at an introductory level.
Introduction to the perceptron and multi-layer perceptron.
- Mathematical foundation of a neural network.
- Forward & backward propagation; automatic differentiation.
- Gradient descent variants (momentum, Adam) and key hyperparameters.
- Dropout and other regularization techniques.
Key idea: depth + non-linear activations let networks compose simple functions into rich ones; backprop makes training them tractable.
Demo: gradient descent is exactly the optimiser that trains these networks, one weight update at a time.
How networks learn useful representations of data.
- Representation-learning through practical examples.
- Representational models and learned embeddings.
- Well-known architectures (CNNs, RNNs, transformers) and their uses.
Key idea: deep learning's power is automatic feature engineering — the network discovers the representation that PCA only approximates linearly.
Demo: PCA as a linear baseline for the non-linear representations a network learns.
How to code a neural network from the building blocks.
- Tensors & gradient tapes: the core abstractions in TensorFlow/PyTorch.
- Activations & optimizers: assembling a trainable model.
- Design principles for layer/width/depth choices.
Key idea: a "gradient tape" records operations so the framework can autodiff the loss — you specify the forward pass, it gives you the gradients.
Assignment (70 pts): implement and train a neural network in a notebook (tensors, optimizer, activations) and analyse the results.
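To show what a gradient tape does conceptually, here is a toy reverse-mode automatic differentiation sketch in plain Python. It is not the TensorFlow or PyTorch API: the `Var` class and `backward` function are hypothetical names invented for this illustration, and only addition, multiplication and tanh are supported.

```python
import math

class Var:
    """A scalar that remembers how it was computed (its 'tape' entry)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents        # tuples of (parent Var, local derivative)

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def tanh(self):
        t = math.tanh(self.value)
        return Var(t, ((self, 1.0 - t * t),))

def backward(output):
    """Visit the recorded graph from the output backwards, applying the chain rule."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)        # parents are appended before children

    visit(output)
    output.grad = 1.0
    for node in reversed(order):      # children are processed before parents
        for parent, local in node.parents:
            parent.grad += local * node.grad

# Forward pass of a single neuron: y = tanh(w * x + b).
w, x, b = Var(0.5), Var(2.0), Var(-1.0)
y = (w * x + b).tanh()
backward(y)
print(y.value, w.grad, x.grad, b.grad)
```

Only the forward pass is written by hand; the gradients fall out of the recorded operations, which is the division of labour the frameworks automate at tensor scale.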
Intelligent agents that learn from the state-action-reward paradigm.
- Temporal differences and the Bellman optimality equation.
- From value-function iteration to policy-gradient algorithms.
- A practical example case of use.
Key idea: RL replaces a labelled dataset with a reward signal — the agent learns by acting and observing consequences over time.
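The Bellman optimality equation, $V^*(s) = \max_a \left[ r(s,a) + \gamma V^*(s') \right]$ for deterministic transitions, can be solved by value iteration on a tiny problem. The three-state chain, its rewards and the discount factor below are assumptions invented for this sketch.

```python
GAMMA = 0.9
STATES = [0, 1, 2]              # state 2 is terminal
ACTIONS = ["stay", "right"]

def step(state, action):
    """Deterministic toy dynamics: return (next_state, reward)."""
    if action == "right":
        nxt = state + 1
        return nxt, (1.0 if nxt == 2 else 0.0)   # reward only on reaching the goal
    return state, 0.0

def q_value(V, state, action):
    """One-step lookahead: immediate reward plus discounted value of the next state."""
    nxt, reward = step(state, action)
    return reward + GAMMA * V[nxt]

def value_iteration(tol=1e-10):
    """Apply the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES[:-1]:                    # the terminal state keeps value 0
            best = max(q_value(V, s, a) for a in ACTIONS)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

V = value_iteration()
policy = {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES[:-1]}
print(V, policy)
```

Value iteration needs the transition model (`step`) to be known; temporal-difference methods estimate the same quantities from sampled experience instead.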
The closing block surveys the research frontier — sequential models, transfer and contrastive learning, generative models — then consolidates best practice, examines real-world risks, and runs the final assessment and group presentations.
- Model sequential/time-series data and explain the move from classical methods to deep models.
- Use transfer, fine-tuning and contrastive learning to reuse knowledge across tasks.
- Reason about generative models (GANs, diffusion) and about explainability and risk.
- Deliver and present an end-to-end ML business project.
Modeling data where order and time matter.
- Sequential feature engineering: lags, windows, seasonality.
- Basic techniques and their evolution toward deep models.
- Example-based comparison of approaches.
Key idea: sequence models exploit temporal dependence that i.i.d. methods throw away — order is information.
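Lag and rolling-window features can be built with a short helper. The series, the number of lags and the window length are illustrative assumptions; pandas offers `shift` and `rolling` for the same purpose on real data.

```python
def make_lag_features(series, n_lags=2, window=3):
    """Build (features, target) rows where features use only values before time t."""
    rows = []
    start = max(n_lags, window)
    for t in range(start, len(series)):
        lags = [series[t - k] for k in range(1, n_lags + 1)]   # t-1, t-2, ...
        rolling_mean = sum(series[t - window:t]) / window      # excludes series[t]
        rows.append((lags + [rolling_mean], series[t]))
    return rows

series = [10, 12, 14, 13, 15, 17]      # hypothetical observations, oldest first
rows = make_lag_features(series)
for features, target in rows:
    print(features, "->", target)
```

Because every feature is computed from values strictly before `t`, the rows can be fed to any tabular model without leaking the target, provided the train/test split also respects time order.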
Reusing learned knowledge across tasks.
- Shallow training: principles and concepts.
- Fine-tuning a pretrained model on a new task.
- Multi-task & transfer perspectives.
Key idea: representations learned on huge datasets transfer — fine-tuning beats training from scratch when target data is scarce.
Applying the solutions explored to different types of problems.
- Explainability & interpretability: understanding why a model predicts what it does.
- End-to-end case analysis.
- ML/DL business-function application.
Key idea: in business, an unexplainable model is often an unusable one — interpretability is a deployment requirement, not a luxury.
Group assignment (70 pts): launch the final group project — apply the full pipeline to a business problem with explainability in focus.
Exploration of some of the novel architectures in this field.
- Contrastive learning & cross-embeddings and their applications.
- Text-to-image and text-to-video.
- GANs & diffusion models — a brief analysis.
Key idea: generative models learn the data distribution itself, enabling synthesis — a different goal from the discriminative models of earlier modules.
Examination practice (40 pts): a "test-exam" assignment — scored on number attempted, average result and improvement; primes you for the formal exam format.
Advanced practice recap consolidating the deep-learning material.
- Recap & reinforcement of advanced practical techniques.
Key idea: deliberate, repeated practice on harder cases is what converts familiarity into fluency.
A high-level exploration of business cases and applications.
- Applications across industries (finance, healthcare, education, more).
Key idea: the same handful of algorithms recur across industries — what changes is the data, the metric, and the cost of being wrong.
Dissection and discussion of sample solutions.
- Sample-solution walk-through.
- Main difficulties and how to overcome them.
Key idea: a second structured review, now on advanced material, to lock in the deep-learning workflow before assessment.
Where the field is heading and what could go wrong.
- Current research and expected breakthroughs.
- Risks: from adversarial attacks to mesa-optimization misalignment.
Key idea: capability and risk grow together; responsible deployment means anticipating attacks, drift and misalignment.
Reading: Burkov Ch.10 (Conclusion) — where ML engineering is heading.
Concept-summary review examination.
- Not a final exam — its score is added to the other evaluation items.
- No minimum passing grade required; most questions reused from the practice test-exams.
Key idea: the formal test rewards consistent practice — questions are drawn largely from the test-exams you have already rehearsed.
Test / exam (180 pts): written in class, no GenAI; the single largest point item, feeding the 30% final-exam weight.
Presentations, discussion and Q&A of group projects.
- First round of group-project presentations.
Key idea: presenting work clearly is itself a graded ML-manager skill — the syllabus weights communication explicitly.
Group presentation (50 pts): present the project with live discussion and Q&A; feeds the 8% presentation weight.
Final presentations and course wrap-up.
- Second round of group-project presentations, discussion and Q&A.
- Course wrap-up and synthesis.
Key idea: the wrap-up ties the 30 sessions back to the single thread — turning data into reliable, accountable decisions.
Key concepts — glossary
A compact reference for the terms and formulas that recur across the program. Each entry pairs a one-line definition with the symbol or expression used in the sessions above.
- Supervised learning
- Fitting $f_\theta:\mathcal{X}\to\mathcal{Y}$ from labelled pairs $(x_i,y_i)$ to predict labels on new inputs.
- Unsupervised learning
- Finding structure (clusters, low-dim factors) in data with no labels.
- Reinforcement learning
- An agent learns a policy by maximising cumulative reward through interaction with an environment.
- Loss function $L$
- A scalar measuring prediction error; training minimises its average over the data.
- Empirical risk
- The average loss on the training set, $\frac{1}{n}\sum_i L(y_i,f_\theta(x_i))$ — the quantity optimisers descend.
- Gradient descent
- Iterative update $\theta \leftarrow \theta - \eta\,\nabla_\theta \hat{R}(\theta)$ that moves the parameters downhill on the empirical risk.
- Learning rate $\eta$
- Step size in gradient descent; too large diverges, too small crawls.
- Overfitting
- Modelling noise in the training data so that test performance suffers.
- Generalization
- How well a model performs on unseen data drawn from the same distribution.
- Bias–variance tradeoff
- Expected squared error decomposes into squared bias (underfit), variance (overfit) and irreducible noise.
- Cross-validation
- Estimating generalization by rotating which data folds serve as train vs validation.
- Regularization
- Penalising model complexity (e.g. $L_2$, dropout) to curb variance and overfitting.
- Feature engineering
- Creating, transforming and selecting input variables to make patterns learnable.
- One-hot encoding
- Representing a categorical value as a binary indicator vector.
- Data leakage
- Target information sneaking into features, inflating offline scores that then collapse live.
- PCA
- Linear dimensionality reduction onto directions of maximum variance (top eigenvectors of the data covariance matrix $\Sigma$).
- k-means
- Clustering that minimises within-cluster squared distance to $K$ centroids.
- Cross-entropy $\mathcal{L}_{\text{CE}}$
- The standard classification loss, $-\sum_{k} y_k\log\hat{p}_k$ over classes $k$, measuring the mismatch between the true labels and the predicted probabilities.
- Precision / recall
- Correctness among predicted positives vs coverage of actual positives; combined by $F_\beta$.
- ROC / AUC
- Curve of true- vs false-positive rate across thresholds; AUC summarises ranking quality.
- Ensemble
- Combining multiple models (bagging, boosting, stacking) to reduce error.
- Hyper-parameter
- A setting fixed before training (e.g. depth, $\eta$, $K$), tuned by search or Bayesian optimization.
- Backpropagation
- The chain rule applied through a network to compute loss gradients efficiently.
- Activation function
- Element-wise non-linearity (sigmoid, ReLU) that gives networks expressive power.
- Concept / data drift
- A shift after deployment in the input distribution (data drift) or in the relationship between inputs and target (concept drift), degrading a live model.
- Transfer learning
- Reusing a model pretrained on one task by fine-tuning it on a related target task.
- Generative model
- A model of the data distribution that can synthesise new samples (GANs, diffusion).
- Epistemic vs aleatoric
- Reducible model uncertainty vs irreducible noise inherent in the data.
Bibliography
The program follows one applied text as its backbone, with a concise companion for quick reference. Each entry is annotated with what it offers and which sessions draw on it.
Andriy Burkov (2020). Machine Learning Engineering. True Positive Inc. ISBN 978-1-9995795-7-9 (Digital).
The most complete applied AI book out there, filled with best practices and design patterns for building reliable, scalable ML solutions. Burkov holds a Ph.D. in AI and led a machine-learning team at Gartner; the book draws on 15 years of solving problems with AI.
Maps to: Ch.1 → S1 · Ch.2 → S2 · Ch.3 → S3 · Ch.4 → S4 · Ch.5–6 → S6 · Ch.7 → S13 · Ch.8–9 → S15 · Ch.10 → S27.
Andriy Burkov (2019). The Hundred-Page Machine Learning Book. ISBN 1-9995795-0-X (Digital).
A successful effort to reduce all of machine learning to 100 pages — well-chosen topics across theory and practice, a solid introduction for practitioners.
Maps to: a concise companion for the algorithm and training material of Modules II–V (S5–S19) — useful quick reference before assignments.