AI: Chatbots & Recommendation Engines
The complete course structure — every module and all 30 sessions, with objectives, the core mathematics, key ideas and annotated readings. A study companion to the interactive lab of 25+ live demos.
Course overview
Recommendation engines reshaped entire industries by cutting search costs and improving the user experience: with the sheer variety of products, films and music available today, customers would have little chance of finding the right item without search & recommendation engines. Chatbots, meanwhile, have become a primary channel for interacting with customers, and the rise of LLMs such as ChatGPT, Bing Chat and Bard has opened a new era of interactive search.
This course introduces recommender systems and chatbots, reviews real industry examples in detail, and teaches students to handle, apply and evaluate recommendation and chatbot methods. It covers the theoretical foundations alongside applied, methodological skills. By the end, students can build an end-to-end recommendation solution in Python at the level expected in a large company, and can develop and deploy a working chatbot.
Instructor
Miguel González-Fierro — Principal Data Science Manager at Microsoft Spain.
Electrical Engineer (UC3M), PhD in Robotics (UC3M in collaboration with King's College London), and
graduate of the MIT Sloan School of Management. Former CEO/founder of Samsamia Technologies (visual
search engine for fashion) and founder of the Robotics Society of UC3M.
mgonzalezfierro@faculty.ie.edu · office hours on request
Prerequisites & how this course connects
This is a third-year BCSAI course that assumes you arrive fluent in a few things and pays you back by tying them together:
- Linear algebra — vectors, dot products, norms and matrix multiplication underpin cosine similarity, matrix factorization ($R\approx UV^{T}$) and attention ($QK^{T}$).
- Probability & statistics — expectations, distributions and Bayes' rule power the Bayesian average, bandit exploration and the probabilistic view of ranking.
- Calculus / optimization — gradients and gradient descent are how MF, factorization machines and neural recommenders are actually trained.
- Machine-learning foundations — regression, classification, train/test discipline, regularization and overfitting carry over directly; Session 13 reframes recommendation as supervised learning.
- Intermediate Python — NumPy/pandas and clean code; the engineering module (S8–S9) raises this to production standard.
It connects forward to NLP, deep learning and MLOps courses: the transformer and embedding machinery in Module 5 is the same machinery behind modern language models, and the serving/monitoring ideas in Session 14 generalize to any ML system.
Learning objectives
By the end of the course, students will be able to:
- Understand the foundations. Explain the theoretical basis of recommender systems and chatbots and situate recommendation within the wider machine-learning taxonomy.
- Handle and apply methods. Select, implement and tune the major recommendation paradigms — non-personalized, content-based, collaborative filtering, matrix factorization, hybrid and context-aware.
- Evaluate rigorously. Measure systems with regression, classification, ranking and beyond-accuracy metrics, choose models, and operationalize them responsibly.
- Build production-grade solutions. Engineer an end-to-end recommendation pipeline in Python at the level required in a large company, following sound development and MLOps practice.
- Develop a chatbot. Implement and deploy a working chatbot, including modern LLM-based and retrieval-augmented approaches.
- Think critically & ethically. Reason about bias, fairness and feedback loops, and the societal impact of deployed recommendation and conversational systems.
Teaching methodology & assessment
IE University's method is collaborative, active and applied: students participate throughout to build knowledge and sharpen skills, while the professor leads and guides. The total workload is 150 hours across the following activity mix.
Learning-activity weighting
Estimated effort: lectures 30h · discussions 20h · exercises 20h · group work 60h · individual study 20h.
Assessment weighting
Group project — 35%
The flagship deliverable: an end-to-end recommendation (and/or chatbot) solution in Python, defended across three checkpoints — project discussion (S6–S7), mid-term presentation (S15–S16) and final presentation (S24–S25). Mirrors how a real data-science team scopes, builds and ships.
Deliverable format: a clean GitHub repository (data/src/notebooks/tests + README, reproducible from a fresh clone) plus a slide deck defended live at each checkpoint.
Rubric (indicative): problem framing & baseline 20% · modelling soundness 25% · evaluation rigor (honest, task-matched metrics) 25% · engineering/reproducibility 15% · communication of the data→model→eval→product story 15%.
Tips: beat a simple baseline before chasing a fancy model; report where the model fails; keep the repo runnable.
Final exam — 30%
Comprehensive written exam (S30) over theory and methods: paradigms, similarity, matrix factorization, evaluation metrics, bias, and chatbot/LLM fundamentals.
Format: closed/comprehensive written exam; expect definitions, short derivations (e.g. compute cosine similarity, DCG/NDCG or an MF update by hand) and conceptual "which method and why" questions.
Tips: revise the formulas in each session's box and be able to reproduce one worked example per topic; the exam rewards connections between topics, not isolated definitions.
Individual contributions — 25%
Each student's measurable, individual share of the group work and exercises — assessed separately from the team grade so that real personal effort is rewarded.
Evidence: attributable Git commits/PRs, owned modules, and the contribution plan agreed in S7.
Tips: commit under your own identity with clear messages; own a coherent slice (e.g. evaluation, or the data pipeline) end-to-end rather than scattering tiny edits.
Individual participation — 10%
Active, prepared engagement in lectures and discussions across the semester.
Assessed on: quality (not just quantity) of contributions, evidence you did the reading, and constructive peer feedback during the project-discussion and presentation sessions.
Tips: come with one prepared question per reading; engaging with other teams' presentations counts.
Full program — 30 sessions
All thirty sessions are live and in-person. They are grouped below into six thematic modules. Each module opens with an overview and intended learning outcomes; every session is a timeline item with its objective, topic explanations, core formulas/definitions where technical, a key-idea callout, annotated readings, and links to the matching interactive demos.
Foundations & the recommendation problem
What recommender systems and chatbots are, how the data behaves, the families of algorithms, how we evaluate them, and scoping the group project. This module builds the conceptual map for everything that follows.
Module learning outcomes
- Frame recommendation as learning a utility function and distinguish it from classification/regression.
- Characterize recommendation data: explicit vs implicit feedback, sparsity and the cold-start problem.
- Compare the five recommender families and know when each applies.
- Select appropriate evaluation metrics and an operationalization strategy.
- Scope a realistic end-to-end group project.
Course logistics, organization & intro orientation
Objective: understand how the course runs, the assessment structure and the group-project expectations.
- Syllabus & assessment map: the 35/30/25/10 split, the three project checkpoints, attendance and re-sit rules.
- GenAI policy: use is encouraged but must be critical; verify, refine and acknowledge it.
- Team formation & tooling: forming project groups and previewing the Python/GitHub stack used later.
- Read · Aggarwal, Recommender Systems: The Textbook, ch. 1 (an introduction to recommender systems) to set context.
Introduction to recommendation & chatbots
Objective: define the recommendation problem formally and survey where recommenders and chatbots create business value.
- The utility function: a recommender learns a score $g(u,i)$ for every (user, item) pair and, for each user, returns the highest-scoring items. Everything in the course — neighborhoods, latent factors, neural nets — is a different way to estimate this one function. Intuitively, $g$ encodes "how much would user $u$ like item $i$?" and the whole pipeline exists to fill in the unobserved entries of a giant, mostly-empty table.
- Rating vs ranking: predicting a numeric utility (regression, e.g. "4.2 stars") is a different task from producing the right ordered list (ranking). A model can have great RMSE yet rank badly, because users only ever see the top of the list — which is why ranking metrics (Session 5) often matter more than rating error.
- Business framing: recommenders reduce search cost, improve UX and surface the long tail (the many niche items no human could browse); chatbots add an interactive search/serving channel. Concretely, by widely cited industry estimates, roughly a third of Amazon purchases and the majority of Netflix viewing originate from recommendations.
- Read · Aggarwal ch. 1 — the goals of recommender systems and the rating-vs-ranking distinction. Skim · Li et al., Frontiers and Practices intro for the industry view (why reco is a revenue lever, not a feature).
Data in recommendation systems
Objective: understand the raw material of a recommender — feedback signals, their structure and their pathologies.
- Explicit feedback: deliberate signals such as ★ ratings, thumbs and likes. The meaning is unambiguous (a 5 really means "loved it"), but users rate only a tiny fraction of what they consume, so explicit data is precious and scarce — and biased toward people who bother to rate.
- Implicit feedback: behavioural traces — clicks, views, dwell time, add-to-cart, purchases, skips. Abundant and cheap, but ambiguous: a click can be a misclick, and a non-click can mean "unseen" rather than "disliked". A common modelling trick is to treat interactions as positive with a confidence that grows with the count (e.g. $c_{ui}=1+\alpha r_{ui}$ in implicit-ALS).
- Sparsity & the long tail: the user-item matrix is typically >99% empty, and interactions follow a power law — a few blockbuster items get most of the feedback while the long tail gets almost none. Worked intuition: 10⁶ users × 10⁵ items = 10¹¹ cells, but if each user touches ~100 items only ~10⁸ are filled — density ≈ 0.1%.
- Missing-not-at-random (MNAR): the entries you observe are not a random sample — users choose what to rate, and the system chose what to show. So absence of a rating is not a negative label, and naive "treat blanks as 0" training systematically distorts the model.
- Read · Aggarwal ch. 2 (rating types, neighborhood data); Concept · implicit-feedback & confidence weighting in Li et al. (Hu, Koren & Volinsky's implicit-ALS is the canonical reference).
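The sparsity arithmetic and the confidence weighting above can be checked in a few lines. This is a minimal sketch: the matrix sizes are the illustrative ones from the worked intuition, and `alpha=40.0` is a hypothetical value chosen only for the example.

```python
# Density of a user-item matrix, using the illustrative sizes from the text.
n_users = 10**6
n_items = 10**5
interactions_per_user = 100  # assumed average, as in the worked intuition

total_cells = n_users * n_items
filled_cells = n_users * interactions_per_user
density = filled_cells / total_cells


def confidence(r_ui, alpha=40.0):
    """Implicit-ALS style confidence c_ui = 1 + alpha * r_ui.

    alpha=40.0 is a hypothetical setting, used here only for illustration.
    """
    return 1.0 + alpha * r_ui


print(f"density = {density:.4%}")
# An unobserved pair keeps the minimum confidence of 1; repeated
# interactions raise the confidence linearly with the count.
print(confidence(0), confidence(3))
```

Note that an unobserved pair still gets confidence 1 rather than 0: the model treats it as a weak, uncertain negative instead of ignoring it.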
Recommendation-systems algorithms overview
Objective: map the landscape of recommender families and their trade-offs before going deep on any one.
- Non-personalized: the same list for everyone — random, or ranked by popularity / average rating. Trivial to build and a surprisingly strong baseline; the right fix for sparse averages is a Bayesian average that shrinks low-count items toward the global mean: $\bar r_i=\frac{C\mu+\sum_j r_{ij}}{C+n_i}$, so a 5.0 from two ratings doesn't outrank a 4.6 from a thousand.
- Content-based: build a profile from the features of items a user liked, then score new items by feature match. Handles brand-new items (no interactions needed) and is explainable ("because you watched sci-fi"), but over-specializes into a filter bubble and can't surprise the user.
- Collaborative filtering: learn purely from feedback patterns across users — "people like you liked this" — with no item metadata. Powerful and serendipitous, but cold-start-prone (needs history) and hurt by sparsity.
- Hybrid: combine the above to cancel each other's weaknesses — weighted (blend scores), switching (use CB when CF lacks data), or mixed (present both). Most production systems are hybrids.
- Context-aware (CARS): extend the pair to $g:U\times I\times C\to\mathbb{R}$ with context such as time, location or device, via pre-filtering, post-filtering or full contextual modelling (e.g. "lunch spots at noon, near me").
- Read · Aggarwal ch. 1 §1.3 (taxonomy) — keep this map handy all term; ch. 4 (content-based) & ch. 2 (neighborhood CF) as previews.
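The Bayesian average in the non-personalized bullet is easy to try by hand. The sketch below assumes a global mean of 3.5 and a prior weight of 10; both numbers are hypothetical and would normally be estimated from the data.

```python
def bayesian_average(ratings, global_mean, prior_weight):
    """(C * mu + sum of ratings) / (C + n), with C = prior_weight."""
    n = len(ratings)
    return (prior_weight * global_mean + sum(ratings)) / (prior_weight + n)


mu = 3.5  # hypothetical global mean rating
C = 10    # hypothetical prior weight ("pseudo-ratings" at the global mean)

niche = [5.0, 5.0]       # a 5.0 average from two ratings
popular = [4.6] * 1000   # a 4.6 average from a thousand ratings

niche_score = bayesian_average(niche, mu, C)
popular_score = bayesian_average(popular, mu, C)

# The two-rating item is pulled strongly toward the global mean,
# so it no longer outranks the well-established item.
print(round(niche_score, 3), round(popular_score, 3))
```

A larger `C` shrinks harder; as the number of ratings grows, the score converges to the item's plain average.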
Evaluation, model selection & operationalization
Objective: learn how to judge a recommender offline, choose between models, and reason about serving it in production.
- Regression metrics: MAE, MSE, RMSE and R² for rating prediction. RMSE squares errors, so it punishes a few large mistakes harder than MAE; use it when big misses are costly.
- Classification metrics: precision, recall, F1, accuracy and the ROC curve with its AUC for the relevant-vs-not view. In top-K reco, precision@K = (relevant in top K)/K answers "of what I showed, how much was good?", while recall@K answers "of all good items, how much did I surface?".
- Ranking metrics: CG, DCG, NDCG, MRR and Precision@K — here order is everything because users scan from the top down. NDCG discounts gains by log-position and normalizes against the ideal ordering, giving a score in $[0,1]$.
- Beyond accuracy: coverage (what fraction of the catalog ever gets shown), diversity (intra-list similarity), novelty and personalization — dimensions a pure accuracy number is blind to.
- Data splits: random, per-user stratified, and time-based (train on the past, test on the future — the only honest split for a deployed system); offline metrics vs the online A/B-test reality.
- Read · Aggarwal ch. 7 (evaluating recommender systems); Concept · NDCG & offline/online evaluation in Li et al.
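The top-K and ranking metrics from this session can be sketched as follows. The item ids and graded gains are made up for illustration, and the NDCG here takes the ideal ordering from the gains of the list being scored, which is a common simplification rather than the only convention.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Share of the top-k shown items that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Share of all relevant items that made it into the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def dcg_at_k(gains, k):
    """Discounted cumulative gain: gain / log2(position + 1), positions from 1."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains[:k]))


def ndcg_at_k(gains, k):
    """DCG divided by the DCG of the same gains sorted in the ideal order."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


ranked = ["a", "b", "c", "d", "e"]   # hypothetical recommended list
relevant = {"a", "c", "f"}           # hypothetical ground truth
gains = [3, 2, 3, 0, 1]              # hypothetical graded relevance per position

p3 = precision_at_k(ranked, relevant, 3)
r3 = recall_at_k(ranked, relevant, 3)
ndcg5 = ndcg_at_k(gains, 5)
print(p3, r3, round(ndcg5, 4))
```

Swapping two items in `gains` changes NDCG but leaves precision@K untouched whenever both stay inside the top K, which is the practical difference between a set metric and a ranking metric.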
Project discussion — presentation of projects
Objective: present and refine group-project proposals — problem, dataset, baseline and success metric.
- Problem & dataset: who are the users, what are the items, what feedback is available (explicit/implicit)?
- Baseline & metric: choose a non-personalized baseline (popularity / Bayesian average) and the single metric you will improve — matched to the task (rating vs ranking) per Session 5.
- Peer feedback: tighten scope and de-risk early.
Project discussion — presentation of projects
Objective: complete proposal presentations and lock the project plan and team responsibilities.
- Remaining presentations and consolidated feedback.
- Work plan: milestones aligned to the mid-term (S15–16) and final (S24–25) checkpoints.
- Individual contribution plan: who owns what (feeds the 25% contributions grade).
Engineering practice & the product lens
The professional craft of building recommender/chatbot systems: project and repository setup, Python engineering practice, and a product-management perspective from an invited guest.
Module learning outcomes
- Set up a reproducible Python project with version control and a clean structure.
- Apply software-engineering practices (testing, modularity, environments) to ML code.
- Connect technical choices to product strategy and stakeholder needs.
Development practices — set up project & GitHub lab
Objective: stand up a reproducible project skeleton under version control.
- Repository structure: data / src / notebooks / tests separation; README and config.
- Git & GitHub workflow: branches, commits, pull requests, code review.
- Environments: virtual environments and pinned dependencies for reproducibility.
Development practices — Python practices lab
Objective: write clean, testable Python for data and ML workloads.
- Idiomatic Python & the data stack: NumPy / pandas vectorization, avoiding hidden loops.
- Modularity & testing: functions over scripts, unit tests, type hints, linting.
- Performance basics: sparse matrices for large user-item data.
Invited guest — product management guest
Objective: see how recommendation/chatbot features are prioritized and shipped in industry.
- From metric to roadmap: turning model gains into product decisions.
- Experimentation culture: A/B tests, guardrail metrics, and knowing when offline wins don't ship.
- Stakeholders: aligning data science, engineering and business.
Similarity, matrix factorization & ML at scale
The core recommendation algorithms: similarity-based neighborhood methods, latent-factor models (matrix factorization & factorization machines), applying general ML, and the MLOps that puts them in production — bracketed by the mid-term project checkpoint.
Module learning outcomes
- Implement user-based and item-based KNN collaborative filtering with cosine similarity.
- Train and reason about matrix-factorization / SVD models with biases and regularization.
- Cast recommendation as a general supervised-learning problem (e.g. factorization machines).
- Apply MLOps practices: pipelines, monitoring, retraining and deployment.
Similarity-based methods
Objective: build memory-based collaborative filtering from similarity between users or items.
- Cosine similarity: the cosine of the angle between two rating/feature vectors. It measures direction (taste pattern) and ignores magnitude, so a generous and a stingy rater with the same pattern still look similar. Ranges $[0,1]$ for non-negative vectors. Worked example: $A=[5,0,3],\,B=[4,0,2]$ ⇒ $A\!\cdot\!B=26$, $\lVert A\rVert=\sqrt{34}\approx5.83$, $\lVert B\rVert=\sqrt{20}\approx4.47$, so $\cos\theta=26/\sqrt{680}\approx\mathbf{0.997}$, i.e. near-identical taste.
- Pearson correlation: cosine on mean-centred ratings — subtracts each user's average first, which corrects for rating-scale bias and is often preferred for explicit ratings.
- User-based CF: find users whose history is similar to yours, then recommend what they liked. Intuitive but the user-user matrix shifts constantly and is expensive to keep fresh.
- Item-based CF: find items similar to those you already liked. Item-item relationships are far more stable over time and can be precomputed offline, which is why Amazon's classic engine is item-based.
- KNN prediction: a similarity-weighted average of the neighbours' ratings — closer (more similar) neighbours count more.
- Read · Aggarwal ch. 2 (neighborhood-based CF — derivations of user/item similarity and prediction); Concept · similarity functions in Li et al. Classic: Sarwar et al., "Item-Based Collaborative Filtering".
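Cosine similarity and the KNN prediction rule can be written directly from their definitions. This is a sketch on the vectors of the worked example; the neighbour similarities and ratings passed to `knn_predict` are hypothetical.

```python
import math


def cosine(a, b):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(neighbours):
    """Similarity-weighted average over (similarity, rating) pairs."""
    weight_sum = sum(abs(s) for s, _ in neighbours)
    if weight_sum == 0:
        return None
    return sum(s * r for s, r in neighbours) / weight_sum


A = [5, 0, 3]
B = [4, 0, 2]
sim = cosine(A, B)

# Two hypothetical neighbours: a close one who rated 5 and a distant one who rated 2.
pred = knn_predict([(0.9, 5.0), (0.3, 2.0)])
print(sim, pred)
```

The prediction lands much nearer 5 than 2 because the closer neighbour carries three times the weight.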
Matrix factorization & factorization machines
Objective: learn latent-factor models that decompose the sparse rating matrix into low-rank factors.
- Latent factors: approximate the sparse rating matrix as $R\approx UV^{T}$, where each user and each item is a vector of $k$ hidden dimensions (e.g. "amount of comedy", "indie-ness") discovered by the model. Writing $p_u$ for user $u$'s row of $U$ and $q_i$ for item $i$'s row of $V$, a predicted rating is just the dot product $p_u^{T}q_i$, so aligned vectors score high.
- SVD with biases: pure dot products miss that some users rate high and some items are universally liked; adding a global mean $\mu$ and user/item biases $b_u,b_i$ fixes most of this before the factors do any work.
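- The objective: minimize squared error on the observed entries only, with L2 regularization to stop the factors from overfitting the sparse data: $\min\sum_{(u,i)\in\mathcal{K}}\big(r_{ui}-\mu-b_u-b_i-p_u^{T}q_i\big)^2+\lambda\big(\lVert p_u\rVert^2+\lVert q_i\rVert^2+b_u^2+b_i^2\big)$, where $\mathcal{K}$ is the set of observed ratings and $\lambda$ the regularization strength.
- Learning: minimize that regularized objective over the observed entries, either by SGD (a gradient step on one rating at a time) or by ALS (fix one factor matrix, solve the other in closed form; embarrassingly parallel).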
- Factorization machines: generalize MF to model all pairwise interactions among arbitrary features (user, item, context, side-info), unifying recommendation with regression and handling cold-start via features.
- Read · Aggarwal ch. 3 (model-based CF / latent factor models — full derivation); Koren, Bell & Volinsky, "Matrix Factorization Techniques for Recommender Systems" (the classic, readable survey behind the Netflix Prize).
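The SGD variant of biased matrix factorization fits in a short script. The sketch below is not a production trainer: the twelve ratings are invented, and the learning rate, regularization strength, number of factors and epoch count are hypothetical settings picked so the toy problem converges.

```python
import math
import random


def predict(mu, b_u, b_i, P, Q, u, i):
    """Global mean + user bias + item bias + dot product of latent vectors."""
    return mu + b_u[u] + b_i[i] + sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def rmse(ratings, mu, b_u, b_i, P, Q):
    errors = [(r - predict(mu, b_u, b_i, P, Q, u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(errors) / len(errors))


def train_mf(ratings, n_users, n_items, k=2, lr=0.05, reg=0.05, epochs=200, seed=0):
    """SGD over observed (user, item, rating) triples only."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u = [0.0] * n_users
    b_i = [0.0] * n_items
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    order = list(ratings)  # shuffle a copy, leave the caller's list alone
    for _ in range(epochs):
        rng.shuffle(order)
        for u, i, r in order:
            err = r - predict(mu, b_u, b_i, P, Q, u, i)
            # Each step follows the gradient of squared error plus the L2 penalty.
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mu, b_u, b_i, P, Q


# Hypothetical toy data: users 0-1 prefer items 0-1, users 2-3 prefer items 2-3.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 2, 5), (2, 3, 4), (2, 0, 1),
    (3, 2, 4), (3, 3, 5), (3, 1, 2),
]
mu, b_u, b_i, P, Q = train_mf(ratings, n_users=4, n_items=4)

zeros = [[0.0, 0.0] for _ in range(4)]
baseline_rmse = rmse(ratings, mu, [0.0] * 4, [0.0] * 4, zeros, zeros)
trained_rmse = rmse(ratings, mu, b_u, b_i, P, Q)
print(round(baseline_rmse, 3), round(trained_rmse, 3))
```

Both numbers are training errors, so they only show that the factors fit the observed entries; a held-out split is still needed to judge generalization.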
Applying general machine learning to recommendation
Objective: frame recommendation as a standard supervised-learning problem and bring the full ML toolbox.
- Feature engineering: turn users, items and interactions into a feature vector — user demographics, item metadata, recency/frequency, and text via bag-of-words / TF-IDF or embeddings. The label is the interaction (click, purchase, rating).
- TF-IDF intuition: weight each word by how often it appears in a document (TF) times how rare it is across the corpus (IDF), so common words like "the" get crushed and distinctive words dominate the profile. Worked example (base-10 log): in a 1,000-document corpus, "thriller" appears in 10 of them and 3× in this movie's synopsis ⇒ TF·IDF $=3\cdot\log_{10}(1000/10)=3\cdot2=\mathbf{6}$; "the" in all 1,000 ⇒ $\log_{10}(1000/1000)=0$, contributing nothing.
- Models: logistic regression (fast, explainable), gradient-boosted trees (XGBoost/LightGBM — the industry workhorse for tabular click prediction), and neural nets.
- Hyperparameter tuning: grid/random search vs Bayesian optimization, which models the score surface and samples where improvement is likely — far fewer trials when each training run is expensive.
- Cold-start via content: because the model scores from features, it can rate a brand-new item or user with zero interaction history — the structural fix for the cold-start problem of Session 3.
- Read · Aggarwal ch. 4–5 (content-based & knowledge-based); Rendle, "Factorization Machines" (the unifying view of MF and regression).
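The TF-IDF worked example can be reproduced with a one-line function. This sketch uses a raw term count for TF and a base-10 logarithm for IDF so that it matches the numbers in the bullet; library implementations usually apply smoothing and other log bases, so their values will differ.

```python
import math


def tf_idf(term_count, docs_with_term, n_docs):
    """Raw term count times log10(total documents / documents containing the term)."""
    return term_count * math.log10(n_docs / docs_with_term)


thriller = tf_idf(term_count=3, docs_with_term=10, n_docs=1000)
the = tf_idf(term_count=12, docs_with_term=1000, n_docs=1000)  # count of 12 is hypothetical

# A word present in every document has IDF 0, however often it appears.
print(thriller, the)
```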
MLOps for recommendation & chatbots
Objective: take a trained model from notebook to a monitored, retrainable production service.
- Pipelines: reproducible, version-controlled training and feature pipelines; data and model versioning so any result can be reproduced and rolled back.
- Serving architectures: batch (precompute lists nightly — cheap, but stale) vs real-time (score on request — fresh, costly) vs multi-stage vs hybrid; the choice is a latency/freshness/cost trade-off.
- Monitoring & retraining: watch for data and concept drift (the world changes, the model goes stale), collect feedback, and retrain on a schedule or a trigger.
- The production funnel: retrieval (millions → thousands, cheap) → filtering (business rules) → scoring (rich model) → ordering/re-ranking (diversity, freshness).
- Read · Li et al., MLOps / system-design chapters; Concept · multi-stage ranking architectures.
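The production funnel can be pictured as four small functions chained together. Everything in this sketch is a stand-in: popularity replaces a real retrieval index, a lookup table replaces the rich scoring model, and the item ids, scores and categories are invented.

```python
def retrieve(candidates, popularity, n):
    """Cheap first stage: keep the n most popular items (stand-in for ANN retrieval)."""
    return sorted(candidates, key=lambda item: popularity[item], reverse=True)[:n]


def apply_rules(items, blocked):
    """Business-rule filter, e.g. out-of-stock or already purchased."""
    return [item for item in items if item not in blocked]


def score(items, model_score):
    """Expensive model applied only to the survivors (here a lookup table)."""
    return {item: model_score[item] for item in items}


def rerank(scored, category, k, max_per_category=1):
    """Greedy diversity re-rank: best score first, capped per category."""
    picked, used = [], {}
    for item in sorted(scored, key=scored.get, reverse=True):
        cat = category[item]
        if used.get(cat, 0) < max_per_category:
            picked.append(item)
            used[cat] = used.get(cat, 0) + 1
        if len(picked) == k:
            break
    return picked


# Hypothetical catalogue of six items.
popularity = {"a": 90, "b": 80, "c": 70, "d": 60, "e": 10, "f": 5}
model_score = {"a": 0.2, "b": 0.9, "c": 0.8, "d": 0.7, "e": 0.6, "f": 0.5}
category = {"a": "news", "b": "news", "c": "sport", "d": "sport", "e": "film", "f": "film"}

stage1 = retrieve(list(popularity), popularity, n=4)
stage2 = apply_rules(stage1, blocked={"b"})
stage3 = score(stage2, model_score)
final = rerank(stage3, category, k=2)
print(stage1, stage2, final)
```

Item "d" scores higher than "a" yet loses its slot to the diversity cap, which is the kind of trade-off the re-ranking stage exists to make.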
Mid-term project presentation
Objective: present working progress — baseline beaten, first real model, honest metrics.
- Pipeline & baseline results; first personalized model vs the baseline.
- Evaluation: the chosen offline metric and what it does/doesn't capture.
- Risks & next steps toward the final.
Mid-term project presentation
Objective: finish mid-term presentations and integrate feedback into the final plan.
- Remaining presentations and cross-team feedback.
- Plan adjustment for the back half of the course (DL, sequential, graph, chatbots).
Modern recommendation models
State-of-the-art recommenders: deep learning, sequential models that respect the order of interactions, and graph-based methods that exploit the network structure of users and items.
Module learning outcomes
- Explain neural recommendation architectures and where they beat classical models.
- Model user behaviour as a sequence and recommend the next item.
- Represent recommendation as a graph and apply graph neural networks.
Deep-learning models in recommendation systems
Objective: understand neural recommenders and how embeddings replace hand-crafted similarity.
- Embeddings: learned dense vectors that place similar users/items/words near each other in space; static (Word2Vec/GloVe — one vector per word) vs contextual (BERT — the vector depends on the sentence). The same idea as MF's latent factors, now learned end-to-end.
- Neural CF & two-tower models: replace the fixed dot product with a learned non-linear scorer (NCF). The two-tower design encodes user and item separately so item vectors can be precomputed and searched with fast approximate nearest-neighbour search, the standard retrieval architecture at scale.
- Learning to rank: optimize the order directly — pointwise (predict each score independently), pairwise (BPR: rank a positive above a sampled negative), or listwise (optimize a whole-list metric). BPR in words: for an observed item $i$ and an unobserved $j$, push $\hat r_{ui}$ above $\hat r_{uj}$ by maximizing $\ln\sigma(\hat r_{ui}-\hat r_{uj})$; when $\hat r_{ui}\!-\!\hat r_{uj}=2$, $\sigma(2)\approx0.88$, so the pair is already well-ordered and contributes little gradient.
- Dimensionality reduction: PCA to compress and visualize high-dimensional embeddings in 2–3D.
- Read · Li et al., deep-learning-for-recommendation chapters; "Neural Collaborative Filtering" (He et al.).
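The BPR pair objective and its gradient scale can be checked numerically. This is a sketch for a single (positive, negative) pair; the scores passed in are hypothetical model outputs.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bpr_loss(score_pos, score_neg):
    """Negative of ln(sigmoid(score_pos - score_neg)) for one pair."""
    return -math.log(sigmoid(score_pos - score_neg))


def bpr_gradient_scale(score_pos, score_neg):
    """Derivative of ln(sigmoid(d)) with respect to d, i.e. 1 - sigmoid(d)."""
    return 1.0 - sigmoid(score_pos - score_neg)


well_ordered = bpr_gradient_scale(2.0, 0.0)   # positive already 2 points ahead
mis_ordered = bpr_gradient_scale(0.0, 2.0)    # negative is 2 points ahead

# The mis-ordered pair drives a much larger update than the well-ordered one.
print(round(sigmoid(2.0), 4), round(well_ordered, 4), round(mis_ordered, 4))
```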
Sequential recommendation systems
Objective: model the order of a user's interactions to predict the next one.
- Session-based & next-item prediction: given the recent sequence $(i_1,\dots,i_{t})$, predict $i_{t+1}$ — crucial when you only have an anonymous session, not a long user history.
- Architectures: RNN/GRU (GRU4Rec) process the sequence step by step; self-attention/transformer models (SASRec left-to-right, BERT4Rec with masking) let any past item attend to any other and train far faster — the same attention machinery as Session 22.
- Temporal dynamics: tastes drift, items go in and out of fashion, and intent within a session is short-lived — a key Netflix-Prize lesson (time-aware models beat static ones).
- Read · Li et al., sequential-recommendation chapter; SASRec (Kang & McAuley) for the self-attention approach; GRU4Rec (Hidasi et al.) for the RNN baseline.
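Before any sequential architecture is chosen, a session has to be turned into (history, next item) training pairs, and left-to-right attention models need a mask that hides the future. The sketch below shows both steps; the session contents and the `max_len` window are hypothetical.

```python
def next_item_examples(session, max_len=2):
    """Turn one ordered session into (recent history, next item) pairs."""
    examples = []
    for t in range(1, len(session)):
        history = session[max(0, t - max_len):t]
        examples.append((history, session[t]))
    return examples


def causal_mask(n):
    """Row i marks which positions j may be attended to: only j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


session = ["shoes", "socks", "laces", "polish"]  # hypothetical anonymous session
pairs = next_item_examples(session)
mask = causal_mask(3)

for history, target in pairs:
    print(history, "->", target)
print(mask)
```

A masked-item model such as BERT4Rec would drop the causal mask and instead hide random positions during training.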
Graph recommendation systems
Objective: exploit the user-item interaction graph with graph neural networks.
- Bipartite graph: users and items are two node sets, interactions are edges — a graph view of the very same user-item matrix from Session 3.
- Graph neural networks: each node repeatedly aggregates ("message-passes") its neighbours' embeddings; after $L$ layers a node has absorbed information from $L$ hops away. LightGCN strips the GNN down to just neighbour averaging and shows the heavy non-linearities often don't help for reco.
- High-order connectivity: stacking layers captures multi-hop "users who liked X also liked Y, and those users also liked Z" paths that a flat dot product can't see.
- Read · Li et al., graph-based recommendation chapter; LightGCN (He et al.) and NGCF for the message-passing formulation.
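Message passing on the bipartite graph can be traced on a toy example. This sketch uses plain neighbour averaging only: it omits the degree normalization and layer combination that LightGCN itself applies, and the graph and one-dimensional embeddings are invented so the signal is easy to follow.

```python
def propagate(user_emb, item_emb, edges):
    """One layer: every node becomes the mean of its neighbours' embeddings."""
    user_nbrs = {u: [] for u in user_emb}
    item_nbrs = {i: [] for i in item_emb}
    for u, i in edges:
        user_nbrs[u].append(i)
        item_nbrs[i].append(u)

    def mean(vectors, fallback):
        if not vectors:
            return list(fallback)  # isolated node keeps its own embedding
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    new_user = {u: mean([item_emb[i] for i in nbrs], user_emb[u])
                for u, nbrs in user_nbrs.items()}
    new_item = {i: mean([user_emb[u] for u in nbrs], item_emb[i])
                for i, nbrs in item_nbrs.items()}
    return new_user, new_item


# Hypothetical graph: u1 - x, u1 - y, u2 - y, u2 - z. Only item z carries a signal.
edges = [("u1", "x"), ("u1", "y"), ("u2", "y"), ("u2", "z")]
users = {"u1": [0.0], "u2": [0.0]}
items = {"x": [0.0], "y": [0.0], "z": [1.0]}

history = []
for layer in range(3):
    users, items = propagate(users, items, edges)
    history.append(users["u1"][0])

# z is three hops from u1 (z - u2 - y - u1), so its signal arrives at layer 3.
print(history)
```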
Chatbots & large language models
The conversational half of the course: from classical chatbot foundations to modern Q&A methodologies, the fundamentals of LLMs, operationalizing chatbots, and advanced application-building — with the final project defended in the middle of this stretch.
Module learning outcomes
- Describe chatbot architectures from intent/NLU pipelines to generative LLMs.
- Explain autoencoder and autoregressive approaches to Q&A.
- Understand LLM fundamentals: tokenization, attention, pre-training and fine-tuning.
- Build and operationalize an LLM-powered application (e.g. RAG).
Introduction to chatbots
Objective: understand the classical chatbot stack and how conversational systems are structured.
- Chatbot types: rule-based (scripted patterns — predictable, brittle), retrieval-based (pick the best response from a fixed set — safe, can't generalize), and generative (compose a new response — flexible, can hallucinate).
- Intent & NLU: classify the user's intent (e.g. book_flight) and extract entities/slots (date, destination) from an utterance. This is a text-classification + sequence-labelling problem, mirroring Session 13's "text as features".
- Dialog management: track conversation state and decide the next action/response (the policy), then realize it as text (NLG).
- Read · Alto, Building LLM Powered Applications, early chapters on conversational AI and the classical NLU pipeline.
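A rule-based version of the intent and slot step can be sketched with regular expressions. The patterns, the `cancel_booking` intent and the date format below are all hypothetical; a trained classifier and sequence labeller would replace the regexes in a real NLU pipeline.

```python
import re

# Hypothetical scripted patterns, one per intent.
INTENT_PATTERNS = {
    "book_flight": re.compile(r"\b(book|reserve)\b.*\bflight\b", re.IGNORECASE),
    "cancel_booking": re.compile(r"\bcancel\b", re.IGNORECASE),
}
DESTINATION = re.compile(r"\bto ([A-Z][a-z]+)")
DATE = re.compile(r"\bon (\d{4}-\d{2}-\d{2})\b")


def parse(utterance):
    """Return (intent, slots) for one user utterance."""
    intent = "unknown"
    for name, pattern in INTENT_PATTERNS.items():
        if pattern.search(utterance):
            intent = name
            break
    slots = {}
    destination = DESTINATION.search(utterance)
    if destination:
        slots["destination"] = destination.group(1)
    date = DATE.search(utterance)
    if date:
        slots["date"] = date.group(1)
    return intent, slots


result = parse("Please book a flight to Madrid on 2025-03-14")
fallback = parse("What is the weather like?")
print(result, fallback)
```

The second call shows the brittleness of scripted patterns: anything outside the rules falls through to `unknown`.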
Q&A modern methodologies: autoencoders & autoregressive algorithms
Objective: contrast the two dominant transformer paradigms behind modern Q&A.
- Autoencoding (BERT-style): bidirectional, masked-language-model pre-training — great for understanding/extraction.
- Autoregressive (GPT-style): left-to-right next-token prediction — great for generation.
- Extractive vs generative Q&A: pointing to a span vs composing an answer.
- Read · "Attention Is All You Need" (Vaswani et al.); BERT (Devlin et al.) for the autoencoding side.
Fundamentals of LLMs
Objective: understand how large language models are built and adapted.
- Tokenization & embeddings: text is split into subword tokens (e.g. BPE) and each is mapped to a vector; positional encodings add word order, which attention itself doesn't capture.
- Self-attention & transformers: every token forms a query $Q$, key $K$ and value $V$; the dot product $QK^{T}$ scores how much each token should attend to every other, the scores are divided by $\sqrt{d_k}$ (with $d_k$ the key dimension), softmax turns them into weights, and the output is a weighted sum of values: $\mathrm{softmax}(QK^{T}/\sqrt{d_k})\,V$. Why the $\sqrt{d_k}$? It rescales the dot products so that, for large $d_k$, they don't blow up and push softmax into vanishing-gradient saturation. Multiple "heads" let the model attend to several relationships at once.
- Pre-training → fine-tuning → alignment: self-supervised pre-training on huge corpora, then instruction tuning and RLHF/DPO to make the model helpful and aligned.
- Prompting: zero/few-shot and chain-of-thought; prompt quality drives output quality, which is exactly the GenAI-policy point from Session 1.
- Read · Alto, LLM-fundamentals chapters; Vaswani et al., "Attention Is All You Need" (the original transformer).
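Scaled dot-product attention, $\mathrm{softmax}(QK^{T}/\sqrt{d_k})\,V$, can be written out for a single head without any framework. This is a sketch on two hypothetical tokens with two-dimensional vectors; real models add learned projections, masking and multiple heads.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Return (outputs, weights) for one attention head."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights


# Two hypothetical tokens; queries equal keys, so each token favours itself.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

outputs, weights = attention(Q, K, V)
print(weights)
print(outputs)
```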
Operationalization of chatbots
Objective: deploy, monitor and safeguard a chatbot in production.
- Serving: latency, cost and context-window management for LLM calls.
- Guardrails & evaluation: hallucination checks, safety filters, and how to measure chatbot quality.
- Feedback & iteration: logging conversations and improving over time.
- Read · Alto, deployment & productionization chapters.
Final project presentation
Objective: present the finished end-to-end system and defend its design and results.
- Full pipeline & final metrics vs baseline and mid-term.
- Design justification: why these algorithms, features and trade-offs.
- Limitations & ethics of the deployed system.
Final project presentation
Objective: complete final presentations and peer evaluation.
- Remaining presentations and Q&A.
- Cross-team comparison: what worked across different problems and datasets.
Advanced methods for building chatbot applications
Objective: build grounded, tool-using LLM applications beyond raw chat.
- Retrieval-augmented generation (RAG): embed the query, retrieve the most similar document chunks from a vector store, and condition the LLM's answer on them — so the model quotes your data instead of its frozen, possibly-wrong memory. The retrieval step is, mathematically, the same nearest-neighbour search as two-tower reco (S17).
- Vector databases & embeddings search: store chunk embeddings and serve fast approximate-nearest-neighbour lookups — the retrieval backbone of RAG; chunking strategy and embedding quality dominate end-to-end accuracy.
- Prompt engineering & orchestration: templates, few-shot examples and frameworks (e.g. LangChain) that wire retrieval, prompting and post-processing together.
- Read · Alto, RAG & agent chapters; Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks".
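The retrieval half of RAG reduces to a nearest-neighbour search followed by prompt assembly. In the sketch below the three-dimensional vectors are hand-written stand-ins for an embedding model, the chunk texts are invented, and no LLM is called: the output is only the prompt that would be sent.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, store, top_k=2):
    """Exact nearest-neighbour search; a vector database would approximate this."""
    ranked = sorted(store, key=lambda chunk: cosine(query_vec, chunk["vec"]), reverse=True)
    return ranked[:top_k]


def build_prompt(question, chunks):
    """Condition the answer on the retrieved chunks."""
    context = "\n".join(f"- {chunk['text']}" for chunk in chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical vector store with hand-written embeddings.
store = [
    {"text": "Refunds are accepted within 30 days.", "vec": [1.0, 0.0, 0.0]},
    {"text": "Shipping takes 3 to 5 days.", "vec": [0.0, 1.0, 0.0]},
    {"text": "Refunds need a receipt.", "vec": [0.8, 0.2, 0.0]},
]
question = "Can I get a refund?"
query_vec = [0.9, 0.1, 0.0]  # hypothetical embedding of the question

top = retrieve(query_vec, store)
prompt = build_prompt(question, top)
print(prompt)
```

Only the two refund chunks reach the prompt, so the quality of the chunking and of the embeddings decides what the model gets to see.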
Advanced methods for building chatbot applications (continued)
Objective: extend chatbots with agents, tools and evaluation.
- Agents & tool use: letting an LLM call functions/APIs to act, not just answer.
- Memory & multi-step reasoning; chaining and planning.
- Evaluating LLM apps: task success, faithfulness and cost.
- Read · Alto, agents/tools chapters.
Industry perspective, ethics & final exam
Closing the loop: a second industry guest, a group discussion on ethical AI, and the comprehensive final exam.
Module learning outcomes
- Relate course concepts to real data-science practice.
- Critically evaluate bias, fairness and feedback loops in deployed systems.
- Demonstrate comprehensive mastery in the final exam.
Invited guest — Data Science guest
Objective: connect the course to real-world data-science practice through an industry guest.
- Real systems at scale: war stories on data, modelling and deployment.
- Career & craft: what working on recommenders/chatbots in industry actually looks like.
Group discussion: ethical AI & course wrap-up
Objective: critically examine bias and fairness in recommenders/chatbots and consolidate the course.
- The feedback loop: data → model → serving → user → data; each stage can introduce or amplify bias.
- Bias zoo: selection, exposure, conformity, position, popularity and unfairness biases.
- Position-bias correction and value-aware, multi-objective recommendation (beyond pure engagement).
- Exploration as a remedy: breaking the loop requires deliberately showing uncertain items. A multi-armed bandit balances exploiting the current best arm against exploring uncertain ones; UCB, for example, adds an optimism bonus $\sqrt{2\ln t/n_i}$ that shrinks as arm $i$ is tried more often. Worked intuition at round $t=100$: an arm seen only 4 times gets bonus $\sqrt{2\ln 100/4}\approx1.52$, while one seen 100 times gets $\sqrt{2\ln100/100}\approx0.30$, so the rarely shown item is given a fair chance.
- Wrap-up: synthesizing recommenders + chatbots into one mental model — both estimate a conditional distribution and serve from it under constraints.
- Read · Aggarwal ch. 13 (advanced topics); a fairness-in-recsys survey of your choice.
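The UCB arithmetic from the exploration bullet can be checked with a short sketch. It assumes the bonus form $\sqrt{2\ln t/n_i}$ used in this course; the mean rewards and pull counts in the selection example are made-up illustration values, and the function names are hypothetical.

```python
import math

def ucb_bonus(t, n):
    """Optimism bonus sqrt(2 ln t / n) for an arm pulled n times by round t."""
    return math.sqrt(2.0 * math.log(t) / n)

def ucb_select(means, counts, t):
    """Pick the arm with the highest mean-plus-bonus; untried arms go first."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    scores = [m + ucb_bonus(t, n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

# The two bonuses from the worked intuition (t = 100).
rare = ucb_bonus(100, 4)      # about 1.52
common = ucb_bonus(100, 100)  # about 0.30

# Illustrative values: arm 0 looks better on average but has been shown 96 times,
# arm 1 looks worse but has been shown only 4 times.
choice = ucb_select(means=[0.60, 0.40], counts=[96, 4], t=100)
print(round(rare, 2), round(common, 2), choice)
```

Here the rarely shown arm wins the round despite its lower observed mean, which is exactly the deliberate exploration that breaks the feedback loop.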
Final exam · 30%
Objective: demonstrate comprehensive mastery of recommendation and chatbot theory and methods.
- Coverage: paradigms, similarity, matrix factorization, evaluation metrics, bias, and chatbot/LLM fundamentals.
- Format: comprehensive written exam worth 30% of the course grade (passing grade 5/10).
Key concepts — glossary
A quick reference to the core vocabulary of the course. Many terms have a live demo in the interactive lab.
- Utility function
- The score $g(u,i)$ a recommender assigns to each user-item pair; recommendation maximizes it per user.
- Explicit feedback
- Deliberate signals like star ratings or likes — clear in meaning but sparse.
- Implicit feedback
- Behavioural signals like clicks, views and purchases — abundant but ambiguous (absence ≠ dislike).
- Cold-start problem
- The difficulty of recommending for new users or new items that have no interaction history.
- Data sparsity
- The user-item matrix is mostly empty; the vast majority of pairs have no observed feedback.
- Non-personalized recommender
- Recommends the same items to everyone, e.g. by random sampling or popularity.
- Bayesian average
- Shrinks an item's mean rating toward the global mean in proportion to how few ratings it has, fixing low-count outliers.
- Content-based filtering
- Recommends items whose metadata/features match a profile built from the user's past likes; handles new items but risks filter bubbles.
- Collaborative filtering (CF)
- Recommends from patterns in feedback across many users, without item metadata; strong but cold-start-prone.
- User-based vs item-based CF
- Neighborhood CF using similar users versus similar items; item-based is more stable and cacheable at scale.
- Cosine similarity
- $\cos\theta=\frac{A\cdot B}{\lVert A\rVert\lVert B\rVert}$: the cosine of the angle between two vectors, so it depends on direction and not magnitude; the standard similarity in CF and content-based methods.
- K-nearest neighbours (KNN)
- Predicts via a similarity-weighted average over the K most similar users or items.
- Matrix factorization
- Approximates $R\approx UV^{T}$, learning low-rank latent user and item factors that capture hidden tastes.
- SVD with biases
- $\hat r_{ui}=\mu+b_u+b_i+q_i^{T}p_u$ — factorization plus global/user/item bias terms.
- Factorization machine
- Generalizes MF to model all pairwise feature interactions, unifying recommendation with regression.
- TF-IDF
- Term frequency × inverse document frequency — weights words by how informative they are in a corpus.
- Embedding
- A learned dense vector representing a user, item or word; static (Word2Vec) or contextual (BERT).
- Hybrid recommender
- Combines several recommenders — weighted, switching or mixed — to offset each one's weaknesses.
- Context-aware RS (CARS)
- Extends $g:U\times I\to\mathbb{R}$ to $g:U\times I\times C\to\mathbb{R}$ via pre-filtering, post-filtering or contextual modeling.
- RMSE / MAE
- Root-mean-square / mean-absolute error — regression metrics for rating prediction; RMSE penalizes large errors more.
- Precision / Recall / F1
- Classification metrics for relevant-vs-not recommendations; F1 is their harmonic mean.
- NDCG
- Normalized Discounted Cumulative Gain — a ranking metric that discounts relevance by position and normalizes against the ideal order.
- MAP / MRR
- Mean Average Precision and Mean Reciprocal Rank ($1/\text{rank}$ of the first hit) — order-sensitive ranking metrics.
- Beyond-accuracy metrics
- Coverage, diversity (low intra-list similarity, ILS), novelty and personalization: quality dimensions that accuracy alone misses.
- Learning to rank
- Optimizing the ranking directly: pointwise, pairwise (BPR) or listwise objectives.
- Multi-armed bandit
- Online learning that balances exploiting the current best arm against exploring uncertain ones (ε-greedy, Thompson sampling).
- Feedback loop & bias
- Recommenders train on data they generated; selection, exposure, position and popularity biases can compound over time.
- Production funnel
- Retrieval → filtering → scoring → ordering — narrows billions of candidates while increasing compute per item.
- Intent / NLU
- Natural-language understanding: classifying a user's intent and extracting entities from an utterance.
- Autoencoding vs autoregressive
- BERT-style bidirectional masked modelling (understanding) vs GPT-style next-token prediction (generation).
- Self-attention
- $\text{softmax}(QK^{T}/\sqrt{d_k})V$ — the transformer mechanism letting each token attend to all others.
- RAG
- Retrieval-Augmented Generation — retrieve relevant documents and condition the LLM's answer on them to reduce hallucination.
- MLOps
- Engineering practice for reliably training, deploying, monitoring and retraining ML systems in production.
- Utility matrix
- The (mostly empty) user × item table of observed feedback; recommendation = predicting its missing entries.
- Long tail
- The many niche items that each get little feedback but collectively form most of the catalog; surfacing them is a core job of a recommender.
- MNAR (missing-not-at-random)
- Observed entries aren't a random sample — users and the system both choose what gets rated, biasing offline evaluation.
- Pearson correlation
- Cosine similarity on mean-centred ratings; removes per-user rating-scale bias, often preferred for explicit feedback.
- Latent factor
- A learned hidden dimension (e.g. "amount of comedy") in MF/embeddings that captures taste without being hand-labelled.
- Regularization (λ)
- A penalty on parameter size ($\lambda\lVert\cdot\rVert^2$) that combats overfitting on sparse data; central to MF training.
- SGD vs ALS
- Two ways to fit MF: stochastic gradient descent (one rating at a time) vs alternating least squares (fix one factor, solve the other in closed form, parallelizable).
- BPR
- Bayesian Personalized Ranking: a pairwise objective that ranks an observed item $i$ above a sampled unobserved item $j$ by maximizing $\ln\sigma(\hat r_{ui}-\hat r_{uj})$; the loss that is minimized is its negative.
- Precision@K / Recall@K
- Top-K ranking metrics: fraction of the K shown that are relevant, vs fraction of all relevant items that were surfaced.
- DCG / IDCG
- Discounted Cumulative Gain and its ideal (best-possible) value; their ratio is NDCG.
- Coverage
- The fraction of the catalog a recommender ever recommends; low coverage signals popularity bias.
- Diversity / novelty
- How dissimilar (intra-list) and how unexpected the recommended items are — beyond-accuracy quality dimensions.
- Two-tower model
- Separate user and item encoders whose outputs are compared by dot product; item vectors are precomputed for fast ANN retrieval.
- Neural CF (NCF)
- Collaborative filtering with a learned non-linear scoring function replacing the fixed dot product.
- Graph neural network (GNN)
- Learns node embeddings by message-passing over the user-item graph; LightGCN/NGCF capture multi-hop collaborative signal.
- Sequential recommendation
- Predicts the next item from the ordered history of interactions (GRU4Rec, SASRec, BERT4Rec).
- Over-smoothing
- Failure mode where too many GNN layers make all node embeddings converge, destroying personalization.
- Tokenization
- Splitting text into subword units (e.g. BPE) that an LLM maps to embeddings; the input granularity of transformers.
- Positional encoding
- Information added to token embeddings so the order-agnostic attention mechanism can use word order.
- Pre-training vs fine-tuning
- Self-supervised learning on broad data, then task/instruction adaptation (incl. RLHF) on narrower data.
- RLHF
- Reinforcement Learning from Human Feedback — aligns an LLM's outputs to human preferences after pre-training.
- Vector database
- Stores embeddings and serves fast approximate-nearest-neighbour search — the retrieval backbone of RAG and two-tower reco.
- Hallucination
- An LLM stating fluent but false content; mitigated by grounding (RAG), guardrails and evaluation.
- LLM agent / tool use
- An LLM that calls functions/APIs to take actions and reason in multiple steps, not just generate text.
- Position bias
- Higher items get clicked more regardless of relevance; corrected by normalizing clicks by position propensity.
- Exploration vs exploitation
- The bandit trade-off between trying uncertain options and serving the current best; UCB and Thompson sampling balance it.
- UCB
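- Upper Confidence Bound: picks the arm with the highest mean-plus-optimism-bonus, $\bar x_i+\sqrt{2\ln t/n_i}$, where $\bar x_i$ is arm $i$'s mean reward so far, $n_i$ the number of times it has been tried and $t$ the total number of rounds.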
- A/B test
- Online randomized experiment comparing variants on real users; it provides the ground truth that offline metrics only approximate.
- Multi-stage ranking
- Production pattern: cheap retrieval narrows millions of items, then progressively richer models score the survivors.
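Several of the ranking metrics defined in this glossary (Precision@K, Recall@K, reciprocal rank, DCG/IDCG and NDCG) are easy to compute by hand for a single user. The sketch below is illustrative only: it assumes linear gains and a $\log_2(\text{position}+1)$ discount for DCG (other gain conventions exist), and the ranked list and relevance labels are made-up toy data. MRR and MAP would average the per-user values over all users.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k shown items that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item (0.0 if none is ranked)."""
    for pos, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / pos
    return 0.0

def dcg_at_k(gains, k):
    """Discounted cumulative gain with a log2(position + 1) discount."""
    return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked, relevance, k):
    """DCG of the shown order divided by the DCG of the ideal order (IDCG)."""
    gains = [relevance.get(item, 0.0) for item in ranked]
    ideal = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

# Toy data: items "b" and "d" are hits; relevant item "e" was never shown.
ranked = ["a", "b", "c", "d"]
relevance = {"b": 1.0, "d": 1.0, "e": 1.0}
relevant = set(relevance)

p = precision_at_k(ranked, relevant, 4)  # 2 of 4 shown are relevant
r = recall_at_k(ranked, relevant, 4)     # 2 of 3 relevant were surfaced
rr = reciprocal_rank(ranked, relevant)   # first hit is at rank 2
n = ndcg_at_k(ranked, relevance, 4)
print(p, r, rr, n)
```

Note how the missing item "e" lowers both recall and NDCG even though it never appears in the ranked list: the ideal ordering used for IDCG is built from all relevant items, not only the ones that were shown.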
Annotated bibliography
Recommended texts from the official syllabus.
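Useful primary papers referenced above: Koren et al., "Matrix Factorization Techniques for Recommender Systems"; Rendle, "Factorization Machines"; He et al., "Neural Collaborative Filtering" and "LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation"; Kang & McAuley, "Self-Attentive Sequential Recommendation" (SASRec); Vaswani et al., "Attention Is All You Need"; Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"; Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks".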