AI: Chatbots & Recommendation Engines

The complete course structure — every module and all 30 sessions, with objectives, the core mathematics, key ideas and annotated readings. A study companion to the interactive lab of 25+ live demos.

30 sessions

6 ECTS credits

6 thematic modules

3 assessed deliverables

Course overview

Recommendation engines reshaped entire industries by cutting search costs and improving the user experience: with the sheer variety of products, films and music available today, customers would have little chance of finding the right item without search & recommendation engines. Chatbots, meanwhile, have become a primary channel for interacting with customers, and the rise of LLMs such as ChatGPT, Bing Chat and Bard has opened a new era of interactive search.

This course introduces recommender systems and chatbots, reviews real industry examples in detail, and teaches students to handle, apply and evaluate recommendation and chatbot methods. It covers the theoretical foundations alongside applied, methodological skills. By the end, students can build an end-to-end recommendation solution in Python at the level expected in a large company, and can develop and deploy a working chatbot.

Programme

BCSAI

Course code

AICRE-CSAI.3.M.A

Year · Semester

3rd · 2nd

Credits

6.0 ECTS

Sessions

30

Instructor

Miguel González-Fierro — Principal Data Science Manager at Microsoft Spain. Electrical Engineer (UC3M), PhD in Robotics (UC3M in collaboration with King's College London), and graduate of the MIT Sloan School of Management. Former CEO/founder of Samsamia Technologies (visual search engine for fashion) and founder of the Robotics Society of UC3M.
mgonzalezfierro@faculty.ie.edu · office hours on request

GenAI policy. Use of generative AI is encouraged to build an informed, critical perspective — but treat outputs as wrong until verified, refine prompts deliberately, and acknowledge any AI use (acknowledgement does not affect your grade; failing to acknowledge violates academic-honesty policy). A minimum-effort prompt yields a low-quality result; you remain responsible for any errors or omissions, and you can only validate AI output for topics you already understand — which is exactly why the theory in this course matters. If you used no AI, the recommended disclosure is: "No content generated by AI technologies has been used in this assignment."

Prerequisites & how this course connects

This is a third-year BCSAI course that assumes you arrive fluent in a few things and pays you back by tying them together:

Linear algebra — vectors, dot products, norms and matrix multiplication underpin cosine similarity, matrix factorization ($R\approx UV^{T}$) and attention ($QK^{T}$).
Probability & statistics — expectations, distributions and Bayes' rule power the Bayesian average, bandit exploration and the probabilistic view of ranking.
Calculus / optimization — gradients and gradient descent are how MF, factorization machines and neural recommenders are actually trained.
Machine-learning foundations — regression, classification, train/test discipline, regularization and overfitting carry over directly; Session 13 reframes recommendation as supervised learning.
Intermediate Python — NumPy/pandas and clean code; the engineering module (S8–S9) raises this to production standard.

It connects forward to NLP, deep learning and MLOps courses: the transformer and embedding machinery in Module 5 is the same machinery behind modern language models, and the serving/monitoring ideas in Session 14 generalize to any ML system.

Weekly study load. The course is 6 ECTS = 150 hours over ~15 teaching weeks, i.e. roughly 10 hours per week. Budget about 2 h of class contact plus ~8 h of independent work, and note that group work (60 h, 40%) is the single largest block — front-load it rather than leaving it to the checkpoints. A healthy rhythm: ~2 h reviewing readings/notes, ~2 h individual exercises, ~4 h project work each week, ramping up before the S6–7, S15–16 and S24–25 milestones.

Learning objectives

By the end of the course, students will be able to:

Understand the foundations. Explain the theoretical basis of recommender systems and chatbots and situate recommendation within the wider machine-learning taxonomy.
Handle and apply methods. Select, implement and tune the major recommendation paradigms — non-personalized, content-based, collaborative filtering, matrix factorization, hybrid and context-aware.
Evaluate rigorously. Measure systems with regression, classification, ranking and beyond-accuracy metrics, choose models, and operationalize them responsibly.
Build production-grade solutions. Engineer an end-to-end recommendation pipeline in Python at the level required in a large company, following sound development and MLOps practice.
Develop a chatbot. Implement and deploy a working chatbot, including modern LLM-based and retrieval-augmented approaches.
Think critically & ethically. Reason about bias, fairness and feedback loops, and the societal impact of deployed recommendation and conversational systems.

Teaching methodology & assessment

IE University's method is collaborative, active and applied: students participate throughout to build knowledge and sharpen skills, while the professor leads and guides. The total workload is 150 hours across the following activity mix.

Learning-activity weighting

Group work: 40.0%

Lectures: 20.0%

Discussions: 13.3%

Exercises / async / field work: 13.3%

Individual studying: 13.3%

Estimated effort: lectures 30h · discussions 20h · exercises 20h · group work 60h · individual study 20h.

Assessment weighting

Group project: 35%

Final exam: 30%

Individual contributions: 25%

Individual participation: 10%

Group project — 35%

The flagship deliverable: an end-to-end recommendation (and/or chatbot) solution in Python, defended across three checkpoints: project discussion (S6–S7), mid-term presentation (S15–S16) and final presentation (S24–S25). This mirrors how a real data-science team scopes, builds and ships.

Deliverable format: a clean GitHub repository (data/src/notebooks/tests + README, reproducible from a fresh clone) plus a slide deck defended live at each checkpoint.

Rubric (indicative): problem framing & baseline 20% · modelling soundness 25% · evaluation rigor (honest, task-matched metrics) 25% · engineering/reproducibility 15% · communication of the data→model→eval→product story 15%.

Tips: beat a simple baseline before chasing a fancy model; report where the model fails; keep the repo runnable.

Final exam — 30%

Comprehensive written exam (S30) over theory and methods: paradigms, similarity, matrix factorization, evaluation metrics, bias, and chatbot/LLM fundamentals.

Format: closed-book, comprehensive written exam; expect definitions, short derivations (e.g. compute cosine similarity, DCG/NDCG or an MF update by hand) and conceptual "which method and why" questions.

Tips: revise the formulas in each session's box and be able to reproduce one worked example per topic; the exam rewards connections between topics, not isolated definitions.

Individual contributions — 25%

Each student's measurable, individual share of the group work and exercises — assessed separately from the team grade so that real personal effort is rewarded.

Evidence: attributable Git commits/PRs, owned modules, and the contribution plan agreed in S7.

Tips: commit under your own identity with clear messages; own a coherent slice (e.g. evaluation, or the data pipeline) end-to-end rather than scattering tiny edits.

Individual participation — 10%

Active, prepared engagement in lectures and discussions across the semester.

Assessed on: quality (not just quantity) of contributions, evidence you did the reading, and constructive peer feedback during the project-discussion and presentation sessions.

Tips: come with one prepared question per reading; engaging with other teams' presentations counts.

Pass, attendance & re-sit rules. The passing grade is 5/10. Each subject allows up to four calls over two academic years (ordinary + extraordinary re-sits in June/July). Students who fall below the 80% attendance requirement fail both calls for the year and must re-enrol. The June/July re-sit is a comprehensive on-campus exam (Segovia or Madrid); the final grade then depends on that exam only (continuous evaluation is not counted) and is capped at 8.0/10. Retakers (3rd call) may reach 10.0 and must confirm criteria with the assigned professor. Failing more than 18 ECTS after the re-sits may require leaving the programme. A review session precedes any grade appeal.

Full program — 30 sessions

All thirty sessions are live and in-person. They are grouped below into six thematic modules. Each module opens with an overview and intended learning outcomes; every session is a timeline item with its objective, topic explanations, core formulas/definitions where technical, a key-idea callout, annotated readings, and links to the matching interactive demos.

Module 1 · Sessions 1–7

Foundations & the recommendation problem

What recommender systems and chatbots are, how the data behaves, the families of algorithms, how we evaluate them, and scoping the group project. This module builds the conceptual map for everything that follows.

Module learning outcomes

Frame recommendation as learning a utility function and distinguish it from classification/regression.
Characterize recommendation data: explicit vs implicit feedback, sparsity and the cold-start problem.
Compare the five recommender families and know when each applies.
Select appropriate evaluation metrics and an operationalization strategy.
Scope a realistic end-to-end group project.

SESSION 1 · LIVE IN-PERSON

Course logistics, organization & intro orientation

Objective: understand how the course runs, the assessment structure and the group-project expectations.

Syllabus & assessment map: the 35/30/25/10 split, the three project checkpoints, attendance and re-sit rules.
GenAI policy: encouraged, but to be used critically (verify outputs, refine prompts, and acknowledge use).
Team formation & tooling: forming project groups and previewing the Python/GitHub stack used later.

Key idea: this is an applied course — most of the grade comes from building and defending a real system, not from memorization.

Read · Aggarwal, Recommender Systems: The Textbook, ch. 1 (an introduction to recommender systems) to set context.

SESSION 2 · LIVE IN-PERSON

Introduction to recommendation & chatbots

Objective: define the recommendation problem formally and survey where recommenders and chatbots create business value.

The utility function: a recommender learns a score $g(u,i)$ for every (user, item) pair and, for each user, returns the highest-scoring items. Everything in the course — neighborhoods, latent factors, neural nets — is a different way to estimate this one function. Intuitively, $g$ encodes "how much would user $u$ like item $i$?" and the whole pipeline exists to fill in the unobserved entries of a giant, mostly-empty table.
Rating vs ranking: predicting a numeric utility (regression, e.g. "4.2 stars") is a different task from producing the right ordered list (ranking). A model can have great RMSE yet rank badly, because users only ever see the top of the list — which is why ranking metrics (Session 5) often matter more than rating error.
Business framing: recommenders reduce search cost, improve UX and surface the long tail (the many niche items no human could browse); chatbots add an interactive search/serving channel. Commonly cited industry estimates attribute roughly a third of Amazon purchases and the majority of Netflix viewing to recommendations.

$g:U\times I\to\mathbb{R}, \qquad i^{*}_{u}=\arg\max_{i\in I} g(u,i)$

Key idea: almost every recommender, however sophisticated, is ultimately a way to estimate and then maximize this utility function $g$.
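In code, the "estimate, then maximize" step can be sketched as follows. This is a minimal illustration, not a real model: the score table, user names and item names are hypothetical, and the estimates of $g(u,i)$ are simply hard-coded where a trained recommender would supply them.

```python
# Hypothetical estimates of g(u, i): one row per user, one score per item.
scores = {
    "ana": {"matrix": 4.6, "amelie": 2.1, "alien": 4.9, "up": 3.3},
    "ben": {"matrix": 1.8, "amelie": 4.7, "alien": 2.2, "up": 4.4},
}
# Items each user has already consumed (assumed; excluded from the list).
seen = {"ana": {"alien"}, "ben": set()}


def recommend(user, k=2):
    """Return the k unseen items with the highest estimated utility."""
    candidates = {i: s for i, s in scores[user].items() if i not in seen[user]}
    return sorted(candidates, key=candidates.get, reverse=True)[:k]


print(recommend("ana"))  # ['matrix', 'up']
print(recommend("ben"))  # ['amelie', 'up']
```

Setting `k=1` gives the single arg-max item $i^{*}_{u}$; larger `k` gives the top-K list that users actually see.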

Connects to: maximizing $g$ greedily ignores diversity and exploration — themes that return in beyond-accuracy metrics (S5) and bandits/ethics (S29). The "best single item" is rarely the best list.

Try in lab: Utility matrix · 5 recommender families

Read · Aggarwal ch. 1 — the goals of recommender systems and the rating-vs-ranking distinction. Skim · Li et al., Frontiers and Practices intro for the industry view (why reco is a revenue lever, not a feature).

SESSION 3 · LIVE IN-PERSON

Data in recommendation systems

Objective: understand the raw material of a recommender — feedback signals, their structure and their pathologies.

Explicit feedback: deliberate signals such as ★ ratings, thumbs and likes. The meaning is unambiguous (a 5 really means "loved it"), but users rate only a tiny fraction of what they consume, so explicit data is precious and scarce — and biased toward people who bother to rate.
Implicit feedback: behavioural traces — clicks, views, dwell time, add-to-cart, purchases, skips. Abundant and cheap, but ambiguous: a click can be a misclick, and a non-click can mean "unseen" rather than "disliked". A common modelling trick is to treat interactions as positive with a confidence that grows with the count (e.g. $c_{ui}=1+\alpha r_{ui}$ in implicit-ALS).
Sparsity & the long tail: the user-item matrix is typically >99% empty, and interactions follow a power law — a few blockbuster items get most of the feedback while the long tail gets almost none. Worked intuition: 10⁶ users × 10⁵ items = 10¹¹ cells, but if each user touches ~100 items only ~10⁸ are filled — density ≈ 0.1%.
Missing-not-at-random (MNAR): the entries you observe are not a random sample — users choose what to rate, and the system chose what to show. So absence of a rating is not a negative label, and naive "treat blanks as 0" training systematically distorts the model.

Key idea: implicit feedback is plentiful but noisy; the central modelling choice is how to interpret a missing entry.
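The sparsity arithmetic and the confidence weighting above can be checked with a few lines of Python. This is a sketch under stated assumptions: the play counts and item names are hypothetical, and `alpha=40.0` is only an illustrative value that would need tuning on real data.

```python
n_users, n_items = 10**6, 10**5
interactions_per_user = 100  # assumed average, as in the worked intuition

cells = n_users * n_items
filled = n_users * interactions_per_user
density = filled / cells
print(f"density = {density:.3%}")  # density = 0.100%


def confidence(count, alpha=40.0):
    """Implicit-ALS style confidence c_ui = 1 + alpha * r_ui."""
    return 1.0 + alpha * count


# Hypothetical play counts for one user; absent items have count 0.
plays = {"song_a": 12, "song_b": 1}
for item in ["song_a", "song_b", "song_c"]:
    r = plays.get(item, 0)
    preference = 1 if r > 0 else 0  # binary "did interact" target
    print(item, preference, confidence(r))
```

An unobserved item keeps preference 0 but only the minimum confidence of 1, which is one way of encoding "probably not interested, but we are not sure".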

Pitfall: evaluating on MNAR data inflates offline metrics — the test set over-represents popular items the system already pushed. This is why offline wins so often fail to reproduce in an online A/B test (S5, S10).

Try in lab: Implicit→explicit converter · Cold-start detector

Read · Aggarwal ch. 2 (rating types, neighborhood data); Concept · implicit-feedback & confidence weighting in Li et al. (Hu, Koren & Volinsky's implicit-ALS is the canonical reference).

SESSION 4 · LIVE IN-PERSON

Recommendation-systems algorithms overview

Objective: map the landscape of recommender families and their trade-offs before going deep on any one.

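Non-personalized: the same list for everyone, either random or ranked by popularity / average rating. Trivial to build and a surprisingly strong baseline; the right fix for sparse averages is a Bayesian average that shrinks low-count items toward the global mean: $\bar r_i=\frac{C\mu+\sum_j r_{ij}}{C+n_i}$, where $\mu$ is the global mean rating, $n_i$ the number of ratings item $i$ has received, $r_{ij}$ its $j$-th rating and $C$ a prior strength (a pseudo-count of "virtual" ratings at $\mu$), so a 5.0 from two ratings doesn't outrank a 4.6 from a thousand.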
Content-based: build a profile from the features of items a user liked, then score new items by feature match. Handles brand-new items (no interactions needed) and is explainable ("because you watched sci-fi"), but over-specializes into a filter bubble and can't surprise the user.
Collaborative filtering: learn purely from feedback patterns across users — "people like you liked this" — with no item metadata. Powerful and serendipitous, but cold-start-prone (needs history) and hurt by sparsity.
Hybrid: combine the above to cancel each other's weaknesses — weighted (blend scores), switching (use CB when CF lacks data), or mixed (present both). Most production systems are hybrids.
Context-aware (CARS): extend the pair to $g:U\times I\times C\to\mathbb{R}$ with context such as time, location or device, via pre-filtering, post-filtering or full contextual modelling (e.g. "lunch spots at noon, near me").

Content-based: $g_{CB}(u,i)=\cos\big(\text{profile}(u),\,\text{features}(i)\big)$

Key idea: there is no single best algorithm — the right choice depends on data availability, cold-start exposure, explainability needs and scale.
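The Bayesian-average baseline from the non-personalized family is short enough to sketch directly. The global mean `MU = 3.5` and the prior strength `C = 10` are assumed values chosen for illustration; in practice both would be estimated from or tuned on the data.

```python
def bayesian_average(ratings, global_mean, prior_strength):
    """(C * mu + sum of ratings) / (C + n): shrink small samples toward mu."""
    return (prior_strength * global_mean + sum(ratings)) / (prior_strength + len(ratings))


MU, C = 3.5, 10          # assumed global mean and prior strength
niche = [5.0, 5.0]       # raw average 5.0 from two ratings
popular = [4.6] * 1000   # raw average 4.6 from a thousand ratings

print(round(bayesian_average(niche, MU, C), 2))    # 3.75
print(round(bayesian_average(popular, MU, C), 2))  # 4.59
```

With only two ratings the prior dominates and the niche item drops below the well-established one; with a thousand ratings the prior barely moves the score.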

Connects to: content-based ↔ Session 13 (features as a supervised problem); CF ↔ Sessions 11–12 (neighborhood & latent-factor); hybrids are how cold-start (S3) is patched in practice.

Try in lab: Strategy explorer · Random vs popular vs Bayesian

Read · Aggarwal ch. 1 §1.3 (taxonomy) — keep this map handy all term; ch. 4 (content-based) & ch. 2 (neighborhood CF) as previews.

SESSION 5 · LIVE IN-PERSON

Evaluation, model selection & operationalization

Objective: learn how to judge a recommender offline, choose between models, and reason about serving it in production.

Regression metrics: MAE, MSE, RMSE and R² for rating prediction. RMSE squares errors, so it punishes a few large mistakes harder than MAE; use it when big misses are costly.
Classification metrics: precision, recall, F1, accuracy and the ROC curve with its AUC for the relevant-vs-not view. In top-K recommendation, precision@K = (relevant in top K)/K answers "of what I showed, how much was good?", while recall@K = (relevant in top K)/(all relevant items) answers "of all good items, how much did I surface?".
Ranking metrics: CG, DCG, NDCG, MRR and Precision@K — here order is everything because users scan from the top down. NDCG discounts gains by log-position and normalizes against the ideal ordering, giving a score in $[0,1]$.
Beyond accuracy: coverage (what fraction of the catalog ever gets shown), diversity (low intra-list similarity), novelty and personalization. These are dimensions a pure accuracy number is blind to.
Data splits: random, per-user stratified, and time-based (train on the past, test on the future — the only honest split for a deployed system); offline metrics vs the online A/B-test reality.

$\mathrm{RMSE}=\sqrt{\tfrac{1}{n}\sum_{(u,i)}(\hat r_{ui}-r_{ui})^2} \qquad \mathrm{NDCG@K}=\frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}},\;\; \mathrm{DCG@K}=\sum_{j=1}^{K}\frac{rel_j}{\log_2(j+1)}$

Worked NDCG@3 — relevances of the returned ranking $[3,1,2]$: $\;\mathrm{DCG}=\frac{3}{\log_2 2}+\frac{1}{\log_2 3}+\frac{2}{\log_2 4}=3+0.63+1=4.63$. Ideal order $[3,2,1]$: $\;\mathrm{IDCG}=3+\frac{2}{1.585}+\frac{1}{2}=4.76$. So $\mathrm{NDCG@3}=4.63/4.76\approx \mathbf{0.97}$.
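The same calculation can be reproduced in a few lines of Python. One simplifying assumption to note: the ideal ordering is built from the relevances of the returned list only, whereas a full evaluation would rank all of the user's relevant items.

```python
import math


def dcg_at_k(relevances, k):
    """Sum of rel_j / log2(j + 1) over the first k positions (j starts at 1)."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the same relevances sorted best-first."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


ranking = [3, 1, 2]
print(round(dcg_at_k(ranking, 3), 2))   # 4.63
print(round(ndcg_at_k(ranking, 3), 2))  # 0.97
```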

Key idea: a lower RMSE does not guarantee a better product — the metric must match the task (rating vs ranking) and the business goal.
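A toy comparison makes the point concrete. The ratings and the two sets of predictions below are invented for illustration: hypothetical model A has the smaller rating error but puts the wrong item first, while model B is further off numerically yet orders the items correctly.

```python
import math

true_ratings = {"a": 5.0, "b": 4.0, "c": 1.0}
model_a = {"a": 4.0, "b": 4.1, "c": 1.0}  # small errors, but swaps a and b
model_b = {"a": 3.5, "b": 2.5, "c": 0.0}  # larger errors, correct order


def rmse(pred):
    """Root-mean-squared error against the true ratings."""
    squared = [(pred[i] - true_ratings[i]) ** 2 for i in true_ratings]
    return math.sqrt(sum(squared) / len(squared))


def ranking(item_scores):
    """Items sorted from highest to lowest score."""
    return sorted(item_scores, key=item_scores.get, reverse=True)


for name, pred in [("A", model_a), ("B", model_b)]:
    print(name, round(rmse(pred), 2), ranking(pred))
# A 0.58 ['b', 'a', 'c']
# B 1.35 ['a', 'b', 'c']
```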

Pitfall: a random train/test split leaks the future into the past (you train on interactions that happened after test ones). For anything time-ordered, split by time — and remember offline NDCG is only a proxy for the online metric that actually pays the bills.
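A time-based split is simple to implement. The sketch below uses a hypothetical interaction log of `(user, item, timestamp)` tuples and a global cutoff; per-user "leave the last interaction out" is a common alternative that is not shown here.

```python
# Hypothetical interaction log: (user, item, unix-style timestamp).
log = [
    ("u1", "i1", 100), ("u2", "i3", 110), ("u1", "i2", 120),
    ("u3", "i1", 130), ("u2", "i2", 140), ("u3", "i4", 150),
    ("u1", "i4", 160), ("u2", "i1", 170), ("u3", "i2", 180),
    ("u1", "i3", 190),
]


def time_split(interactions, test_fraction=0.2):
    """Train on the earliest interactions, test on the most recent ones."""
    ordered = sorted(interactions, key=lambda row: row[2])
    cut = len(ordered) - int(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]


train, test = time_split(log)
# No training interaction happens after any test interaction.
assert max(t for _, _, t in train) <= min(t for _, _, t in test)
print(len(train), len(test))  # 8 2
```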

Try in lab: Regression metrics · Confusion matrix & ROC · DCG/NDCG/MRR · Beyond-accuracy · Train/test splits

Read · Aggarwal ch. 7 (evaluating recommender systems); Concept · NDCG & offline/online evaluation in Li et al.

SESSION 6 · LIVE IN-PERSON

Project discussion — presentation of projects

Objective: present and refine group-project proposals — problem, dataset, baseline and success metric.

Problem & dataset: who are the users, what are the items, what feedback is available (explicit/implicit)?
Baseline & metric: choose a non-personalized baseline (popularity / Bayesian average) and the single metric you will improve — matched to the task (rating vs ranking) per Session 5.
Peer feedback: tighten scope and de-risk early.

Key idea: a sharp baseline + a single honest target metric is worth more than an ambitious but unmeasurable plan.

Deliverable: a one-page proposal — problem, dataset, baseline, target metric, and the split strategy you'll use — plus initial repo scaffolding.

SESSION 7 · LIVE IN-PERSON

Project discussion — presentation of projects

Objective: complete proposal presentations and lock the project plan and team responsibilities.

Remaining presentations and consolidated feedback.
Work plan: milestones aligned to the mid-term (S15–16) and final (S24–25) checkpoints.
Individual contribution plan: who owns what (feeds the 25% contributions grade).

Key idea: defining individual ownership now is what makes the later individual-contribution assessment fair.

Module 2 · Sessions 8–10

Engineering practice & the product lens

The professional craft of building recommender/chatbot systems: project and repository setup, Python engineering practice, and a product-management perspective from an invited guest.

Module learning outcomes

Set up a reproducible Python project with version control and a clean structure.
Apply software-engineering practices (testing, modularity, environments) to ML code.
Connect technical choices to product strategy and stakeholder needs.

SESSION 8 · LIVE IN-PERSON

Development practices — set up project & GitHub lab

Objective: stand up a reproducible project skeleton under version control.

Repository structure: data / src / notebooks / tests separation; README and config.
Git & GitHub workflow: branches, commits, pull requests, code review.
Environments: virtual environments and pinned dependencies for reproducibility.

Key idea: reproducibility is a feature — if a teammate can't re-run your result from a clean clone, it doesn't exist.

SESSION 9 · LIVE IN-PERSON

Development practices — Python practices lab

Objective: write clean, testable Python for data and ML workloads.

Idiomatic Python & the data stack: NumPy / pandas vectorization, avoiding hidden loops.
Modularity & testing: functions over scripts, unit tests, type hints, linting.
Performance basics: sparse matrices for large user-item data.

Key idea: recommender data is sparse and large — representing it as a dense matrix is the most common way to run out of memory.
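A back-of-the-envelope estimate shows the scale of the problem. The sizes below are assumptions (10⁶ users, 10⁵ items, about 100 interactions per user, 4-byte values and indices); in a real project a library such as `scipy.sparse` would typically hold the data, whereas this sketch only uses a plain dictionary to show the idea of storing observed cells only.

```python
n_users, n_items = 10**6, 10**5
nnz = 10**8  # assumed number of observed interactions (about 100 per user)

dense_bytes = n_users * n_items * 4  # one float32 per cell
coo_bytes = nnz * (4 + 4 + 4)        # int32 row + int32 column + float32 value

print(f"dense:  {dense_bytes / 1e9:.0f} GB")  # dense:  400 GB
print(f"sparse: {coo_bytes / 1e9:.1f} GB")    # sparse: 1.2 GB

# A tiny dictionary-of-keys store: only observed cells take space.
ratings = {("u1", "i7"): 4.0, ("u2", "i3"): 5.0}
print(ratings.get(("u1", "i3")))  # None, i.e. unobserved rather than 0
```

Returning `None` for a missing key also keeps "unobserved" distinct from "rated 0", which a dense array filled with zeros cannot do.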

SESSION 10 · LIVE IN-PERSON

Invited guest — product management

Objective: see how recommendation/chatbot features are prioritized and shipped in industry.

From metric to roadmap: turning model gains into product decisions.
Experimentation culture: A/B tests, guardrail metrics, and knowing when offline wins don't ship.
Stakeholders: aligning data science, engineering and business.

Key idea: the model is one input to a product decision, not the decision itself.

Module 3 · Sessions 11–16

Similarity, matrix factorization & ML at scale

The core recommendation algorithms: similarity-based neighborhood methods, latent-factor models (matrix factorization & factorization machines), applying general ML, and the MLOps that puts them in production — bracketed by the mid-term project checkpoint.

Module learning outcomes

Implement user-based and item-based KNN collaborative filtering with cosine similarity.
Train and reason about matrix-factorization / SVD models with biases and regularization.
Cast recommendation as a general supervised-learning problem (e.g. factorization machines).
Apply MLOps practices: pipelines, monitoring, retraining and deployment.

SESSION 11 · LIVE IN-PERSON

Similarity-based methods

Objective: build memory-based collaborative filtering from similarity between users or items.

Cosine similarity: the cosine of the angle between two rating/feature vectors. It measures direction (taste pattern) and ignores magnitude, so a generous and a stingy rater whose scores are proportional still look similar. Ranges $[0,1]$ for non-negative vectors. Worked example: $A=[5,0,3],\,B=[4,0,2]$ ⇒ $A\!\cdot\!B=26$, $\lVert A\rVert=\sqrt{34}\approx5.831$, $\lVert B\rVert=\sqrt{20}\approx4.472$, so $\cos\theta=26/\sqrt{680}\approx\mathbf{0.997}$, i.e. near-identical taste.
Pearson correlation: cosine on mean-centred ratings — subtracts each user's average first, which corrects for rating-scale bias and is often preferred for explicit ratings.
User-based CF: find users whose history is similar to yours, then recommend what they liked. Intuitive but the user-user matrix shifts constantly and is expensive to keep fresh.
Item-based CF: find items similar to those you already liked. Item-item relationships are far more stable over time and can be precomputed offline, which is why Amazon's classic engine is item-based.
KNN prediction: a similarity-weighted average of the neighbours' ratings — closer (more similar) neighbours count more.

$\cos(\theta)=\dfrac{A\cdot B}{\lVert A\rVert\,\lVert B\rVert}\qquad \hat r_{ui}=\dfrac{\sum_{v\in N_k(u)} \text{sim}(u,v)\,r_{vi}}{\sum_{v\in N_k(u)} |\text{sim}(u,v)|}$
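Both formulas translate almost line by line into Python. This is a dependency-free sketch (NumPy would vectorize it); the neighbour similarities and ratings in the second part are hypothetical. Computed at full precision, the example pair $A=[5,0,3]$, $B=[4,0,2]$ comes out at about 0.997.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


A, B = [5, 0, 3], [4, 0, 2]
print(round(cosine(A, B), 3))  # 0.997


def knn_predict(neighbours):
    """neighbours: list of (similarity to user u, neighbour's rating of item i)."""
    num = sum(sim * r for sim, r in neighbours)
    den = sum(abs(sim) for sim, _ in neighbours)
    return num / den if den else None


# Hypothetical top-3 neighbours of user u who have rated item i.
print(round(knn_predict([(0.9, 5), (0.6, 4), (0.3, 2)]), 2))  # 4.17
```

The most similar neighbour (0.9) pulls the prediction toward its rating of 5, which is why the result sits above the unweighted mean of 3.67.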

Key idea: item-based CF usually beats user-based in production because item-item similarities are more stable and can be precomputed.

Pitfall: raw cosine treats unrated items as 0 (a strong negative signal) and is dominated by popular items co-rated by everyone. Mean-centring (Pearson) and shrinking similarities for low-overlap pairs are the usual fixes.

Try in lab: Cosine similarity · KNN-CF sandbox

Read · Aggarwal ch. 2 (neighborhood-based CF: derivations of user/item similarity and prediction); Concept · similarity functions in Li et al. Classic: Sarwar et al., "Item-Based Collaborative Filtering Recommendation Algorithms".

SESSION 12 · LIVE IN-PERSON

Matrix factorization & factorization machines

Objective: learn latent-factor models that decompose the sparse rating matrix into low-rank factors.

Latent factors: approximate the sparse rating matrix as $R\approx UV^{T}$, where each user and each item is a vector of $k$ hidden dimensions (e.g. "amount of comedy", "indie-ness") discovered by the model. Writing $p_u$ for row $u$ of $U$ and $q_i$ for row $i$ of $V$, a predicted rating is just the dot product $p_u^{T}q_i$, so aligned vectors score high.
SVD with biases: pure dot products miss that some users rate high and some items are universally liked; adding a global mean $\mu$ and user/item biases $b_u,b_i$ fixes most of this before the factors do any work.
The objective: minimize squared error on the observed entries only, with L2 regularization to stop the factors from overfitting the sparse data (see formula).
Learning: minimize the regularized objective over the observed entries with either SGD (gradient steps, one rating at a time) or ALS (alternating least squares: fix one factor matrix, solve the other in closed form; embarrassingly parallel).
Factorization machines: generalize MF to model all pairwise interactions among arbitrary features (user, item, context, side-info), unifying recommendation with regression and handling cold-start via features.

$\displaystyle\min_{p,q,b}\sum_{(u,i)\in\mathcal{K}}\!\big(r_{ui}-\mu-b_u-b_i-q_i^{T}p_u\big)^2+\lambda\big(\lVert p_u\rVert^2+\lVert q_i\rVert^2+b_u^2+b_i^2\big)$

SGD update (one observed rating): with error $e_{ui}=r_{ui}-\hat r_{ui}$, $\;p_u\leftarrow p_u+\gamma(e_{ui}\,q_i-\lambda p_u),\quad q_i\leftarrow q_i+\gamma(e_{ui}\,p_u-\lambda q_i)$. E.g. if $e_{ui}=+1$, each factor of $p_u$ nudges toward $q_i$ (and vice-versa), pulling the two vectors into closer alignment.
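The update rule can be exercised on a toy problem. Everything below is an illustrative sketch: the six ratings, the learning rate `gamma`, the regularization `lam` and the factor dimension `k` are assumed values, and the bias updates (which the update above does not spell out) follow the same gradient pattern as the factor updates.

```python
import random

random.seed(0)
# Hypothetical observed ratings as (user index, item index, rating).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, k = 3, 3, 2
gamma, lam = 0.05, 0.02  # assumed learning rate and regularization strength

P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
mu = sum(r for _, _, r in ratings) / len(ratings)
bu, bi = [0.0] * n_users, [0.0] * n_items


def predict(u, i):
    """mu + b_u + b_i + q_i^T p_u."""
    return mu + bu[u] + bi[i] + sum(p * q for p, q in zip(P[u], Q[i]))


def rmse():
    return (sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5


before = rmse()
for epoch in range(200):
    random.shuffle(ratings)
    for u, i, r in ratings:
        e = r - predict(u, i)  # error on this one observed rating
        bu[u] += gamma * (e - lam * bu[u])
        bi[i] += gamma * (e - lam * bi[i])
        for f in range(k):
            pu, qi = P[u][f], Q[i][f]  # old values, so both updates use them
            P[u][f] += gamma * (e * qi - lam * pu)
            Q[i][f] += gamma * (e * pu - lam * qi)
after = rmse()
print(f"train RMSE {before:.2f} -> {after:.2f}")
```

This only reports training error on six points; on real data the same loop would be judged on held-out ratings, since a low training RMSE alone says little about generalization.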

Key idea: matrix factorization discovers latent "taste" dimensions automatically — no hand-crafted features — which is why it dominated the Netflix Prize.

Pitfall: classic linear-algebra SVD requires a complete matrix, so you cannot just "run SVD" on the 99%-empty rating matrix — that's why MF is trained by SGD/ALS over observed cells, not by literal SVD. Too-large $k$ or too-small $\lambda$ overfits; tune both.

Try in lab: Train a tiny SVD live

Read · Aggarwal ch. 3 (model-based CF / latent factor models — full derivation); Koren, Bell & Volinsky, "Matrix Factorization Techniques for Recommender Systems" (the classic, readable survey behind the Netflix Prize).

SESSION 13 · LIVE IN-PERSON

Applying general machine learning to recommendation

Objective: frame recommendation as a standard supervised-learning problem and bring the full ML toolbox.

Feature engineering: turn users, items and interactions into a feature vector — user demographics, item metadata, recency/frequency, and text via bag-of-words / TF-IDF or embeddings. The label is the interaction (click, purchase, rating).
TF-IDF intuition: weight each word by how often it appears in a document (TF) times how rare it is across the corpus (IDF), so common words like "the" get crushed and distinctive words dominate the profile. Worked example (base-10 logarithm): in a 1,000-document corpus, "thriller" appears in 10 documents and 3× in this movie's synopsis ⇒ TF·IDF $=3\cdot\log_{10}(1000/10)=3\cdot2=\mathbf{6}$; "the" appears in all 1,000 ⇒ $\log_{10}(1000/1000)=0$, contributing nothing.
Models: logistic regression (fast, explainable), gradient-boosted trees (XGBoost/LightGBM — the industry workhorse for tabular click prediction), and neural nets.
Hyperparameter tuning: grid/random search vs Bayesian optimization, which models the score surface and samples where improvement is likely — far fewer trials when each training run is expensive.
Cold-start via content: because the model scores from features, it can rate a brand-new item or user with zero interaction history — the structural fix for the cold-start problem of Session 3.

$\text{TF-IDF}(t,d)=\text{TF}(t,d)\cdot\log\dfrac{N}{|\{d:t\in d\}|}$
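The formula can be checked numerically. This sketch uses the raw term count as TF and a base-10 logarithm, which is an assumption made to keep the numbers round; library implementations often use the natural logarithm and smoothing, so their absolute values differ while the ordering of terms stays similar.

```python
import math


def tf_idf(term_count, n_docs, docs_with_term):
    """Raw term count times log10(N / document frequency)."""
    return term_count * math.log10(n_docs / docs_with_term)


# "thriller": 3 occurrences in this synopsis, present in 10 of 1,000 documents.
print(tf_idf(3, 1000, 10))     # 6.0
# "the": many occurrences (25 assumed here), but present in every document.
print(tf_idf(25, 1000, 1000))  # 0.0
```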

Key idea: once you add side-features, "recommendation" becomes "predict the interaction" — a regression/classification problem you already know how to solve.

Connects to: this is the bridge between the content-based family (S4) and factorization machines (S12) — FMs are exactly "linear/logistic regression plus learned pairwise feature interactions". Gradient boosting here is also what powers many real-world ranking funnels (S14).

Try in lab: BoW & TF-IDF · Embeddings & PCA · Grid vs Bayesian opt

Read · Aggarwal ch. 4–5 (content-based & knowledge-based); Rendle, "Factorization Machines" (the unifying view of MF and regression).

SESSION 14 · LIVE IN-PERSON

MLOps for recommendation & chatbots

Objective: take a trained model from notebook to a monitored, retrainable production service.

Pipelines: reproducible, version-controlled training and feature pipelines; data and model versioning so any result can be reproduced and rolled back.
Serving architectures: batch (precompute lists nightly — cheap, but stale) vs real-time (score on request — fresh, costly) vs multi-stage vs hybrid; the choice is a latency/freshness/cost trade-off.
Monitoring & retraining: watch for data and concept drift (the world changes, the model goes stale), collect feedback, and retrain on a schedule or a trigger.
The production funnel: retrieval (millions → thousands, cheap) → filtering (business rules) → scoring (rich model) → ordering/re-ranking (diversity, freshness).

Key idea: you can't score billions of items per request — production recommenders are funnels that narrow candidates while spending more compute per surviving item.
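The four funnel stages can be sketched as four small functions chained together. Everything here is a stand-in chosen for illustration: the six-item catalog, the user taste profile, popularity as the cheap retrieval signal, and "one item per category" as the re-ranking rule are all hypothetical simplifications of what a production system would use.

```python
# Hypothetical catalog: item -> (popularity, in_stock, category).
catalog = {
    "i1": (0.9, True, "sci-fi"), "i2": (0.8, False, "sci-fi"),
    "i3": (0.7, True, "drama"),  "i4": (0.6, True, "sci-fi"),
    "i5": (0.5, True, "comedy"), "i6": (0.1, True, "drama"),
}
user_taste = {"sci-fi": 1.0, "drama": 0.6, "comedy": 0.2}  # assumed profile


def retrieve(n):
    """Cheap stage: top-n by popularity stands in for nearest-neighbour retrieval."""
    return sorted(catalog, key=lambda i: catalog[i][0], reverse=True)[:n]


def filter_rules(items):
    """Business rule: drop out-of-stock items."""
    return [i for i in items if catalog[i][1]]


def score(items):
    """'Rich model' stand-in: taste for the category times popularity."""
    return {i: user_taste[catalog[i][2]] * catalog[i][0] for i in items}


def rerank(scored, k):
    """Greedy diversity: walk down the scores, at most one item per category."""
    out, used = [], set()
    for i in sorted(scored, key=scored.get, reverse=True):
        if catalog[i][2] not in used:
            out.append(i)
            used.add(catalog[i][2])
        if len(out) == k:
            break
    return out


print(rerank(score(filter_rules(retrieve(5))), k=3))  # ['i1', 'i3', 'i5']
```

Note that `i4` scores higher than `i3` but is dropped by the diversity rule, and that the expensive `score` function only ever sees the handful of candidates that survived the cheaper stages.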

Connects to: the two-tower model (S17) is the standard retrieval stage and gradient-boosted trees (S13) a common scoring stage — the funnel is where every method in the course slots together.

Try in lab: Retrieval→score funnel

Read · Li et al., MLOps / system-design chapters; Concept · multi-stage ranking architectures.

SESSION 15 · LIVE IN-PERSON

Mid-term project presentation

Objective: present working progress — baseline beaten, first real model, honest metrics.

Pipeline & baseline results; first personalized model vs the baseline.
Evaluation: the chosen offline metric and what it does/doesn't capture.
Risks & next steps toward the final.

Key idea: show a result you can defend, including where it falls short — honesty about limitations is graded as competence.

SESSION 16 · LIVE IN-PERSON

Mid-term project presentation

Objective: finish mid-term presentations and integrate feedback into the final plan.

Remaining presentations and cross-team feedback.
Plan adjustment for the back half of the course (DL, sequential, graph, chatbots).

Key idea: the mid-term exists to catch a wrong direction while there is still time to correct it.

Module 4 · Sessions 17–19

Modern recommendation models

State-of-the-art recommenders: deep learning, sequential models that respect the order of interactions, and graph-based methods that exploit the network structure of users and items.

Module learning outcomes

Explain neural recommendation architectures and where they beat classical models.
Model user behaviour as a sequence and recommend the next item.
Represent recommendation as a graph and apply graph neural networks.

SESSION 17 · LIVE IN-PERSON

Deep-learning models in recommendation systems

Objective: understand neural recommenders and how embeddings replace hand-crafted similarity.

Embeddings: learned dense vectors that place similar users/items/words near each other in space; static (Word2Vec/GloVe — one vector per word) vs contextual (BERT — the vector depends on the sentence). The same idea as MF's latent factors, now learned end-to-end.
Neural CF & two-tower models: replace the fixed dot product with a learned non-linear scorer (NCF). The two-tower design encodes user and item separately so item vectors can be precomputed and searched with fast approximate nearest-neighbour — the standard retrieval architecture at scale.
Learning to rank: optimize the order directly — pointwise (predict each score independently), pairwise (BPR: rank a positive above a sampled negative), or listwise (optimize a whole-list metric). BPR in words: for an observed item $i$ and an unobserved $j$, push $\hat r_{ui}$ above $\hat r_{uj}$ by maximizing $\ln\sigma(\hat r_{ui}-\hat r_{uj})$; when $\hat r_{ui}\!-\!\hat r_{uj}=2$, $\sigma(2)\approx0.88$, so the pair is already well-ordered and contributes little gradient.
Dimensionality reduction: PCA to compress and visualize high-dimensional embeddings in 2–3D.

BPR (pairwise): $\;\max \sum_{(u,i,j)} \ln\sigma\big(\hat r_{ui}-\hat r_{uj}\big)\;$ for $i$ preferred over $j$, where $\sigma(x)=1/(1+e^{-x})$
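A small numeric check of the BPR objective, assuming nothing beyond the formula itself. The helper names are illustrative. The gradient of $\ln\sigma(x)$ with respect to the margin $x=\hat r_{ui}-\hat r_{uj}$ is $1-\sigma(x)$, so well-ordered pairs contribute little.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bpr_term(r_ui, r_uj):
    """ln sigma(r_ui - r_uj): the per-triple quantity BPR maximizes."""
    return math.log(sigmoid(r_ui - r_uj))

def bpr_grad_wrt_margin(r_ui, r_uj):
    """d/dx ln sigma(x) = 1 - sigma(x), evaluated at x = r_ui - r_uj."""
    return 1.0 - sigmoid(r_ui - r_uj)

# Margins: mis-ordered pair, tie, well-ordered pair.
for margin in (-2.0, 0.0, 2.0):
    print(margin, round(sigmoid(margin), 2), round(bpr_grad_wrt_margin(margin, 0.0), 2))
```

The mis-ordered pair (margin of -2) receives the largest gradient, which is where the learning signal is concentrated.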

Key idea: deep models shine when you have rich side-information and lots of data; with small sparse data, regularized MF is often still the strongest baseline.

Pitfall: neural recommenders are easy to over-engineer — several famous "deep beats MF" results failed to reproduce against a properly-tuned MF baseline. Always benchmark against tuned MF before claiming a neural win.

Try in lab · Embeddings & PCA · Pointwise/pairwise/listwise

Read · Li et al., deep-learning-for-recommendation chapters; "Neural Collaborative Filtering" (He et al.).

SESSION 18 · LIVE IN-PERSON

Sequential recommendation systems

Objective: model the order of a user's interactions to predict the next one.

Session-based & next-item prediction: given the recent sequence $(i_1,\dots,i_{t})$, predict $i_{t+1}$ — crucial when you only have an anonymous session, not a long user history.
Architectures: RNN/GRU (GRU4Rec) process the sequence step by step; self-attention/transformer models (SASRec left-to-right, BERT4Rec with masking) let each position attend directly to other items in the sequence and train faster because all positions are processed in parallel rather than one step at a time. This is the same attention machinery as Session 22.
Temporal dynamics: tastes drift, items go in and out of fashion, and intent within a session is short-lived — a key Netflix-Prize lesson (time-aware models beat static ones).
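As a reference point for next-item prediction, here is a first-order Markov baseline that conditions only on the last item $i_t$. It is deliberately not GRU4Rec or SASRec, just the simplest order-aware model to compare them against. The session data is made up.

```python
from collections import Counter, defaultdict

def fit_transitions(sessions):
    """Count how often item b directly follows item a across all sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for prev, nxt in zip(session, session[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, session, k=2):
    """Rank candidates for the next item using only the last item of the session."""
    last = session[-1]
    return [item for item, _ in counts[last].most_common(k)]

# Hypothetical anonymous sessions.
sessions = [
    ["phone", "case", "charger"],
    ["phone", "case", "screen_protector"],
    ["laptop", "mouse"],
    ["phone", "charger"],
    ["case", "charger"],
]
counts = fit_transitions(sessions)
print(predict_next(counts, ["laptop", "phone"]))
```

Because the model ignores everything before the last item, it cannot tell a session that started with a laptop from one that did not. That gap is what the RNN and self-attention models are meant to close.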

Key idea: classical CF treats interactions as a bag; sequential models treat them as a story — order carries intent.

Key takeaway: use sequential models when recency and order dominate the signal (news, music, e-commerce sessions); they reuse the transformer you learn for LLMs.

Read · Li et al., sequential-recommendation chapter; SASRec (Kang & McAuley) for the self-attention approach; GRU4Rec (Hidasi et al.) for the RNN baseline.

SESSION 19 · LIVE IN-PERSON

Graph recommendation systems

Objective: exploit the user-item interaction graph with graph neural networks.

Bipartite graph: users and items are two node sets, interactions are edges — a graph view of the very same user-item matrix from Session 3.
Graph neural networks: each node repeatedly aggregates ("message-passes") its neighbours' embeddings; after $L$ layers a node has absorbed information from $L$ hops away. LightGCN strips the GNN down to just neighbour averaging and shows the heavy non-linearities often don't help for reco.
High-order connectivity: stacking layers captures multi-hop "users who liked X also liked Y, and those users also liked Z" paths that a flat dot product can't see.

Key idea: GNNs make the collaborative signal explicit — they propagate preferences along the graph instead of inferring them from a flat matrix.

Pitfall: too many GNN layers cause over-smoothing — every node's embedding converges to the same vector and personalization collapses; 2–4 layers is usually the sweet spot.
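Over-smoothing can be seen on a tiny graph. The sketch below uses plain neighbour averaging with no weights or non-linearities. That is a simplification: LightGCN itself uses a symmetric degree normalization and averages the outputs of all layers. The graph and the initial embeddings are hypothetical.

```python
import math

# Hypothetical bipartite graph: which items each user touched, and the reverse view.
user_items = {0: [0, 1], 1: [1, 2], 2: [2]}
item_users = {0: [0], 1: [0, 1], 2: [1, 2]}

def mean(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def propagate(users, items, layers):
    """Apply `layers` rounds of message passing: each node becomes the mean of its neighbours."""
    for _ in range(layers):
        new_users = {u: mean([items[i] for i in nbrs]) for u, nbrs in user_items.items()}
        new_items = {i: mean([users[u] for u in nbrs]) for i, nbrs in item_users.items()}
        users, items = new_users, new_items
    return users, items

def spread(embs):
    """Largest pairwise Euclidean distance: a crude measure of how distinct the nodes still are."""
    vecs = list(embs.values())
    return max(math.dist(a, b) for a in vecs for b in vecs)

users0 = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [-1.0, 0.0]}
items0 = {0: [0.0, 1.0], 1: [1.0, 1.0], 2: [0.0, -1.0]}

for layers in (0, 2, 8, 40):
    users, _ = propagate(users0, items0, layers)
    print(layers, round(spread(users), 4))
```

The spread between user embeddings shrinks as layers are added, so after enough rounds every user looks the same to the scorer.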

Read · Li et al., graph-based recommendation chapter; LightGCN (He et al.) and NGCF for the message-passing formulation.

Module 5 · Sessions 20–27

Chatbots & large language models

The conversational half of the course: from classical chatbot foundations to modern Q&A methodologies, the fundamentals of LLMs, operationalizing chatbots, and advanced application-building — with the final project defended in the middle of this stretch.

Module learning outcomes

Describe chatbot architectures from intent/NLU pipelines to generative LLMs.
Explain autoencoder and autoregressive approaches to Q&A.
Understand LLM fundamentals: tokenization, attention, pre-training and fine-tuning.
Build and operationalize an LLM-powered application (e.g. RAG).

SESSION 20 · LIVE IN-PERSON

Introduction to chatbots

Objective: understand the classical chatbot stack and how conversational systems are structured.

Chatbot types: rule-based (scripted patterns — predictable, brittle), retrieval-based (pick the best response from a fixed set — safe, can't generalize), and generative (compose a new response — flexible, can hallucinate).
Intent & NLU: classify the user's intent (e.g. book_flight) and extract entities/slots (date, destination) from an utterance — a text-classification + sequence-labelling problem, mirroring Session 13's "text as features".
Dialog management: track conversation state and decide the next action/response (the policy), then realize it as text (NLG).

Intent classification: $\;\text{intent}^{*}=\arg\max_{c}\;P(c\mid \text{utterance})$ — the same $\arg\max$ over a conditional score as the recommender's $g(u,i)$.
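One way to obtain $P(c\mid\text{utterance})$ is a multinomial naive Bayes classifier over words. The sketch below is only an illustration: the training utterances and intent names are invented, and a production NLU component would use a far stronger text model.

```python
import math
from collections import Counter

# Hypothetical training utterances with their intents.
train = [
    ("book a flight to madrid", "book_flight"),
    ("i need a flight tomorrow", "book_flight"),
    ("cancel my booking", "cancel_booking"),
    ("please cancel the reservation", "cancel_booking"),
    ("what is the weather in madrid", "get_weather"),
]

word_counts = {}            # intent -> Counter of words seen for that intent
intent_counts = Counter()   # intent -> number of training utterances
for text, intent in train:
    intent_counts[intent] += 1
    word_counts.setdefault(intent, Counter()).update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def posterior(utterance):
    """P(c | utterance) from multinomial naive Bayes with add-one smoothing."""
    log_scores = {}
    for c in intent_counts:
        total = sum(word_counts[c].values())
        log_p = math.log(intent_counts[c] / len(train))
        for w in utterance.split():
            if w in vocab:  # words never seen in training are ignored
                log_p += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        log_scores[c] = log_p
    z = max(log_scores.values())  # subtract the max before exponentiating for stability
    exp = {c: math.exp(s - z) for c, s in log_scores.items()}
    norm = sum(exp.values())
    return {c: v / norm for c, v in exp.items()}

probs = posterior("please cancel my flight")
intent_star = max(probs, key=probs.get)
print(intent_star, probs)
```

The last two lines are the $\arg\max$ from the formula: compute a score per intent, then pick the largest.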

Key idea: a chatbot is a pipeline — understand (NLU) → decide (dialog policy) → respond (NLG); LLMs increasingly collapse these stages into one model.

Pitfall: a pure generative model with no grounding will confidently invent facts; production assistants almost always bolt retrieval/guardrails (S23, S26) onto the generator.

Read · Alto, Building LLM Powered Applications, early chapters on conversational AI and the classical NLU pipeline.

SESSION 21 · LIVE IN-PERSON

Q&A modern methodologies: autoencoders & autoregressive algorithms

Objective: contrast the two dominant transformer paradigms behind modern Q&A.

Autoencoding (BERT-style): bidirectional, masked-language-model pre-training — great for understanding/extraction.
Autoregressive (GPT-style): left-to-right next-token prediction — great for generation.
Extractive vs generative Q&A: pointing to a span vs composing an answer.

Autoregressive LM: $\;P(x_{1:T})=\prod_{t=1}^{T}P(x_t\mid x_{<t})$
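The factorization can be made concrete with a bigram model, which approximates "condition on all previous tokens" by conditioning on the previous token only. That first-order shortcut is an assumption of this sketch; GPT-style models condition on the whole prefix. The three-sentence corpus is made up.

```python
from collections import Counter, defaultdict

# Hypothetical corpus; "<s>" marks the start of a sentence.
corpus = [
    ["<s>", "the", "cat", "sat"],
    ["<s>", "the", "dog", "sat"],
    ["<s>", "the", "cat", "ran"],
]

bigrams = defaultdict(Counter)
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigrams[prev][cur] += 1

def cond_prob(cur, prev):
    """P(current token | previous token) by relative frequency, without smoothing."""
    total = sum(bigrams[prev].values())
    return bigrams[prev][cur] / total if total else 0.0

def sequence_prob(tokens):
    """Chain rule: multiply the conditional probability of each token given its predecessor."""
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= cond_prob(cur, prev)
    return p

print(sequence_prob(["<s>", "the", "cat", "sat"]))
print(sequence_prob(["<s>", "the", "dog", "ran"]))
```

The second sentence gets probability zero because "dog ran" never occurs in the corpus. This is the sparsity problem that smoothing, and later neural language models, address.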

Key idea: encoders read, decoders write — the autoencoding/autoregressive split explains why BERT excels at understanding and GPT at generation.

Read · "Attention Is All You Need" (Vaswani et al.); BERT (Devlin et al.) for the autoencoding side.

SESSION 22 · LIVE IN-PERSON

Fundamentals of LLMs

Objective: understand how large language models are built and adapted.

Tokenization & embeddings: text is split into subword tokens (e.g. BPE) and each is mapped to a vector; positional encodings add word order, which attention itself doesn't capture.
Self-attention & transformers: every token forms a query $Q$, key $K$ and value $V$; the dot product $QK^{T}$ scores how much each token should attend to every other, softmax turns those scores into weights, and the output is a weighted sum of values. Why the $\sqrt{d_k}$? It rescales the dot products so that, for large $d_k$, they don't grow so large that softmax saturates and its gradients vanish. Multiple "heads" let the model attend to several relationships at once.
Pre-training → fine-tuning → alignment: self-supervised pre-training on huge corpora, then instruction tuning and RLHF/DPO to make the model helpful and aligned.
Prompting: zero/few-shot and chain-of-thought; prompt quality drives output quality, which is exactly the GenAI-policy point from Session 1.

$\text{Attention}(Q,K,V)=\text{softmax}\!\Big(\dfrac{QK^{T}}{\sqrt{d_k}}\Big)V$
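The formula can be written out directly in pure Python. This is a single-head sketch with no learned projections, no masking and no batching; the example matrices are arbitrary.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors; returns (outputs, weights)."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # One scaled dot-product score per key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Output is the weighted sum of the value vectors.
        out = [sum(w_j * v[d] for w_j, v in zip(w, V)) for d in range(len(V[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
out, w = attention(Q, K, V)
for row_w, row_out in zip(w, out):
    print([round(x, 3) for x in row_w], [round(x, 3) for x in row_out])
```

Each row of weights sums to 1, so every output row is a weighted average of the value vectors, tilted toward the keys most aligned with the query.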

Key idea: the transformer's scaled dot-product attention is the single mechanism underlying essentially every modern LLM.

Connects to: attention is a similarity-weighted average ($QK^{T}$ is a dot-product similarity, i.e. the cosine similarity of Session 11 without the length normalization); the same primitive recurs from CF to sequential recommenders (SASRec/BERT4Rec, S18) to LLMs.

Read · Alto, LLM-fundamentals chapters; Vaswani et al., "Attention Is All You Need" (the original transformer).

SESSION 23 · LIVE IN-PERSON

Operationalization of chatbots

Objective: deploy, monitor and safeguard a chatbot in production.

Serving: latency, cost and context-window management for LLM calls.
Guardrails & evaluation: hallucination checks, safety filters, and how to measure chatbot quality.
Feedback & iteration: logging conversations and improving over time.

Key idea: the hard part of a production chatbot is not generation — it's grounding, evaluation and guardrails.

Read · Alto, deployment & productionization chapters.

SESSION 24 · LIVE IN-PERSON

Final project presentation · project

Objective: present the finished end-to-end system and defend its design and results.

Full pipeline & final metrics vs baseline and mid-term.
Design justification: why these algorithms, features and trade-offs.
Limitations & ethics of the deployed system.

Key idea: the final defence rewards a coherent story from data → model → evaluation → product, not just the best number.

SESSION 25 · LIVE IN-PERSON

Final project presentation · project

Objective: complete final presentations and peer evaluation.

Remaining presentations and Q&A.
Cross-team comparison: what worked across different problems and datasets.

Key idea: seeing how other teams solved different problems is part of the learning — patterns generalize.

SESSION 26 · LIVE IN-PERSON

Advanced methods for building chatbot applications

Objective: build grounded, tool-using LLM applications beyond raw chat.

Retrieval-augmented generation (RAG): embed the query, retrieve the most similar document chunks from a vector store, and condition the LLM's answer on them — so the model quotes your data instead of its frozen, possibly-wrong memory. The retrieval step is, mathematically, the same nearest-neighbour search as two-tower reco (S17).
Vector databases & embeddings search: store chunk embeddings and serve fast approximate-nearest-neighbour lookups — the retrieval backbone of RAG; chunking strategy and embedding quality dominate end-to-end accuracy.
Prompt engineering & orchestration: templates, few-shot examples and frameworks (e.g. LangChain) that wire retrieval, prompting and post-processing together.

RAG: $\;P(\text{answer}\mid q)=\sum_{d\in \text{retrieve}(q)} P(\text{answer}\mid q,d)\,P(d\mid q)$
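The retrieve-then-condition flow can be sketched without any external service. Everything here is a stand-in: `embed` is a bag-of-words counter rather than a learned embedding model, the three chunks are invented, and the sketch stops at building the prompt instead of calling an LLM. It also places the retrieved text in the prompt rather than marginalizing over documents as the formula does, which is a common simplification.

```python
import math
from collections import Counter

# Hypothetical document chunks; a real system would keep these in a vector store.
chunks = [
    "The final exam is worth 30% of the course grade.",
    "Session 19 covers graph neural networks and LightGCN.",
    "Office hours are available on request.",
]

def embed(text):
    """Stand-in embedding: lowercase bag-of-words counts (NOT a learned embedding)."""
    cleaned = text.lower().replace(".", "").replace("?", "").replace(",", "")
    return Counter(cleaned.split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=1):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, retrieved):
    """Condition the generator on retrieved text by placing it in the prompt."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

query = "How much is the final exam worth?"
print(build_prompt(query, retrieve(query, k=1)))
```

Swapping `embed` for a real embedding model and the `sorted` call for an approximate-nearest-neighbour index changes the quality and speed, not the shape of the pipeline.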

Key idea: RAG grounds an LLM in your data; it is one of the most effective ways to reduce hallucination and keep answers current.

Pitfall: RAG fails silently when retrieval fails — if the right chunk isn't retrieved, the model fluently answers from priors. Evaluate retrieval (recall@k of the gold chunk) separately from generation.
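Retrieval can be scored on its own with recall@k of the gold chunk. The sketch assumes a small labelled evaluation set where each question has exactly one gold chunk; the chunk ids are hypothetical.

```python
def recall_at_k(retrieved_ids, gold_id, k):
    """1.0 if the gold chunk appears in the top-k retrieved ids, else 0.0."""
    return 1.0 if gold_id in retrieved_ids[:k] else 0.0

# Hypothetical evaluation set: (ranked chunk ids from the retriever, id of the gold chunk).
eval_set = [
    (["c3", "c1", "c7"], "c1"),
    (["c2", "c9", "c4"], "c4"),
    (["c5", "c6", "c8"], "c0"),  # retrieval miss: the generator never sees the right chunk
    (["c1", "c2", "c3"], "c1"),
]

def mean_recall(k):
    """Average recall@k over the evaluation set."""
    return sum(recall_at_k(r, g, k) for r, g in eval_set) / len(eval_set)

for k in (1, 2, 3):
    print(k, mean_recall(k))
```

A low value here points at chunking or embeddings rather than the generator, which is why the two stages are measured separately.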

Read · Alto, RAG & agent chapters; Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks".

SESSION 27 · LIVE IN-PERSON

Advanced methods for building chatbot applications (continued)

Objective: extend chatbots with agents, tools and evaluation.

Agents & tool use: letting an LLM call functions/APIs to act, not just answer.
Memory & multi-step reasoning; chaining and planning.
Evaluating LLM apps: task success, faithfulness and cost.
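The control flow of a tool-using agent can be sketched with the LLM replaced by a hard-coded planner. Every name here is hypothetical (`stub_planner`, `TOOLS`, the two tools and their data); a real agent would ask a model to choose the next action and would parse its structured output.

```python
# Hypothetical tools the agent is allowed to call.
def get_rating_count(item):
    return {"matrix": 1200, "inception": 3400}.get(item, 0)

def add(a, b):
    return a + b

TOOLS = {"get_rating_count": get_rating_count, "add": add}

def stub_planner(question, observations):
    """Stand-in for an LLM call: returns the next action as a dict (hard-coded here)."""
    if len(observations) == 0:
        return {"tool": "get_rating_count", "args": {"item": "matrix"}}
    if len(observations) == 1:
        return {"tool": "get_rating_count", "args": {"item": "inception"}}
    if len(observations) == 2:
        return {"tool": "add", "args": {"a": observations[0], "b": observations[1]}}
    return {"final": f"Total ratings: {observations[2]}"}

def run_agent(question, planner, max_steps=5):
    """Loop: ask the planner for an action, execute the tool, feed the result back."""
    observations = []
    for _ in range(max_steps):
        action = planner(question, observations)
        if "final" in action:
            return action["final"]
        tool = TOOLS.get(action["tool"])
        if tool is None:  # guardrail: refuse tools that are not registered
            return "Error: unknown tool"
        observations.append(tool(**action["args"]))
    return "Error: step limit reached"

print(run_agent("How many ratings do matrix and inception have together?", stub_planner))
```

The step limit and the tool allow-list are the two simplest guardrails; both matter more once a real model is choosing the actions.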

Key idea: agents turn an LLM from a text generator into a system that can take actions — which raises the bar for evaluation and guardrails.

Read · Alto, agents/tools chapters.

Module 6 · Sessions 28–30

Industry perspective, ethics & final exam

Closing the loop: a second industry guest, a group discussion on ethical AI, and the comprehensive final exam.

Module learning outcomes

Relate course concepts to real data-science practice.
Critically evaluate bias, fairness and feedback loops in deployed systems.
Demonstrate comprehensive mastery in the final exam.

SESSION 28 · LIVE IN-PERSON

Invited guest — Data Science guest

Objective: connect the course to real-world data-science practice through an industry guest.

Real systems at scale: war stories on data, modelling and deployment.
Career & craft: what working on recommenders/chatbots in industry actually looks like.

Key idea: production data science is mostly about data quality, evaluation and iteration — the model is a small slice.

SESSION 29 · LIVE IN-PERSON

Group discussion: ethical AI & course wrap-up

Objective: critically examine bias and fairness in recommenders/chatbots and consolidate the course.

The feedback loop: data → model → serving → user → data; each stage can introduce or amplify bias.
Bias zoo: selection, exposure, conformity, position, popularity and unfairness biases.
Position-bias correction and value-aware, multi-objective recommendation (beyond pure engagement).
Exploration as a remedy: breaking the loop requires deliberately showing uncertain items. A multi-armed bandit balances exploit (the current best) vs explore (uncertain); for example, UCB adds an optimism bonus $\sqrt{2\ln t/n_i}$ that shrinks as an arm is tried more. Worked intuition, taking $\ln t=\ln 100$ in both cases: an arm seen only 4 times gets bonus $\sqrt{2\ln 100/4}\approx1.52$, vs $\sqrt{2\ln100/100}\approx0.30$ for one seen 100 times, so the rarely-shown item is given a fair chance.
Wrap-up: synthesizing recommenders + chatbots into one mental model — both estimate a conditional distribution and serve from it under constraints.

Position-bias correction: $\;R_i = \dfrac{P_{i,p}}{P_{p}}$ (observed click rate normalized by position propensity)
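A worked instance of the correction, under two assumptions that the formula leaves open: $P_{i,p}$ is taken as the observed click-through rate of item $i$ at position $p$, and $P_p$ as the probability that a user examines position $p$ at all. The log and the propensities are invented.

```python
# Hypothetical click log: impressions and clicks per (item, position).
log = {
    ("A", 1): {"impressions": 1000, "clicks": 200},
    ("B", 3): {"impressions": 1000, "clicks": 80},
}
# Hypothetical position propensities P_p (assumed known, e.g. estimated from randomized slots).
propensity = {1: 1.0, 3: 0.25}

def corrected_relevance(item, position):
    """R_i = P_{i,p} / P_p: observed click rate divided by the position propensity."""
    stats = log[(item, position)]
    ctr = stats["clicks"] / stats["impressions"]
    return ctr / propensity[position]

for (item, position) in log:
    print(item, position, round(corrected_relevance(item, position), 2))
```

Item B has the lower raw click rate, but after dividing by the propensity of its slot it comes out ahead of A. This is the kind of reversal the correction is meant to surface.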

UCB1 arm selection: $\;a_t=\arg\max_i\Big(\bar x_i+\sqrt{\tfrac{2\ln t}{n_i}}\Big)$ — mean reward plus an exploration bonus that grows with total pulls $t$ and shrinks with this arm's pulls $n_i$.
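A short simulation of the selection rule on Bernoulli arms. The true click rates, the seed and the number of rounds are arbitrary choices for illustration, and `ucb_bonus` is a helper name introduced here.

```python
import math
import random

def ucb_bonus(t, n_i):
    """Exploration bonus sqrt(2 ln t / n_i)."""
    return math.sqrt(2 * math.log(t) / n_i)

def run_ucb1(true_means, rounds, seed=0):
    """Simulate UCB1 on Bernoulli arms; returns the number of pulls per arm."""
    rng = random.Random(seed)
    k = len(true_means)
    pulls = [0] * k
    totals = [0.0] * k
    for t in range(1, rounds + 1):
        if t <= k:
            arm = t - 1  # pull every arm once so that n_i > 0
        else:
            arm = max(range(k), key=lambda i: totals[i] / pulls[i] + ucb_bonus(t, pulls[i]))
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward
    return pulls

# Bonus for a rarely tried arm vs a heavily tried one, both with ln t = ln 100.
print(round(ucb_bonus(100, 4), 2), round(ucb_bonus(100, 100), 2))
# Hypothetical arms with true click rates 0.2, 0.5 and 0.8.
print(run_ucb1([0.2, 0.5, 0.8], rounds=2000))
```

Weak arms keep getting an occasional pull because their bonus grows with $\ln t$ while their own count stays low, but most of the traffic moves to the best arm.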

Key idea: a recommender trains on data it generated, so unchecked bias compounds — "rich-get-richer" popularity loops are the default, not the exception.

Connects to: exploration (bandits) is the principled answer to the MNAR/feedback-loop problems first raised in Session 3 — you fix biased data by changing what you collect, not only how you model it.

Try in lab · Feedback loop & bias zoo · Multi-armed bandits · Value-aware ranking

Read · Aggarwal ch. 13 (advanced topics); a fairness-in-recsys survey of your choice.

SESSION 30 · LIVE IN-PERSON

Final exam · exam · 30%

Objective: demonstrate comprehensive mastery of recommendation and chatbot theory and methods.

Coverage: paradigms, similarity, matrix factorization, evaluation metrics, bias, and chatbot/LLM fundamentals.
Format: comprehensive written exam worth 30% of the course grade (passing grade 5/10).

Key idea: revise across the whole arc — the exam tests connections between topics, not isolated definitions.

Key concepts — glossary

A quick reference to the core vocabulary of the course. Many terms have a live demo in the interactive lab.

Utility function: The score $g(u,i)$ a recommender assigns to each user-item pair; recommendation maximizes it per user.

Explicit feedback: Deliberate signals like star ratings or likes — clear in meaning but sparse.

Implicit feedback: Behavioural signals like clicks, views and purchases — abundant but ambiguous (absence ≠ dislike).

Cold-start problem: The difficulty of recommending for new users or new items that have no interaction history.

Data sparsity: The user-item matrix is mostly empty; the vast majority of pairs have no observed feedback.

Non-personalized recommender: Recommends the same items to everyone, e.g. by random sampling or popularity.

Bayesian average: Shrinks an item's mean rating toward the global mean in proportion to how few ratings it has, fixing low-count outliers.

Content-based filtering: Recommends items whose metadata/features match a profile built from the user's past likes; handles new items but risks filter bubbles.

Collaborative filtering (CF): Recommends from patterns in feedback across many users, without item metadata; strong but cold-start-prone.

User-based vs item-based CF: Neighborhood CF using similar users versus similar items; item-based is more stable and cacheable at scale.

Cosine similarity: $\cos\theta=\frac{A\cdot B}{\lVert A\rVert\lVert B\rVert}$, the cosine of the angle between two vectors; the standard similarity in CF and content-based methods.

K-nearest neighbours (KNN): Predicts via a similarity-weighted average over the K most similar users or items.

Matrix factorization: Approximates $R\approx UV^{T}$, learning low-rank latent user and item factors that capture hidden tastes.

SVD with biases: $\hat r_{ui}=\mu+b_u+b_i+q_i^{T}p_u$ — factorization plus global/user/item bias terms.

Factorization machine: Generalizes MF to model all pairwise feature interactions, unifying recommendation with regression.

TF-IDF: Term frequency × inverse document frequency — weights words by how informative they are in a corpus.

Embedding: A learned dense vector representing a user, item or word; static (Word2Vec) or contextual (BERT).

Hybrid recommender: Combines several recommenders — weighted, switching or mixed — to offset each one's weaknesses.

Context-aware RS (CARS): Extends $g:U\times I\to\mathbb{R}$ to $g:U\times I\times C\to\mathbb{R}$ via pre-filtering, post-filtering or contextual modeling.

RMSE / MAE: Root-mean-square / mean-absolute error — regression metrics for rating prediction; RMSE penalizes large errors more.

Precision / Recall / F1: Classification metrics for relevant-vs-not recommendations; F1 is their harmonic mean.

NDCG: Normalized Discounted Cumulative Gain — a ranking metric that discounts relevance by position and normalizes against the ideal order.

MAP / MRR: Mean Average Precision and Mean Reciprocal Rank ($1/\text{rank}$ of the first hit) — order-sensitive ranking metrics.

Beyond-accuracy metrics: Coverage, diversity (ILS), novelty and personalization — quality dimensions accuracy alone misses.

Learning to rank: Optimizing the ranking directly: pointwise, pairwise (BPR) or listwise objectives.

Multi-armed bandit: Online learning that balances exploiting the current best arm against exploring uncertain ones (ε-greedy, Thompson sampling).

Feedback loop & bias: Recommenders train on data they generated; selection, exposure, position and popularity biases can compound over time.

Production funnel: Retrieval → filtering → scoring → ordering — narrows billions of candidates while increasing compute per item.

Intent / NLU: Natural-language understanding: classifying a user's intent and extracting entities from an utterance.

Autoencoding vs autoregressive: BERT-style bidirectional masked modelling (understanding) vs GPT-style next-token prediction (generation).

Self-attention: $\text{softmax}(QK^{T}/\sqrt{d_k})V$ — the transformer mechanism letting each token attend to all others.

RAG: Retrieval-Augmented Generation — retrieve relevant documents and condition the LLM's answer on them to reduce hallucination.

MLOps: Engineering practice for reliably training, deploying, monitoring and retraining ML systems in production.

Utility matrix: The (mostly empty) user × item table of observed feedback; recommendation = predicting its missing entries.

Long tail: The many niche items that each get little feedback but collectively form most of the catalog; recommenders' job is to surface them.

MNAR (missing-not-at-random): Observed entries aren't a random sample — users and the system both choose what gets rated, biasing offline evaluation.

Pearson correlation: Cosine similarity on mean-centred ratings; removes per-user rating-scale bias, often preferred for explicit feedback.

Latent factor: A learned hidden dimension (e.g. "amount of comedy") in MF/embeddings that captures taste without being hand-labelled.

Regularization (λ): A penalty on parameter size ($\lambda\lVert\cdot\rVert^2$) that combats overfitting on sparse data; central to MF training.

SGD vs ALS: Two ways to fit MF: stochastic gradient descent (one rating at a time) vs alternating least squares (fix one factor, solve the other in closed form, parallelizable).

BPR: Bayesian Personalized Ranking — a pairwise loss that ranks an observed item above a sampled unobserved one, $\ln\sigma(\hat r_{ui}-\hat r_{uj})$.

Precision@K / Recall@K: Top-K ranking metrics: fraction of the K shown that are relevant, vs fraction of all relevant items that were surfaced.

DCG / IDCG: Discounted Cumulative Gain and its ideal (best-possible) value; their ratio is NDCG.

Coverage: The fraction of the catalog a recommender ever recommends; low coverage signals popularity bias.

Diversity / novelty: How dissimilar (intra-list) and how unexpected the recommended items are — beyond-accuracy quality dimensions.

Two-tower model: Separate user and item encoders whose outputs are compared by dot product; item vectors are precomputed for fast ANN retrieval.

Neural CF (NCF): Collaborative filtering with a learned non-linear scoring function replacing the fixed dot product.

Graph neural network (GNN): Learns node embeddings by message-passing over the user-item graph; LightGCN/NGCF capture multi-hop collaborative signal.

Sequential recommendation: Predicts the next item from the ordered history of interactions (GRU4Rec, SASRec, BERT4Rec).

Over-smoothing: Failure mode where too many GNN layers make all node embeddings converge, destroying personalization.

Tokenization: Splitting text into subword units (e.g. BPE) that an LLM maps to embeddings; the input granularity of transformers.

Positional encoding: Information added to token embeddings so the order-agnostic attention mechanism can use word order.

Pre-training vs fine-tuning: Self-supervised learning on broad data, then task/instruction adaptation (incl. RLHF) on narrower data.

RLHF: Reinforcement Learning from Human Feedback — aligns an LLM's outputs to human preferences after pre-training.

Vector database: Stores embeddings and serves fast approximate-nearest-neighbour search — the retrieval backbone of RAG and two-tower reco.

Hallucination: An LLM stating fluent but false content; mitigated by grounding (RAG), guardrails and evaluation.

LLM agent / tool use: An LLM that calls functions/APIs to take actions and reason in multiple steps, not just generate text.

Position bias: Higher items get clicked more regardless of relevance; corrected by normalizing clicks by position propensity.

Exploration vs exploitation: The bandit trade-off between trying uncertain options and serving the current best; UCB and Thompson sampling balance it.

UCB: Upper Confidence Bound — picks the arm with the highest mean-plus-optimism-bonus, $\bar x_i+\sqrt{2\ln t/n_i}$.

A/B test: Online randomized experiment comparing variants on real users — the ground truth offline metrics only approximate.

Multi-stage ranking: Production pattern: cheap retrieval narrows millions of items, then progressively richer models score the survivors.

Annotated bibliography

Recommended texts from the official syllabus.

Charu C. Aggarwal — Recommender Systems: The Textbook

Springer, 2016 · ISBN 3319296574

The course's foundational reference. Rigorous, comprehensive coverage of neighborhood and model-based CF, content-based and knowledge-based methods, evaluation, context-awareness and advanced topics. Best paired with Modules 1–3 (sessions on data, algorithms, similarity, MF and evaluation).

D. Li, J. Lian, L. Zhang, K. Ren, D. Lu, T. Wu, X. Xie — Recommender Systems: Frontiers and Practices

Springer · ISBN 9819989639

A modern, industry-oriented complement to Aggarwal. Covers deep-learning, sequential and graph recommenders plus system design and MLOps — directly supporting Modules 3–4 (sessions on MLOps, deep learning, sequential and graph recommendation).

Valentina Alto — Building LLM Powered Applications

Packt · ISBN 1835462316

The chatbot/LLM reference for the course. Practical guidance on building intelligent apps and agents with large language models, including RAG and deployment — the core text for Module 5 (sessions on chatbots, LLM fundamentals, operationalization and advanced application building).

Useful primary papers referenced above: Koren et al. "Matrix Factorization Techniques for Recommender Systems"; Rendle "Factorization Machines"; He et al. "Neural Collaborative Filtering" and "LightGCN"; Kang & McAuley "SASRec"; Vaswani et al. "Attention Is All You Need"; Devlin et al. "BERT"; Lewis et al. "Retrieval-Augmented Generation".