Chatbots & Recommendation Engines
An interactive lab covering every core concept from the course — from utility functions to multi-armed bandits to production pipelines. Drag, click, sample, sweep. Watch the math change.
1 · Foundations: The Utility Function
A recommender learns $g:U\times I\to\mathbb{R}$ — a score for every (user, item) pair. For each user, pick the item that maximizes utility: $i^*_u = \arg\max_{i\in I} g(u,i)$.
🎛 Interactive user-item utility matrix
Click any cell to "rate" it. The system shows the row max (best item per user). Toggle the view to see a ranking vs a rating.
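As a minimal sketch of the argmax above, the utility function can be stored as a nested dictionary. The users, items, and scores below are made up for illustration; they are not taken from the lab's matrix.

```python
# Hypothetical utility scores g(u, i); all names and values are illustrative.
utility = {
    "alice": {"matrix": 0.9, "titanic": 0.2, "up": 0.6},
    "bob": {"matrix": 0.1, "titanic": 0.8, "up": 0.7},
}

def best_item(g, user):
    """Return argmax_i g(user, i): the row max for one user."""
    scores = g[user]
    return max(scores, key=scores.get)

def rank_items(g, user):
    """Return all items sorted by decreasing utility (the ranking view)."""
    scores = g[user]
    return sorted(scores, key=scores.get, reverse=True)

print(best_item(utility, "alice"))  # matrix
print(rank_items(utility, "bob"))   # ['titanic', 'up', 'matrix']
```

The same scores serve both views: the rating view reads the numbers, the ranking view only uses their order.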
ML Taxonomy
| Paradigm | Input | Output |
|---|---|---|
| Classification | features X | categorical Y |
| Regression | features X | continuous Y |
| Recommendation | (user, item) | rating R |
Rating vs Ranking
Rating: predicts numeric utility, e.g. $g(u,i)=0.8$.
Ranking: orders items, e.g. $i_j \succ i_1 \succ i_2$.
2 · The Five Recommender Families
🃏 Pick a strategy and see what it does
Content-Based vs CF — Trade-offs
| Content-Based | Collaborative Filtering | |
|---|---|---|
| Data | Item metadata | Feedback only |
| Cold-start | Handles new items | ❌ Fails on new users/items |
| Risk | Over-specialization (filter bubble) | Data sparsity, scalability |
| Transparency | ✅ Explainable | Harder to explain |
3 · Explicit vs Implicit Feedback
Explicit signals (★ ratings, likes) are clear but sparse. Implicit signals (clicks, views, purchases) are abundant but ambiguous. A common bridge: map ordered implicit signals onto a rating scale.
🔄 Implicit → Explicit converter
Drag the bars to assign star ratings to each implicit signal. The example shows the default mapping.
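One way to sketch the bridge in code is a lookup table from signal to stars, keeping the strongest signal seen for a (user, item) pair. The signal names and star values here are assumptions for illustration, not the lab's default mapping.

```python
# Hypothetical mapping from ordered implicit signals to a 1-5 star scale.
SIGNAL_TO_STARS = {"view": 1, "click": 2, "add_to_cart": 3, "purchase": 5}

def implicit_to_rating(events):
    """Map the implicit events for one (user, item) pair to a star rating.

    Uses the strongest signal observed; returns None if no event maps.
    """
    stars = [SIGNAL_TO_STARS[e] for e in events if e in SIGNAL_TO_STARS]
    return max(stars) if stars else None

print(implicit_to_rating(["view", "click", "view"]))  # 2
print(implicit_to_rating(["view", "purchase"]))       # 5
print(implicit_to_rating([]))                         # None
```

Taking the maximum is only one design choice; counting or time-decaying events are equally plausible alternatives.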
4 · Non-Personalized Recommenders
Same recommendations for everyone. Two flavors: Random draws each predicted rating from $\mathcal{N}(\mu,\sigma^2)$ fitted to the training ratings, and Popular ranks items by mean rating.
🎲 Random vs Popular vs Bayesian-Average
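Bayesian average shrinks each item's mean toward the global mean, and the fewer ratings an item has, the stronger the shrinkage. This fixes the "1 rating of 5.0 outranks 1M ratings averaging 4.9" problem. $C$ is a tunable prior weight: in effect, a number of pseudo-ratings placed at the global mean.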
Formula: $\bar{r}_i^{\text{Bayes}} = \dfrac{n_i\,\bar{r}_i + C\,\mu_{\text{global}}}{n_i + C}$
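A quick numeric check of the formula, as a sketch. The global mean of 3.5 and the prior weight $C = 50$ are assumed values chosen for illustration.

```python
def bayesian_average(n_i, mean_i, global_mean, C):
    """Shrink an item's mean rating toward the global mean.

    n_i: number of ratings, mean_i: raw item mean,
    C: prior weight (pseudo-ratings at the global mean).
    """
    return (n_i * mean_i + C * global_mean) / (n_i + C)

# Assumed values: global mean 3.5, prior weight C = 50.
one_rating = bayesian_average(1, 5.0, 3.5, 50)
many_ratings = bayesian_average(1_000_000, 4.9, 3.5, 50)

print(round(one_rating, 3))    # 3.529
print(round(many_ratings, 3))  # 4.9
```

The single 5.0 rating is pulled almost all the way to the global mean, while the heavily rated item barely moves, so the ranking comes out the right way round.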
5 · Performance Metrics — Regression
When predicting numeric ratings: MAE, MSE, RMSE, R². Drag the true and predicted ratings below to watch the errors update.
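The four metrics can be computed by hand in a few lines. This is a sketch with made-up ratings; it assumes the true ratings are not all identical, since R² is undefined otherwise.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for paired true/predicted ratings."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Illustrative ratings only.
metrics = regression_metrics([4, 3, 5, 2], [3.5, 3, 4, 2.5])
print(metrics)
```

For these numbers MAE is 0.5, MSE is 0.375, and R² is 0.7; RMSE is larger than MAE because squaring weights the one-star miss more heavily.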
📏 Live regression-metric calculator
6 · Performance Metrics — Classification
When recommendations are binary (clicked / not-clicked): Precision, Recall, F1, Accuracy.
🟦 Confusion-matrix sandbox
Click cells to flip predictions. Precision/Recall/F1 update instantly.
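The sandbox numbers follow directly from the four confusion-matrix counts. A minimal sketch, with illustrative labels where 1 means clicked and 0 means not clicked:

```python
def classification_metrics(y_true, y_pred):
    """Return precision, recall, F1 and accuracy for binary labels (1 / 0)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy}

# Illustrative labels only: 2 TP, 1 FP, 2 FN, 3 TN.
result = classification_metrics([1, 1, 1, 0, 0, 0, 0, 1],
                                [1, 1, 0, 1, 0, 0, 0, 0])
print(result)
```

Here precision is 2/3, recall is 1/2, F1 is 4/7, and accuracy is 5/8. Returning 0.0 for an empty denominator is a convention chosen for this sketch.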
📈 ROC curve — drag the decision threshold
Each item has a true label (relevant / not) and a model score. As you slide the threshold, items flip predicted class. Watch the confusion matrix, TPR, FPR, and the point on the ROC curve move.
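Sliding the threshold amounts to recomputing one (FPR, TPR) point per threshold value. A sketch with made-up labels and scores, assuming at least one relevant and one non-relevant item:

```python
def roc_point(labels, scores, threshold):
    """Return (FPR, TPR) when items scoring >= threshold are predicted relevant."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives

# Illustrative data: 1 = relevant, 0 = not relevant.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

# Sweep from strict to lenient; the points trace the ROC curve.
curve = [roc_point(labels, scores, t) for t in (1.0, 0.75, 0.5, 0.2, 0.0)]
for fpr, tpr in curve:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The strictest threshold gives (0, 0), the most lenient gives (1, 1), and every threshold in between trades false positives for true positives.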
7 · Ranking Metrics — CG, DCG, NDCG, MRR, P@K
CG sums relevance over the top-K but ignores position. DCG discounts lower positions via $\log_2$. NDCG normalizes DCG against that of an ideal ranking. MRR averages $1/\text{rank}$ of the first hit across users. P@K is the fraction of the top-K items that are relevant.
🪜 Drag items to re-rank — DCG / NDCG / MRR / P@K update live
Each item has a relevance score (green pill). Reorder by dragging. The ideal ranking (sorted by relevance) gives NDCG = 1.0.
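The metrics above can be sketched for a single ranked list as follows. This assumes the linear-gain form of DCG, $\sum_k rel_k / \log_2(k+1)$; an exponential-gain variant ($2^{rel}-1$) also exists. The relevance values are made up.

```python
import math

def cg(relevances, k):
    """Cumulative gain: sum of relevance in the top k, position ignored."""
    return sum(relevances[:k])

def dcg(relevances, k):
    """Discounted cumulative gain with a log2(position + 1) discount."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances[:k], start=1))

def ndcg(relevances, k):
    """DCG divided by the DCG of the ideal (relevance-sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

def reciprocal_rank(relevances):
    """1 / rank of the first relevant item; MRR is its mean over users."""
    for pos, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1 / pos
    return 0.0

def precision_at_k(relevances, k):
    """Fraction of the top k items with non-zero relevance."""
    return sum(1 for rel in relevances[:k] if rel > 0) / k

ranked = [0, 3, 2, 0, 1]  # illustrative relevance scores in ranked order
print(cg(ranked, 3), round(dcg(ranked, 3), 3), round(ndcg(ranked, 3), 3))
print(reciprocal_rank(ranked), round(precision_at_k(ranked, 3), 3))
```

Moving the relevance-3 item to the top would raise DCG and NDCG but leave CG@3 unchanged, which is exactly the position blindness that DCG corrects.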
8 · Beyond-Accuracy Metrics
Coverage: % of catalog ever recommended. Personalization: how different recs are across users. Diversity (ILS, intra-list similarity): how similar the items within one list are; lower ILS means more variety. Novelty: how unfamiliar or unpopular the recommended items are.
🌈 Diversity vs Coverage vs Personalization sandbox
Toggle a "diversification" knob — see all four metrics shift.
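Two of these metrics are easy to sketch directly. Coverage follows the definition above; for personalization, definitions vary, and this sketch assumes one minus the mean pairwise Jaccard overlap between users' lists. The recommendation lists and catalog size are made up.

```python
from itertools import combinations

def coverage(recs, catalog_size):
    """Fraction of the catalog that appears in at least one user's list."""
    recommended = set().union(*recs.values())
    return len(recommended) / catalog_size

def personalization(recs):
    """1 - mean pairwise Jaccard overlap between users' lists (an assumed
    definition; other formulations use cosine similarity instead)."""
    overlaps = [len(set(a) & set(b)) / len(set(a) | set(b))
                for a, b in combinations(recs.values(), 2)]
    return 1 - sum(overlaps) / len(overlaps)

# Illustrative top-3 lists for three users, catalog of 10 items.
recs = {"u1": ["a", "b", "c"], "u2": ["a", "b", "d"], "u3": ["a", "e", "f"]}
print(coverage(recs, catalog_size=10))   # 0.6
print(round(personalization(recs), 2))   # 0.7
```

A popularity-only recommender would score 0 on this personalization measure, because every user would receive the same list.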
9 · Cosine Similarity
The cosine of the angle between two vectors. Range $[0,1]$ for non-negative vectors such as ratings ($[-1,1]$ in general). Works for both CF (user/item vectors) and content-based (TF-IDF/BERT vectors).
$\cos(\theta) = \dfrac{A\cdot B}{\lVert A\rVert\,\lVert B\rVert}$
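A direct translation of the formula, as a sketch. Returning 0.0 for a zero vector is a convention chosen here, since the cosine is undefined in that case.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||) for equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [0, 1]))            # 0.0 (orthogonal)
print(round(cosine_similarity([3, 4], [4, 3]), 2))  # 0.96
print(round(cosine_similarity([1, 2], [2, 4]), 2))  # 1.0 (same direction)
```

The last pair shows that cosine ignores magnitude: a user who rates everything twice as high points in the same direction.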
📐 Drag two 2D vectors — see the angle
10 · Memory-Based Collaborative Filtering
User-based: find users with similar history, recommend what they liked. Item-based: find items similar to those the user already liked.
👥 KNN-CF sandbox
Click a row (user) or column (item) — the system highlights the K nearest neighbours (by cosine over rated cells) and predicts the missing cells.
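A minimal user-based sketch of the sandbox logic: cosine over co-rated items, then a similarity-weighted average of the K nearest neighbours' ratings. The rating data and the weighted-average predictor are assumptions for illustration; the lab may aggregate differently.

```python
import math

# Illustrative ratings; a missing key means the user has not rated the item.
ratings = {
    "u1": {"a": 5, "b": 4, "c": 1},
    "u2": {"a": 4, "b": 5, "c": 1, "d": 5},
    "u3": {"a": 1, "b": 1, "c": 5, "d": 1},
    "u4": {"a": 5, "b": 5, "d": 4},
}

def cosine_over_corated(ra, rb):
    """Cosine similarity restricted to items both users have rated."""
    common = set(ra) & set(rb)
    if not common:
        return 0.0
    dot = sum(ra[i] * rb[i] for i in common)
    norm_a = math.sqrt(sum(ra[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(rb[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict(ratings, user, item, k=2):
    """Similarity-weighted mean of the k most similar users who rated item."""
    candidates = [(cosine_over_corated(ratings[user], r), r[item])
                  for other, r in ratings.items()
                  if other != user and item in r]
    neighbours = sorted(candidates, reverse=True)[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return None  # no usable neighbour
    return sum(sim * rating for sim, rating in neighbours) / total

print(round(predict(ratings, "u1", "d"), 2))
```

For `u1`, the two nearest neighbours are `u4` and `u2`, so the prediction for item `d` lands between their ratings of 4 and 5. Item-based CF is the same computation on the transposed matrix.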
11 · Matrix Factorization & SVD
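Decompose the sparse user-item matrix $R \approx U V^T$ into low-rank user and item factor matrices, where row $u$ of $U$ is the user vector $p_u$ and row $i$ of $V$ is the item vector $q_i$. SVD-style prediction with biases ($\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases):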
$\hat r_{ui} = \mu + b_u + b_i + q_i^T p_u$
🧮 Train a tiny SVD live (gradient descent)
Watch RMSE drop as the model fits. Adjust factors / learning rate / epochs and re-train.
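The training loop behind such a demo can be sketched with plain stochastic gradient descent on the squared error of $\hat r_{ui}$. The ratings, hyperparameters, and L2 regularization term below are assumptions for illustration, not the lab's actual settings.

```python
import math
import random

def train_mf(ratings, n_users, n_items, n_factors=2, lr=0.02, reg=0.02,
             epochs=300, seed=0):
    """Fit mu + b_u + b_i + q_i . p_u by SGD; return (predict, rmse_history)."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean
    b_u = [0.0] * n_users
    b_i = [0.0] * n_items
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]

    def predict(u, i):
        return mu + b_u[u] + b_i[i] + sum(p * q for p, q in zip(P[u], Q[i]))

    history = []
    for _ in range(epochs):
        for u, i, r in ratings:  # fixed order; real code usually shuffles
            err = r - predict(u, i)
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                p_uf, q_if = P[u][f], Q[i][f]
                P[u][f] += lr * (err * q_if - reg * p_uf)
                Q[i][f] += lr * (err * p_uf - reg * q_if)
        sq = sum((r - predict(u, i)) ** 2 for u, i, r in ratings)
        history.append(math.sqrt(sq / len(ratings)))  # training RMSE
    return predict, history

# Illustrative (user, item, rating) triples for 4 users and 4 items.
data = [(0, 0, 5), (0, 1, 4), (0, 2, 1), (1, 0, 4), (1, 1, 5), (1, 3, 5),
        (2, 0, 1), (2, 2, 5), (2, 3, 1), (3, 1, 5), (3, 3, 4)]
predict, history = train_mf(data, n_users=4, n_items=4)
print(f"RMSE after epoch 1: {history[0]:.3f}, after last: {history[-1]:.3f}")
```

The history tracks training RMSE only; a held-out split is needed to see whether more factors or epochs start to overfit.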
12 · Content-Based: BoW & TF-IDF
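BoW counts words. TF-IDF weights them by how rare they are in the corpus: $\text{TF-IDF}(t,d) = \text{TF}(t,d)\cdot \log\dfrac{N}{|\{d:t\in d\}|}$, where $N$ is the number of documents and the denominator counts the documents containing term $t$.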
📚 Type documents — see BoW & TF-IDF matrices
Edit any document. Vocabulary, BoW matrix, IDFs, and TF-IDF matrix recompute on every keystroke.
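The recomputation can be sketched in a few lines. This assumes whitespace tokenization, raw counts for TF, and the natural logarithm for IDF; the formula above does not fix the log base, and libraries often add smoothing. The documents are made up.

```python
import math
from collections import Counter

def bow_and_tfidf(docs):
    """Return (vocab, BoW matrix, IDF per term, TF-IDF matrix)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    counts = [Counter(doc) for doc in tokenized]
    bow = [[c[tok] for tok in vocab] for c in counts]  # raw term counts
    n_docs = len(docs)
    idf = {tok: math.log(n_docs / sum(1 for c in counts if tok in c))
           for tok in vocab}
    tfidf = [[tf * idf[tok] for tf, tok in zip(row, vocab)] for row in bow]
    return vocab, bow, idf, tfidf

docs = ["the cat sat", "the dog sat", "the cat ran"]  # illustrative corpus
vocab, bow, idf, tfidf = bow_and_tfidf(docs)
print(vocab)       # ['cat', 'dog', 'ran', 'sat', 'the']
print(bow[0])      # [1, 0, 0, 1, 1]
print(idf["the"])  # 0.0
```

"the" appears in every document, so its IDF is zero and it drops out of every TF-IDF vector, while the rarer "dog" and "ran" receive the highest weights.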
13 · Embeddings & PCA Visualization
Word2Vec is static (one vector per word). BERT is contextual (vectors depend on context). High-dim embeddings can be projected to 2D with PCA.
🧭 Mini embedding playground
Toggle words; watch how clustering changes when "synonyms" are merged.
Stemming vs Lemmatization
| Stemming | Lemmatization |
|---|---|
| studies → studi | studies → study |
| organizations → organ | better → good |
| Rule-based, fast, may produce non-words | Context-aware NLP, slower, valid words |
14 · Hybrid Recommenders
Combine recommenders to get the best of each. Three flavors: Weighted, Switching, Mixed.
🎚 Weighted hybrid: drag the slider
$\hat r_{ui} = w_1\cdot \text{CF}(u,i) + w_2 \cdot \text{CB}(u,i)$
🪢 Mixed hybrid: rank aggregation
Two recommenders rank items. We sum the rank scores to make a final list.
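Both hybrids fit in a short sketch. Two assumptions are made here: the weights sum to one ($w_2 = 1 - w_1$), and a "rank score" is a Borda-style count of $n$ minus the zero-based position. The item lists are made up.

```python
def weighted_hybrid(cf_score, cb_score, w_cf=0.7):
    """Blend CF and content-based scores; assumes w_cb = 1 - w_cf."""
    return w_cf * cf_score + (1 - w_cf) * cb_score

def aggregate_ranks(*rankings):
    """Sum Borda-style rank scores (n - position) across rankings.

    Assumes every ranking contains the same items. Ties are broken
    alphabetically so the output is deterministic.
    """
    n = len(rankings[0])
    scores = {item: 0 for item in rankings[0]}
    for ranking in rankings:
        for pos, item in enumerate(ranking):
            scores[item] += n - pos
    return sorted(scores, key=lambda item: (-scores[item], item))

print(round(weighted_hybrid(4.0, 3.0), 2))  # 3.7

cf_ranking = ["a", "b", "c", "d"]  # illustrative lists
cb_ranking = ["b", "d", "a", "c"]
print(aggregate_ranks(cf_ranking, cb_ranking))  # ['b', 'a', 'd', 'c']
```

Item `b` wins the aggregate because it is near the top of both lists, even though only one recommender ranks it first.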
15 · Context-Aware Recommenders (CARS)
2D RS: $g:U\times I \to \mathbb{R}$. CARS adds context: $g:U\times I\times C\to\mathbb{R}$.
🧊 Pre-filter / Post-filter / Modeling — pick a paradigm
15b · Data Splits & the Cold-Start Problem
Pick a splitting scheme; see which users/items disappear from the training set, and which "cold" rows must fall back to a non-personalized model at serving time.
✂️ Train/Test split visualizer + cold-start detector
16 · Grid Search vs Bayesian Optimization
Grid search tries every combination on a discrete grid. Bayesian optimization uses a probabilistic surrogate to guess where to look next.
🔍 Search the same 2D loss surface with both methods
Hidden loss surface (dark = low loss). Watch how Bayesian opt converges with far fewer evaluations.
17 · Bias & the Feedback Loop
Data collection → model learning → serving → user interaction → back to data. Each stage can introduce or amplify bias.
♾ The feedback loop, visualized
📉 Position-bias correction
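A higher position gets more clicks even if the item is no better. We correct via $R_i = P_{i,p}/P_p$, where $P_{i,p}$ is the observed click rate of item $i$ shown at position $p$ and $P_p$ is the baseline click rate of position $p$ across all items.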
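A small numeric sketch of the correction, reading $P_{i,p}$ as the item's observed click rate at position $p$ and $P_p$ as the average click rate of that position. Both this reading and the numbers below are assumptions for illustration.

```python
# Assumed baseline click rate per position (P_p); values are illustrative.
POSITION_CTR = {1: 0.30, 2: 0.15, 3: 0.10}

def corrected_relevance(clicks, impressions, position):
    """R_i = P_{i,p} / P_p: observed CTR divided by the position baseline."""
    observed_ctr = clicks / impressions
    return observed_ctr / POSITION_CTR[position]

item_a = corrected_relevance(clicks=300, impressions=1000, position=1)
item_b = corrected_relevance(clicks=150, impressions=1000, position=3)
print(round(item_a, 2))  # 1.0
print(round(item_b, 2))  # 1.5
```

Item A collects twice as many clicks, but only as many as its top slot would predict. Item B outperforms its slot by 50%, so it scores higher after correction.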
Bias zoo
| Bias | Stage | Cause | Effect |
|---|---|---|---|
| Selection | user→data | users choose what to rate | MNAR data |
| Exposure | user→data | only shown subset | unseen ≠ negative |
| Conformity | user→data | peer-pressure rating | label drift |
| Position | user→data | top items get more clicks | noisy positives |
| Popularity | model→user | algo + imbalance | rich-get-richer |
| Unfairness | model→user | imbalanced groups | discrimination |
18 · Multi-Armed Bandits — Explore vs Exploit
Each item is an "arm" with an unknown reward distribution. Strategies decide whether to exploit the current best or explore uncertain alternatives.
🎰 ε-Greedy live simulator
Five arms with hidden true CTRs. With probability ε a pull picks a random arm; otherwise it picks the arm with the best estimated CTR so far. Watch cumulative regret.
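The simulator's core loop can be sketched as follows. The true CTRs, ε, and pull count are made-up values, and regret is measured here as the expected reward lost relative to always pulling the best arm.

```python
import random

def epsilon_greedy(true_ctrs, epsilon=0.1, n_pulls=5000, seed=0):
    """Simulate epsilon-greedy; return (pull counts, CTR estimates, regret)."""
    rng = random.Random(seed)
    n_arms = len(true_ctrs)
    pulls = [0] * n_arms
    estimates = [0.0] * n_arms
    best_ctr = max(true_ctrs)
    regret = 0.0
    for _ in range(n_pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: uniform random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = 1 if rng.random() < true_ctrs[arm] else 0
        pulls[arm] += 1
        # Incremental mean update of the arm's estimated CTR.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
        regret += best_ctr - true_ctrs[arm]  # expected regret of this pull
    return pulls, estimates, regret

ctrs = [0.02, 0.05, 0.10, 0.04, 0.07]  # hidden true CTRs (illustrative)
pulls, estimates, regret = epsilon_greedy(ctrs)
print(pulls, round(regret, 1))
```

With a fixed ε, regret keeps growing linearly because exploration never stops; decaying ε over time is a common remedy.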
🔔 Thompson Sampling — Beta-distribution playground
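Each arm gets a $\text{Beta}(\alpha,\beta)$ posterior over its reward rate, where $\alpha$ counts successes and $\beta$ counts failures (each plus any prior pseudo-count). We sample from each distribution and pick the arm with the highest sample. Uncertainty drives exploration: arms with few pulls have wide distributions and occasionally produce the highest sample.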
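A sketch of the sampling loop for Bernoulli rewards, assuming a uniform $\text{Beta}(1,1)$ prior on every arm. The true CTRs and pull count are made up.

```python
import random

def thompson_sampling(true_ctrs, n_pulls=2000, seed=0):
    """Simulate Thompson Sampling; return final (alpha, beta) per arm."""
    rng = random.Random(seed)
    n_arms = len(true_ctrs)
    alpha = [1] * n_arms  # 1 + successes (uniform prior, an assumption)
    beta = [1] * n_arms   # 1 + failures
    for _ in range(n_pulls):
        # Draw one plausible CTR per arm from its current posterior.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        if rng.random() < true_ctrs[arm]:
            alpha[arm] += 1  # click observed
        else:
            beta[arm] += 1   # no click
    return alpha, beta

ctrs = [0.02, 0.05, 0.10, 0.04, 0.07]  # hidden true CTRs (illustrative)
alpha, beta = thompson_sampling(ctrs)
for a, (s, f) in enumerate(zip(alpha, beta)):
    print(f"arm {a}: pulls={s + f - 2}, posterior mean={s / (s + f):.3f}")
```

Unlike ε-greedy there is no exploration rate to tune: as an arm's posterior narrows, it is sampled only when it is plausibly the best.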
19 · Production Pipeline — Retrieval → Filter → Score → Order
Production RS can't score billions of items at request time. The canonical 4-stage funnel narrows the candidate set while increasing compute per item.
🧴 The funnel — click each stage
Architecture latency types
| Architecture | Behavior | Latency |
|---|---|---|
| Batch | Precomputed, cache lookup | ~ms (but stale) |
| Real-time | Compute at request | Higher (model inference) |
| Multi-stage | Retrieval + ranking | Moderate |
| Hybrid | Precompute retrieval, real-time score | Moderate |
20 · Learning to Rank & Future Directions
Three formulations: pointwise (score each item independently, as classification or regression), pairwise (minimize inversions, e.g. BPR), listwise (optimize the whole list, e.g. RankALS, CLiMF).
⚖️ Pointwise vs Pairwise vs Listwise — animated
Value-aware recommendations — beyond engagement
Margin / revenue · content quality · long-term retention · fairness · content creation guidance. Single-objective CF is being replaced by multi-objective systems.
Netflix Prize takeaways
- Temporal dynamics matter — user tastes drift.
- User ratings are noisy and inconsistent.
- Heterogeneous ensembles win benchmarks.
- Offline gains (RMSE) ≠ production value: Netflix shipped only two of the models, SVD (matrix factorization) and RBM.