Chatbots & Recommendation Engines

An interactive lab covering every core concept from the course — from utility functions to multi-armed bandits to production pipelines. Drag, click, sample, sweep. Watch the math change.

12 chapters

25+ interactive demos

0 dependencies to install

1 · Foundations: The Utility Function

A recommender learns $g:U\times I\to\mathbb{R}$ — a score for every (user, item) pair. For each user, pick the item that maximizes utility: $i^*_u = \arg\max_{i\in I} g(u,i)$.

🎛 Interactive user-item utility matrix

Click any cell to "rate" it. The system shows the row max (best item per user). Toggle the view to see a ranking vs a rating.

Users: 6 Items: 8 show ranking (per row)

ML Taxonomy

Paradigm	Input	Output
Classification	features X	categorical Y
Regression	features X	continuous Y
Recommendation	(user, item)	rating R

Rating vs Ranking

Rating: predicts numeric utility, e.g. $g(u,i)=0.8$.
Ranking: orders items, e.g. $i_3 \succ i_1 \succ i_2$.
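
A minimal sketch of the two views, assuming a tiny hand-made utility table (the user names, item names, and scores are invented for illustration). `best_item` is the argmax $i^*_u$ and `ranking` orders one user's row by utility.

```python
# Hypothetical utility scores g(u, i); all values are made up for illustration.
utility = {
    "alice": {"item_a": 0.8, "item_b": 0.3, "item_c": 0.5},
    "bob": {"item_a": 0.1, "item_b": 0.9, "item_c": 0.4},
}


def best_item(scores):
    """Return argmax_i g(u, i) for one user's row of scores."""
    return max(scores, key=scores.get)


def ranking(scores):
    """Return the user's items ordered from highest to lowest utility."""
    return sorted(scores, key=scores.get, reverse=True)


for user, scores in utility.items():
    print(user, best_item(scores), ranking(scores))
```

The rating view keeps the numbers; the ranking view keeps only their order.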

2 · The Five Recommender Families

🃏 Pick a strategy and see what it does

Content-Based vs CF — Trade-offs

	Content-Based	Collaborative Filtering
Data	Item metadata	Feedback only
Cold-start	Handles new items	❌ Fails on new users/items
Risk	Over-specialization (filter bubble)	Data sparsity, scalability
Transparency	✅ Explainable	Harder to explain

3 · Explicit vs Implicit Feedback

Explicit signals (★ ratings, likes) are clear but sparse. Implicit signals (clicks, views, purchases) are abundant but ambiguous. A common bridge: map ordered implicit signals onto a rating scale.

🔄 Implicit → Explicit converter

Drag the bars to assign star ratings to each implicit signal. The example shows the default mapping.

4 · Non-Personalized Recommenders

Same recommendations for everyone. Two flavors: Random draws each predicted rating from a normal distribution $\mathcal{N}(\mu,\sigma)$ fitted to the training ratings, and Popular ranks items by mean rating.

🎲 Random vs Popular vs Bayesian-Average

Bayesian average shrinks each item's mean toward the global mean, and the fewer ratings an item has, the stronger the shrinkage. This fixes the "1 rating of 5.0 outranks 1M ratings averaging 4.9" problem.

Bayesian prior weight C: 20

Formula: $\bar{r}_i^{\text{Bayes}} = \dfrac{n_i\,\bar{r}_i + C\,\mu_{\text{global}}}{n_i + C}$
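
The formula can be checked by hand on the "1 rating vs 1M ratings" case. This sketch assumes a global mean of 3.5, a made-up value, and uses the slider default C = 20.

```python
def bayesian_average(n_i, mean_i, global_mean, C=20):
    """Shrink an item's mean rating toward the global mean with prior weight C."""
    return (n_i * mean_i + C * global_mean) / (n_i + C)


GLOBAL_MEAN = 3.5  # assumed value, for illustration only

one_vote = bayesian_average(1, 5.0, GLOBAL_MEAN)            # pulled close to 3.5
many_votes = bayesian_average(1_000_000, 4.9, GLOBAL_MEAN)  # stays close to 4.9

print(round(one_vote, 3), round(many_votes, 3))
```

With a single rating, the prior dominates. With a million ratings, the prior is negligible, so the well-rated popular item ranks first.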

5 · Performance Metrics — Regression

When predicting numeric ratings: MAE, MSE, RMSE, R². Drag the true and predicted ratings below to watch the errors update.
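
As a sketch of what the calculator computes, the four metrics can be written directly from their textbook definitions. The example ratings are made up.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for paired true/predicted ratings."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot  # undefined when all true ratings are equal
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


metrics = regression_metrics([5, 3, 4, 1], [4, 3, 5, 2])
print(metrics)
```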

📏 Live regression-metric calculator

6 · Performance Metrics — Classification

When recommendations are binary (clicked / not-clicked): Precision, Recall, F1, Accuracy.
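
A minimal sketch of the four metrics computed from confusion-matrix counts. Labels are 1 for clicked and 0 for not clicked, and the example vectors are invented.

```python
def classification_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and accuracy from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


result = classification_metrics([1, 1, 1, 0, 0, 0, 0, 1],
                                [1, 0, 1, 1, 0, 0, 0, 1])
print(result)
```

Returning 0.0 when a denominator is empty is a convention chosen for this sketch, not part of the definitions.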

🟦 Confusion-matrix sandbox

Click cells to flip predictions. Precision/Recall/F1 update instantly.

📈 ROC curve — drag the decision threshold

Each item has a true label (relevant / not) and a model score. As you slide the threshold, items flip predicted class. Watch the confusion matrix, TPR, FPR, and the point on the ROC curve move.
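
One way to picture the slider: each threshold yields one (FPR, TPR) point, and sweeping it traces the curve. The sketch below assumes an item is predicted relevant when its score is at or above the threshold. Labels and scores are made up.

```python
def roc_point(labels, scores, threshold):
    """Return (FPR, TPR) when items scoring >= threshold are predicted relevant."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]

# Sweep from a strict threshold (nothing predicted) to a lax one (everything predicted).
curve = [roc_point(labels, scores, t) for t in (1.0, 0.5, 0.0)]
print(curve)
```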

Threshold τ: 0.50

7 · Ranking Metrics — CG, DCG, NDCG, MRR, P@K

CG sums relevance for top-K but ignores position. DCG discounts lower positions via $\log_2$. NDCG normalizes against an ideal ranking. MRR averages $1/\text{rank}$ of the first hit across users.
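
A sketch of these metrics for a single ranked list. It assumes the linear-gain form of DCG, $\sum_k rel_k/\log_2(k+1)$ (an exponential-gain variant also exists), and treats any relevance above zero as a hit. The relevance list is invented.

```python
import math


def cg(rels, k):
    """Cumulative gain: sum of relevance in the top k, position ignored."""
    return sum(rels[:k])


def dcg(rels, k):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(rels[:k], start=1))


def ndcg(rels, k):
    """DCG divided by the DCG of the ideal (relevance-sorted) ordering."""
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(rels):
    """1 / rank of the first relevant item; MRR is its mean over users."""
    for pos, rel in enumerate(rels, start=1):
        if rel > 0:
            return 1.0 / pos
    return 0.0


def precision_at_k(rels, k):
    """Fraction of the top k items that are relevant."""
    return sum(1 for rel in rels[:k] if rel > 0) / k


rels = [0, 3, 2, 0, 1]  # relevance of the items in displayed order
print(cg(rels, 5), round(dcg(rels, 5), 3), round(ndcg(rels, 5), 3),
      reciprocal_rank(rels), precision_at_k(rels, 5))
```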

🪜 Drag items to re-rank — DCG / NDCG / MRR / P@K update live

Each item has a relevance score (green pill). Reorder by dragging. The ideal ranking (sorted by relevance) gives NDCG = 1.0.

K = 5

8 · Beyond-Accuracy Metrics

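Coverage: % of catalog ever recommended. Personalization: how different recs are across users. Diversity (ILS, intra-list similarity): within-list category variety, where lower similarity means a more diverse list. Novelty: how unfamiliar or unexpected the recommended items are.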
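
These metrics have several competing definitions. The sketch below picks one simple version of each and leaves novelty out. Coverage is the share of the catalog appearing in any list, personalization is one minus the mean pairwise Jaccard overlap between users' lists, and diversity is the share of item pairs in a list with different categories. The lists and categories are made up.

```python
from itertools import combinations


def coverage(rec_lists, catalog_size):
    """Fraction of the catalog that appears in at least one list."""
    recommended = set().union(*rec_lists)
    return len(recommended) / catalog_size


def personalization(rec_lists):
    """1 - mean Jaccard overlap between users' lists (needs >= 2 users)."""
    pairs = list(combinations(rec_lists, 2))
    overlap = sum(len(set(a) & set(b)) / len(set(a) | set(b))
                  for a, b in pairs)
    return 1.0 - overlap / len(pairs)


def intra_list_diversity(rec_list, category):
    """Share of item pairs with different categories (needs >= 2 items)."""
    pairs = list(combinations(rec_list, 2))
    different = sum(1 for a, b in pairs if category[a] != category[b])
    return different / len(pairs)


recs = [["a", "b", "c"], ["a", "b", "d"], ["a", "e", "f"]]
category = {"a": "drama", "b": "drama", "c": "comedy",
            "d": "comedy", "e": "horror", "f": "drama"}

print(coverage(recs, catalog_size=10),
      personalization(recs),
      intra_list_diversity(recs[0], category))
```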

🌈 Diversity vs Coverage vs Personalization sandbox

Toggle a "diversification" knob — see all four metrics shift.

Diversification λ: 0.2 Users: 20

9 · Cosine Similarity

The cosine of the angle between two rating vectors. Range $[-1,1]$ in general and $[0,1]$ for non-negative vectors. Works for both CF (user/item vectors) and content-based (TF-IDF/BERT vectors).

$\cos(\theta) = \dfrac{A\cdot B}{\lVert A\rVert\,\lVert B\rVert}$
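
A direct translation of the formula. Returning 0.0 for a zero vector is a convention chosen for this sketch, because the formula is undefined there.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch; mathematically undefined
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [0, 1]))   # orthogonal vectors
print(cosine_similarity([1, 2], [2, 4]))   # same direction, different length
print(cosine_similarity([1, 0], [-1, 0]))  # opposite directions
```

The second call shows that only direction matters: scaling a vector does not change the similarity.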

📐 Drag two 2D vectors — see the angle

10 · Memory-Based Collaborative Filtering

User-based: find users with similar history, recommend what they liked. Item-based: find items similar to those the user already liked.

👥 KNN-CF sandbox

Click a row (user) or column (item) — the system highlights the K nearest neighbours (by cosine over rated cells) and predicts the missing cells.
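
A rough sketch of the user-based mode, under two assumptions: similarity is cosine over the cells both users have rated, and the prediction is the similarity-weighted mean of the K nearest neighbours' ratings. The rating table is invented.

```python
import math

# Hypothetical ratings: user -> {item: rating}. u1 has not rated i4.
ratings = {
    "u1": {"i1": 5, "i2": 4, "i3": 1},
    "u2": {"i1": 5, "i2": 5, "i3": 1, "i4": 5},
    "u3": {"i1": 1, "i2": 2, "i3": 5, "i4": 1},
    "u4": {"i1": 4, "i2": 4, "i4": 4},
}


def cosine_on_shared(a, b):
    """Cosine similarity restricted to items rated by both users."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict(user, item, k=2):
    """Similarity-weighted mean rating of the k most similar raters of item."""
    candidates = [(cosine_on_shared(ratings[user], other_ratings),
                   other_ratings[item])
                  for other, other_ratings in ratings.items()
                  if other != user and item in other_ratings]
    neighbours = sorted(candidates, reverse=True)[:k]
    total_sim = sum(sim for sim, _ in neighbours)
    if total_sim == 0:
        return None  # no usable neighbour: fall back to another model
    return sum(sim * rating for sim, rating in neighbours) / total_sim


prediction = predict("u1", "i4", k=2)
print(prediction)
```

Here u2 and u4 agree with u1 far more than u3 does, so the prediction lands between their ratings of 5 and 4. Cosine over shared cells can look inflated when the overlap is only one or two items.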

K = 2

11 · Matrix Factorization & SVD

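Decompose the sparse user-item matrix $R \approx U V^T$ into low-rank user and item factor matrices, where $p_u$ is row $u$ of $U$ and $q_i$ is row $i$ of $V$. SVD prediction with biases ($\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases):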

$\hat r_{ui} = \mu + b_u + b_i + q_i^T p_u$
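
A minimal sketch of how such a model could be trained with stochastic gradient descent on regularized squared error. This is the Funk-style "SVD" common in recommender libraries, not a true singular value decomposition. The defaults mirror the demo's controls, and the rating triples are made up.

```python
import math
import random


def train_mf(ratings, n_users, n_items, n_factors=2, lr=0.01,
             epochs=100, reg=0.05, seed=0):
    """Fit mu + b_u + b_i + q_i . p_u by SGD; return parameters and RMSE history."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u = [0.0] * n_users
    b_i = [0.0] * n_items
    p = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    history = []
    for _ in range(epochs):
        squared_error = 0.0
        for u, i, r in ratings:
            pred = mu + b_u[u] + b_i[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))
            err = r - pred
            squared_error += err * err
            # Gradient steps with L2 regularization on biases and factors.
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
        history.append(math.sqrt(squared_error / len(ratings)))
    return mu, b_u, b_i, p, q, history


# Hypothetical (user, item, rating) triples.
data = [(0, 0, 5), (0, 1, 4), (0, 2, 1), (1, 0, 4), (1, 1, 5), (1, 3, 2),
        (2, 1, 2), (2, 2, 5), (2, 3, 4), (3, 0, 5), (3, 2, 1), (3, 3, 2)]

mu, b_u, b_i, p, q, history = train_mf(data, n_users=4, n_items=4)
print(round(history[0], 3), round(history[-1], 3))
```

`history` holds the training RMSE accumulated during each epoch, so it should fall as the biases and factors fit the data.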

🧮 Train a tiny SVD live (gradient descent)

Watch RMSE drop as the model fits. Adjust factors / learning rate / epochs and re-train.

n_factors: 2 lr: 0.010 epochs: 100 reg: 0.05

12 · Content-Based: BoW & TF-IDF

BoW counts words. TF-IDF weights them by how rare they are in the corpus: $\text{TF-IDF}(t,d) = \text{TF}(t,d)\cdot \log\dfrac{N}{|\{d:t\in d\}|}$
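
A small sketch of both matrices, assuming whitespace tokenization, raw counts for TF, and the natural logarithm (the formula above does not fix the base). The three documents are made up.

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat ate the fish"]

tokenized = [doc.split() for doc in docs]
vocab = sorted({term for doc in tokenized for term in doc})

# Bag of words: raw term counts per document.
bow = [Counter(doc) for doc in tokenized]

# IDF: log(N / number of documents containing the term).
n_docs = len(docs)
doc_freq = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}
idf = {t: math.log(n_docs / doc_freq[t]) for t in vocab}

# TF-IDF: count times IDF (Counter returns 0 for absent terms).
tfidf = [{t: counts[t] * idf[t] for t in vocab} for counts in bow]

for row in tfidf:
    print({t: round(w, 2) for t, w in row.items() if w > 0})
```

"the" occurs in every document, so its IDF is $\log(3/3)=0$ and it drops out. "fish" occurs in only one document and gets the largest weight.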

📚 Type documents — see BoW & TF-IDF matrices

Edit any document. Vocabulary, BoW matrix, IDFs, and TF-IDF matrix recompute on every keystroke.

13 · Embeddings & PCA Visualization

Word2Vec is static (one vector per word). BERT is contextual (vectors depend on context). High-dim embeddings can be projected to 2D with PCA.

🧭 Mini embedding playground

Toggle words; watch how clustering changes when "synonyms" are merged.

Stemming vs Lemmatization

Stemming	Lemmatization
studies → studi	studies → study
organizations → organ	better → good
Rule-based, fast, may produce non-words	Context-aware NLP, slower, valid words

14 · Hybrid Recommenders

Combine recommenders to get the best of each. Three flavors: Weighted, Switching, Mixed.

🎚 Weighted hybrid: drag the slider

$\hat r_{ui} = w_1\cdot \text{CF}(u,i) + w_2 \cdot \text{CB}(u,i)$

w₁ (CF weight): 0.60

🪢 Mixed hybrid: rank aggregation

Two recommenders rank items. Each assigns rank-based points to its items; summing the points per item and sorting gives the final list (a Borda-style count).
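
Both combination rules in this section fit in a few lines. The sketch assumes $w_2 = 1 - w_1$ for the weighted hybrid. For the mixed hybrid it gives an item at position $k$ of an $n$-item list $n-k$ points, with ties broken alphabetically. Both choices are illustrative.

```python
def weighted_hybrid(cf_score, cb_score, w1=0.6):
    """Blend CF and content-based scores; assumes the weights sum to 1."""
    w2 = 1.0 - w1
    return w1 * cf_score + w2 * cb_score


def rank_points(ranking):
    """Give the first item n points, the second n - 1, and so on."""
    n = len(ranking)
    return {item: n - pos for pos, item in enumerate(ranking)}


def merge_rankings(ranking_a, ranking_b):
    """Sum rank points from both lists; items missing from a list get 0."""
    points_a, points_b = rank_points(ranking_a), rank_points(ranking_b)
    items = set(ranking_a) | set(ranking_b)
    totals = {item: points_a.get(item, 0) + points_b.get(item, 0)
              for item in items}
    return sorted(items, key=lambda item: (-totals[item], item))


print(weighted_hybrid(4.0, 3.0))
print(merge_rankings(["x", "y", "z"], ["y", "z", "w"]))
```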

15 · Context-Aware Recommenders (CARS)

2D RS: $g:U\times I \to \mathbb{R}$. CARS adds context: $g:U\times I\times C\to\mathbb{R}$.

🧊 Pre-filter / Post-filter / Modeling — pick a paradigm

15b · Data Splits & the Cold-Start Problem

Pick a splitting scheme; see which users/items disappear from the training set, and which "cold" rows must fall back to a non-personalized model at serving time.
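
As one possible scheme, the sketch below does a plain random split of (user, item) interactions and reports which test users and items never occur in training. The interaction list and the 25% test fraction are assumptions for illustration. Other schemes, such as temporal or per-user holdout, would change which rows go cold.

```python
import random


def split_and_find_cold(interactions, test_fraction=0.25, seed=0):
    """Randomly split interactions; return train, test, cold users, cold items."""
    rng = random.Random(seed)
    shuffled = interactions[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    train_users = {u for u, _ in train}
    train_items = {i for _, i in train}
    # "Cold" = appears in the test set but has no training history.
    cold_users = {u for u, _ in test} - train_users
    cold_items = {i for _, i in test} - train_items
    return train, test, cold_users, cold_items


# Hypothetical interactions; u4 and i5 appear only once each.
interactions = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u2", "i3"),
                ("u3", "i2"), ("u3", "i3"), ("u4", "i4"), ("u1", "i5")]

train, test, cold_users, cold_items = split_and_find_cold(interactions)
print(len(train), len(test), cold_users, cold_items)
```

Any user or item in the cold sets would need the non-personalized fallback at serving time.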

✂️ Train/Test split visualizer + cold-start detector

16 · Grid Search vs Bayesian Optimization

Grid search tries every combination on a discrete grid. Bayesian optimization fits a probabilistic surrogate of the loss and uses it to choose the most promising point to evaluate next.
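
The grid-search half is easy to sketch. The `loss` function below is a made-up stand-in for "train a model and measure validation error", and the grid values are arbitrary. Bayesian optimization is not shown because it needs a surrogate model.

```python
from itertools import product


def loss(lr, reg):
    """Hypothetical smooth loss surface with its minimum at lr=0.03, reg=0.1."""
    return 1000 * (lr - 0.03) ** 2 + 10 * (reg - 0.1) ** 2


grid = {
    "lr": [0.001, 0.01, 0.03, 0.1],
    "reg": [0.01, 0.05, 0.1, 0.5],
}

# Evaluate every combination: cost grows as the product of the grid sizes.
results = [((lr, reg), loss(lr, reg))
           for lr, reg in product(grid["lr"], grid["reg"])]
best_params, best_loss = min(results, key=lambda pair: pair[1])

print(len(results), best_params, best_loss)
```

Two hyperparameters with four values each already cost 16 evaluations, which is why a guided search pays off when each evaluation means retraining a model.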

🔍 Search the same 2D loss surface with both methods

Hidden loss surface (dark = low loss). Watch how Bayesian opt converges with far fewer evaluations.

17 · Bias & the Feedback Loop

Data collection → model learning → serving → user interaction → back to data. Each stage can introduce or amplify bias.

♾ The feedback loop, visualized

📉 Position-bias correction

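A higher position gets more clicks even if the item is no better. We correct via $R_i = P_{i,p}/P_p$: the click rate $P_{i,p}$ observed for item $i$ at position $p$ is divided by $P_p$, the probability that position $p$ is examined at all, leaving an estimate $R_i$ of the item's position-independent relevance.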
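
A small numeric sketch of the correction, reading $P_{i,p}$ as the click rate observed for item $i$ at position $p$ and $P_p$ as the chance that position $p$ is examined. The propensities and click rates are invented. In practice $P_p$ has to be estimated, for example from randomized placements.

```python
def debiased_relevance(observed_ctr, position, propensity):
    """R_i = P_{i,p} / P_p: observed click rate divided by position propensity."""
    return observed_ctr / propensity[position]


# Hypothetical probability that a user examines each position.
propensity = {1: 1.0, 2: 0.6, 3: 0.4}

item_a = debiased_relevance(0.30, position=1, propensity=propensity)
item_b = debiased_relevance(0.24, position=3, propensity=propensity)

print(round(item_a, 2), round(item_b, 2))
```

Item B gets fewer raw clicks than item A, but only because it sits at position 3. After the correction it looks like the more relevant item.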

Bias zoo

Bias	Stage	Cause	Effect
Selection	user→data	users choose what to rate	MNAR data
Exposure	user→data	only shown subset	unseen ≠ negative
Conformity	user→data	peer-pressure rating	label drift
Position	user→data	top items click more	noisy positives
Popularity	model→user	algo + imbalance	rich-get-richer
Unfairness	model→user	imbalanced groups	discrimination

18 · Multi-Armed Bandits — Explore vs Exploit

Each item is an "arm" with an unknown reward distribution. Strategies decide whether to exploit the current best or explore uncertain alternatives.

🎰 ε-Greedy live simulator

Five arms with hidden true CTRs. With probability ε a pull is random; otherwise the arm with the best estimated CTR is pulled. Watch cumulative regret.
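
A minimal simulation sketch. The five CTRs are made up, rewards are Bernoulli clicks, and regret is tracked in expectation as the best CTR minus the pulled arm's CTR, summed over pulls.

```python
import random


def epsilon_greedy(true_ctrs, epsilon=0.10, n_pulls=5000, seed=0):
    """Simulate epsilon-greedy; return pulls per arm and cumulative regret."""
    rng = random.Random(seed)
    n_arms = len(true_ctrs)
    pulls = [0] * n_arms
    clicks = [0.0] * n_arms
    best_ctr = max(true_ctrs)
    regret = 0.0
    for _ in range(n_pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: uniform random arm
        else:
            # Exploit: arm with the best estimated CTR (unpulled arms count as 0).
            estimates = [clicks[a] / pulls[a] if pulls[a] else 0.0
                         for a in range(n_arms)]
            arm = max(range(n_arms), key=estimates.__getitem__)
        reward = 1.0 if rng.random() < true_ctrs[arm] else 0.0
        pulls[arm] += 1
        clicks[arm] += reward
        regret += best_ctr - true_ctrs[arm]
    return pulls, regret


true_ctrs = [0.02, 0.05, 0.10, 0.15, 0.30]  # hidden from the algorithm
pulls, regret = epsilon_greedy(true_ctrs)
print(pulls, round(regret, 1))
```

Because a fixed fraction of pulls stays random forever, regret keeps growing linearly even after the best arm has been found.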

ε: 0.10

🔔 Thompson Sampling — Beta-distribution playground

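Each arm gets a $\text{Beta}(\alpha,\beta)$ distribution over its unknown CTR, where $\alpha$ grows with observed successes and $\beta$ with observed failures. We sample from each distribution and pick the arm with the highest sample. Uncertainty drives exploration: arms with few observations have wide distributions and occasionally produce the highest sample.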
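
A sketch of the loop, assuming Bernoulli clicks and a uniform Beta(1, 1) prior for every arm. The prior and the CTRs are illustrative choices.

```python
import random


def thompson_sampling(true_ctrs, n_pulls=5000, seed=0):
    """Simulate Thompson sampling; return final alpha and beta per arm."""
    rng = random.Random(seed)
    n_arms = len(true_ctrs)
    alpha = [1] * n_arms  # 1 + observed successes (uniform prior)
    beta = [1] * n_arms   # 1 + observed failures
    for _ in range(n_pulls):
        # Draw one plausible CTR per arm from its current Beta distribution.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)
        if rng.random() < true_ctrs[arm]:
            alpha[arm] += 1
        else:
            beta[arm] += 1
    return alpha, beta


true_ctrs = [0.02, 0.05, 0.10, 0.15, 0.30]  # hidden from the algorithm
alpha, beta = thompson_sampling(true_ctrs)

for a in range(len(true_ctrs)):
    pulls = alpha[a] + beta[a] - 2
    print(a, pulls, round(alpha[a] / (alpha[a] + beta[a]), 3))
```

Unlike ε-greedy, there is no exploration rate to tune. Exploration fades by itself as the distributions narrow.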

19 · Production Pipeline — Retrieval → Filter → Score → Order

Production RS can't score billions of items at request time. The canonical 4-stage funnel narrows the candidate set while increasing compute per item.
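
A toy sketch of the four stages. Every function here is a hypothetical stand-in: genre matching replaces a real candidate generator such as nearest-neighbour search, and a popularity formula replaces a real ranking model.

```python
# Hypothetical catalog of 1000 items.
catalog = [{"id": n,
            "genre": "drama" if n % 2 else "comedy",
            "in_stock": n % 5 != 0,
            "popularity": 1.0 / (1 + n)}
           for n in range(1000)]


def retrieve(user, items, n=100):
    """Stage 1: cheap candidate generation over the whole catalog."""
    return [it for it in items if it["genre"] in user["genres"]][:n]


def filter_candidates(user, items):
    """Stage 2: business rules, e.g. drop unavailable or already-seen items."""
    return [it for it in items
            if it["in_stock"] and it["id"] not in user["seen"]]


def score(user, items):
    """Stage 3: per-item model; a toy popularity score stands in here."""
    return [(it, it["popularity"]) for it in items]


def order(scored, k=10):
    """Stage 4: sort by score and keep the top k item ids."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [it["id"] for it, _ in ranked[:k]]


user = {"genres": {"drama"}, "seen": {1, 3}}
stage1 = retrieve(user, catalog)
stage2 = filter_candidates(user, stage1)
top = order(score(user, stage2), k=5)
print(len(catalog), len(stage1), len(stage2), top)
```

The printed sizes show the funnel: 1000 items, then 100 candidates, then 78 after filtering, then 5 results. Only the survivors reach the costly scoring stage.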

🧴 The funnel — click each stage

Architecture latency types

Architecture	Behavior	Latency
Batch	Precomputed, cache lookup	~ms (but stale)
Real-time	Compute at request	Higher (model inference)
Multi-stage	Retrieval + ranking	Moderate
Hybrid	Precompute retrieval, real-time score	Moderate

20 · Learning to Rank & Future Directions

Three formulations: pointwise (score or classify each item independently), pairwise (minimize inversions, e.g. BPR), listwise (optimize the whole list, e.g. RankALS, CLiMF).
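
The difference shows up in what each loss takes as input. This sketch uses binary cross-entropy as the pointwise loss and the BPR-style $-\log\sigma(s_{pos}-s_{neg})$ as the pairwise loss. Listwise losses are omitted because they need the whole ranked list. The scores are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pointwise_loss(score, label):
    """Binary cross-entropy for one item: looks at a single (score, label)."""
    prob = sigmoid(score)
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


def pairwise_loss(score_pos, score_neg):
    """BPR-style loss: only the score difference of a (pos, neg) pair matters."""
    return -math.log(sigmoid(score_pos - score_neg))


print(round(pointwise_loss(0.0, 1), 4))
print(round(pairwise_loss(2.0, 0.5), 4))  # correct order: small loss
print(round(pairwise_loss(0.5, 2.0), 4))  # inverted pair: large loss
```

Adding the same constant to both scores leaves the pairwise loss unchanged. It cares about order, not calibrated values.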

⚖️ Pointwise vs Pairwise vs Listwise — animated

Value-aware recommendations — beyond engagement

Margin / revenue · content quality · long-term retention · fairness · content creation guidance. Single-objective CF is being replaced by multi-objective systems.

Netflix Prize takeaways

Temporal dynamics matter — user tastes drift.
User ratings are noisy and inconsistent.
Heterogeneous ensembles win benchmarks.
Offline gains (RMSE) ≠ production value: Netflix shipped only two of the component models, matrix factorization (SVD) and RBM, rather than the full winning ensemble.