Worked Example: A Movie Recommender
One project, end to end. We build a MovieLens-style recommender in Python — memory-based collaborative filtering, matrix factorization trained with SGD, a content-based cold-start fallback, and a small intent chatbot that turns a natural-language request into recommendations. Every formula is worked by hand and matched to runnable code.
1 · Overview goal, sessions exercised, stack
The syllabus states a concrete deliverable: "Students will learn to build an end-to-end recommendation solution using Python at the level that is required in a large company." This page is exactly that build, written so you can copy each block into a notebook and run it.
Goal. Given a sparse table of user × movie ratings, predict the rating a user would give an unseen movie, then return a ranked top-$K$ list. We train and compare four models, evaluate them with one error metric and two ranking metrics, handle users/items with no history, and wrap the whole thing behind a tiny chatbot.
Sessions exercised (see the full program):
Stack.
2 · Data & the problem setup MovieLens-style
We use the classic MovieLens 100K shape: a long table of (userId, movieId, rating, timestamp) with explicit 1–5 star ratings, plus a movie table carrying titles and genres. Ratings are sparse — most (user, item) cells are unobserved. That sparsity is the whole challenge.
import numpy as np
import pandas as pd
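# MovieLens 100K: ratings + movie metadata (u.item is pipe-separated; genres are 0/1 flag columns)
ratings = pd.read_csv("u.data", sep="\t",
                      names=["userId", "movieId", "rating", "ts"])
# u.item has 24 fields: 5 metadata fields, then 19 genre flags (named g0..g18 here)
movies = pd.read_csv("u.item", sep="|", encoding="latin-1",
                     names=["movieId", "title", "release_date",
                            "video_release_date", "imdb_url"]
                           + [f"g{i}" for i in range(19)],
                     usecols=range(24))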
n_users = ratings.userId.nunique()
n_items = ratings.movieId.nunique()
density = len(ratings) / (n_users * n_items)
print(f"{n_users} users x {n_items} items, density={density:.3%}")
# -> 943 users x 1682 items, density=6.305% (93.7% of cells are empty)
We split per user, by time — each user's earliest 80% of ratings train the model, the latest 20% are held out. This is the realistic "predict the future from the past" split and it surfaces cold-start naturally (a test movie may never appear in training). Try the three split schemes in the split visualizer.
def time_split(ratings, test_frac=0.2):
    """Leave the most recent `test_frac` of each user's history for test."""
    train_parts, test_parts = [], []
    for _, grp in ratings.groupby("userId"):
        grp = grp.sort_values("ts")
        cut = int(len(grp) * (1 - test_frac))
        train_parts.append(grp.iloc[:cut])
        test_parts.append(grp.iloc[cut:])
    return pd.concat(train_parts), pd.concat(test_parts)

train, test = time_split(ratings)

# Build a dense user x item matrix from the training rows (0 = unobserved).
# Map raw ids -> contiguous indices so they address rows/cols directly.
uid = {u: i for i, u in enumerate(train.userId.unique())}
iid = {m: j for j, m in enumerate(train.movieId.unique())}
R = np.zeros((len(uid), len(iid)))
for r in train.itertuples():
    R[uid[r.userId], iid[r.movieId]] = r.rating
Build the id maps from train only. A test movie or user missing from these maps is a genuine cold-start case handled in §6, not something to paper over by fitting on all the data.
3 · Memory-based collaborative filtering user & item
The first model needs no training: predict from neighbours. Represent each user (or item) as its row (or column) of $R$ and measure closeness with cosine similarity:
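Written out for two rating vectors $\mathbf{r}_u$ and $\mathbf{r}_v$ (rows of $R$, with unobserved cells treated as 0), the standard definition is shown here. The symbol names are assumed for illustration; this is the quantity `cosine_similarity` computes for every pair of rows in the code further down.

$$\mathrm{sim}(u,v)=\cos(\mathbf{r}_u,\mathbf{r}_v)=\frac{\mathbf{r}_u\cdot\mathbf{r}_v}{\lVert\mathbf{r}_u\rVert\,\lVert\mathbf{r}_v\rVert}$$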
User-based prediction is a similarity-weighted, mean-centered average over the $k$ nearest users who rated item $i$ — centering removes each user's personal "everything is a 4" bias:
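One standard way to write this, consistent with `predict_user_based` in the code that follows. The notation is assumed here: $N_k(u;i)$ is the set of the $k$ most similar users who rated $i$, $s_{uv}$ is the similarity, and $\bar r_u$ is user $u$'s mean rating.

$$\hat r_{ui}=\bar r_u+\frac{\sum_{v\in N_k(u;i)} s_{uv}\,(r_{vi}-\bar r_v)}{\sum_{v\in N_k(u;i)} \lvert s_{uv}\rvert}$$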
from sklearn.metrics.pairwise import cosine_similarity

def mean_center(R):
    # per-user mean over RATED cells only (zeros are "missing", not 0-star)
    mask = R > 0
    counts = mask.sum(axis=1)
    sums = R.sum(axis=1)
    means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    Rc = np.where(mask, R - means[:, None], 0.0)
    return Rc, means

Rc, user_mean = mean_center(R)
# (n_users x n_users). Centered cosine is Pearson-like: unrated cells count as
# zero deviation, so it is not Pearson restricted to co-rated items.
S = cosine_similarity(Rc)
np.fill_diagonal(S, 0.0)                     # a user is not its own neighbour

def predict_user_based(u, i, k=30):
    rated_by = np.where(R[:, i] > 0)[0]      # users who rated item i
    if rated_by.size == 0:
        return user_mean[u]                  # no signal -> fall back to mean
    sims = S[u, rated_by]
    top = rated_by[np.argsort(-sims)[:k]]    # k nearest who rated i
    w = S[u, top]
    denom = np.abs(w).sum()
    if denom == 0:
        return user_mean[u]
    return user_mean[u] + (w * Rc[top, i]).sum() / denom
Item-based CF flips the geometry: it measures similarity between item columns and predicts from the items this user has already rated. It is usually more stable in production because item–item similarities drift slowly and can be precomputed offline (an MLOps concern, S14).
SI = cosine_similarity(Rc.T)                 # (n_items x n_items)
np.fill_diagonal(SI, 0.0)

def predict_item_based(u, i, k=30):
    rated = np.where(R[u] > 0)[0]            # items u has rated
    if rated.size == 0:
        return user_mean[u]
    sims = SI[i, rated]
    top = rated[np.argsort(-sims)[:k]]
    w = SI[i, top]
    denom = np.abs(w).sum()
    if denom == 0:
        return user_mean[u]
    # item-based: center by item means is common; here we reuse user_mean for parity
    return user_mean[u] + (w * Rc[u, top]).sum() / denom
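To see the mean-centered neighbour prediction on numbers small enough to check by hand, here is a minimal sketch on a made-up 3 × 4 rating matrix. The matrix and variable names are illustrative assumptions, not MovieLens data. It uses plain NumPy instead of scikit-learn so that it runs on its own.

```python
import numpy as np

# Toy ratings: 3 users x 4 items, 0 = unobserved (illustrative values only).
R_toy = np.array([[5, 4, 0, 1],
                  [4, 5, 0, 2],
                  [1, 0, 5, 4]], dtype=float)

mask = R_toy > 0
means = R_toy.sum(axis=1) / mask.sum(axis=1)          # per-user mean over rated cells
Rc_toy = np.where(mask, R_toy - means[:, None], 0.0)  # centered, missing stays 0

# Cosine similarity between centered user rows.
norms = np.linalg.norm(Rc_toy, axis=1)
S_toy = (Rc_toy @ Rc_toy.T) / np.outer(norms, norms)
np.fill_diagonal(S_toy, 0.0)

# Predict user 0 on item 2: only user 2 rated it, so that is the single neighbour.
u, i = 0, 2
neigh = np.where(R_toy[:, i] > 0)[0]
w = S_toy[u, neigh]
pred = means[u] + (w * Rc_toy[neigh, i]).sum() / np.abs(w).sum()

print(np.round(S_toy[0], 3), round(float(pred), 3))
```

User 0 agrees with user 1 (positive similarity) and disagrees with user 2 (negative similarity). Because user 2 rated item 2 above their own mean, the prediction for user 0 lands below user 0's mean.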
4 · Matrix factorization with SGD the workhorse
Approximate the rating matrix by a low-rank product, $R \approx P\,Q^{\top}$, where row $\mathbf{p}_u\in\mathbb{R}^f$ is a user's latent taste and $\mathbf{q}_i\in\mathbb{R}^f$ is an item's latent profile. Adding biases (the SVD model popularized in the Netflix Prize) gives the prediction:
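In symbols, matching the `pred = ...` line of the training loop further down, with $\mu$ the global mean rating and $b_u$, $b_i$ the user and item biases (symbol names assumed here):

$$\hat r_{ui}=\mu+b_u+b_i+\mathbf{p}_u^{\top}\mathbf{q}_i$$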
We fit it by minimizing regularized squared error over the observed ratings $\mathcal{K}$:
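A common form of this objective, with $\lambda$ standing for the `reg` hyperparameter (the symbol is an assumption; the code uses one shared regularization strength for biases and factors):

$$\min_{P,\,Q,\,b}\;\sum_{(u,i)\in\mathcal{K}}\Bigl[\bigl(r_{ui}-\hat r_{ui}\bigr)^2+\lambda\bigl(b_u^2+b_i^2+\lVert\mathbf{p}_u\rVert^2+\lVert\mathbf{q}_i\rVert^2\bigr)\Bigr]$$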
Stochastic gradient descent visits one observed rating at a time, computes the error $e_{ui}=r_{ui}-\hat r_{ui}$, and steps every parameter against its gradient (learning rate $\gamma$):
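The per-rating updates, written to match the training loop term by term. Here $\lambda$ is `reg`, $\gamma$ is `lr`, and the constant factor 2 from the gradient is absorbed into $\gamma$; the $\mathbf{q}_i$ update uses the value of $\mathbf{p}_u$ from before its own update.

$$\begin{aligned}
b_u &\leftarrow b_u+\gamma\,(e_{ui}-\lambda\,b_u)\\
b_i &\leftarrow b_i+\gamma\,(e_{ui}-\lambda\,b_i)\\
\mathbf{p}_u &\leftarrow \mathbf{p}_u+\gamma\,(e_{ui}\,\mathbf{q}_i-\lambda\,\mathbf{p}_u)\\
\mathbf{q}_i &\leftarrow \mathbf{q}_i+\gamma\,(e_{ui}\,\mathbf{p}_u-\lambda\,\mathbf{q}_i)
\end{aligned}$$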
class MatrixFactorization:
    def __init__(self, n_users, n_items, f=40, lr=0.01, reg=0.05,
                 epochs=30, seed=0):
        self.rng = np.random.default_rng(seed)
        self.P = self.rng.normal(0, 0.1, (n_users, f))
        self.Q = self.rng.normal(0, 0.1, (n_items, f))
        self.bu = np.zeros(n_users)
        self.bi = np.zeros(n_items)
        self.lr, self.reg, self.epochs = lr, reg, epochs

    def fit(self, samples):                  # samples: list of (u, i, r)
        self.mu = np.mean([r for _, _, r in samples])
        for epoch in range(self.epochs):
            # Seeded visit order: `seed` now controls the shuffle as well as the
            # initialization, and the caller's list is left untouched.
            order = self.rng.permutation(len(samples))
            sse = 0.0
            for u, i, r in (samples[j] for j in order):
                pred = self.mu + self.bu[u] + self.bi[i] + self.P[u] @ self.Q[i]
                e = r - pred
                sse += e * e
                # update biases
                self.bu[u] += self.lr * (e - self.reg * self.bu[u])
                self.bi[i] += self.lr * (e - self.reg * self.bi[i])
                # update factors (copy P[u] so Q uses the pre-update value)
                pu = self.P[u].copy()
                self.P[u] += self.lr * (e * self.Q[i] - self.reg * self.P[u])
                self.Q[i] += self.lr * (e * pu - self.reg * self.Q[i])
            print(f"epoch {epoch:2d}  train RMSE={np.sqrt(sse/len(samples)):.4f}")
        return self

    def predict(self, u, i):
        return self.mu + self.bu[u] + self.bi[i] + self.P[u] @ self.Q[i]

samples = [(uid[r.userId], iid[r.movieId], r.rating) for r in train.itertuples()]
mf = MatrixFactorization(len(uid), len(iid), f=40, epochs=30).fit(samples)
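To make one pass of the inner loop concrete, the sketch below applies a single SGD step to one rating with two latent factors. All starting values are made up for illustration; only the update rule itself mirrors the class above.

```python
import numpy as np

# Illustrative starting state (assumed values, f = 2 latent factors).
mu, bu, bi = 3.5, 0.0, 0.0
p = np.array([0.1, -0.2])      # user factors
q = np.array([0.3, 0.1])       # item factors
r, lr, reg = 5.0, 0.01, 0.05   # observed rating, learning rate, regularization

pred = mu + bu + bi + p @ q    # 3.5 + (0.03 - 0.02)
e = r - pred                   # prediction error before the step

# Same update order as the training loop: biases first, then factors,
# with q updated from the pre-update copy of p.
bu_new = bu + lr * (e - reg * bu)
bi_new = bi + lr * (e - reg * bi)
p_old = p.copy()
p_new = p + lr * (e * q - reg * p)
q_new = q + lr * (e * p_old - reg * q)

e_after = r - (mu + bu_new + bi_new + p_new @ q_new)
print(round(float(e), 4), round(float(e_after), 4))
```

The error shrinks slightly after one step, which is the behaviour the epoch-level RMSE printout aggregates over all ratings.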
Watch RMSE fall epoch-by-epoch and sweep f, lr, reg interactively in the live SVD trainer. For production you would swap this teaching loop for a tuned library:
# Drop-in production equivalent (explicit ratings):
from surprise import SVD, Dataset, Reader, accuracy
from surprise.model_selection import train_test_split
data = Dataset.load_from_df(train[["userId", "movieId", "rating"]],
Reader(rating_scale=(1, 5)))
algo = SVD(n_factors=40, n_epochs=30, lr_all=0.01, reg_all=0.05)
algo.fit(data.build_full_trainset())
# For implicit feedback (clicks/plays) use ALS instead: `implicit.als.AlternatingLeastSquares`
5 · Evaluation RMSE · Precision@K · NDCG
Rating accuracy and ranking quality are different questions. We report RMSE for the first and two ranking metrics, Precision@K and NDCG@K, for the second.
5.1 · RMSE — did we predict the number?
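For reference, the usual definition of root-mean-square error over a set of held-out ratings $\mathcal{T}$ is given here. The symbol $\mathcal{T}$ is assumed; in the code below it is the test rows whose user and movie both appear in the training id maps.

$$\mathrm{RMSE}=\sqrt{\frac{1}{\lvert\mathcal{T}\rvert}\sum_{(u,i)\in\mathcal{T}}\bigl(r_{ui}-\hat r_{ui}\bigr)^2}$$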
def rmse_on(test, predict):
    errs = []
    for r in test.itertuples():
        if r.userId in uid and r.movieId in iid:   # skip cold rows (see §6)
            errs.append(r.rating - predict(uid[r.userId], iid[r.movieId]))
    errs = np.array(errs)
    return float(np.sqrt(np.mean(errs ** 2)))
5.2 · Precision@K — are the top-K relevant?
Call a held-out movie relevant if the user actually rated it $\ge 4$. Precision@K is the fraction of the recommended top-$K$ that are relevant:
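In symbols, $\mathrm{Precision@}K=\lvert\,\text{top-}K\cap\text{relevant}\,\rvert / K$. A minimal standalone sketch of that definition follows; the function name and the toy item labels are assumptions for illustration, separate from the evaluation loop used later on this page.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the first k recommended items that are in the relevant set."""
    top_k = list(recommended)[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k   # divide by k even if fewer than k items were recommended

# Toy example: 5 recommendations, 3 truly relevant items (one never recommended).
recommended = ["A", "B", "C", "D", "E"]
relevant = {"B", "E", "Z"}

print(precision_at_k(recommended, relevant, 5))  # 2 hits out of 5
print(precision_at_k(recommended, relevant, 3))  # 1 hit out of 3
```

Dividing by $K$ rather than by the number of items returned penalizes a model that cannot fill the list, which is the convention the evaluation code on this page also follows.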
5.3 · NDCG@K — are the relevant ones near the top?
Precision ignores order; NDCG rewards placing relevant items high via a logarithmic position discount, then normalizes by the ideal ordering so the score lands in $[0,1]$:
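Using the exponential-gain form that the `dcg` function below implements, with $rel_p$ the relevance of the item at rank $p$ (notation assumed here):

$$\mathrm{DCG@}K=\sum_{p=1}^{K}\frac{2^{rel_p}-1}{\log_2(p+1)},\qquad \mathrm{NDCG@}K=\frac{\mathrm{DCG@}K}{\mathrm{IDCG@}K}$$

For binary relevance, $2^{rel_p}-1=rel_p$, so each relevant item simply contributes $1/\log_2(p+1)$.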
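Worked example: take a top-3 list whose relevances, in ranked order, are $[1,0,1]$. The discounts $\log_2(p+1)$ for $p=1,2,3$ are $1.000$, $1.585$ and $2.000$, so $\mathrm{DCG}=\dfrac{1}{1.000}+\dfrac{0}{1.585}+\dfrac{1}{2.000}=1.500$. The ideal order $[1,1,0]$ gives $\mathrm{IDCG}=\dfrac{1}{1.000}+\dfrac{1}{1.585}=1.631$. So $\mathrm{NDCG@3}=\dfrac{1.500}{1.631}\approx\mathbf{0.920}$. Drag items to re-rank and watch this move in the NDCG demo.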
def dcg(rels):
    rels = np.asarray(rels, dtype=float)
    discounts = np.log2(np.arange(2, rels.size + 2))   # log2(p+1), p=1..K
    return float(((2 ** rels - 1) / discounts).sum())

def ndcg_at_k(ranked_rels, k, n_relevant=None):
    # Ideal list = every relevant item ranked first. When the caller knows how
    # many relevant items exist in total, use that count; otherwise fall back
    # to re-sorting the ranked list itself (binary relevance assumed).
    if n_relevant is None:
        ideal = sorted(ranked_rels, reverse=True)
    else:
        ideal = [1] * n_relevant
    idcg = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / idcg if idcg > 0 else 0.0

def ranking_metrics(predict, test, k=10, thresh=4.0):
    """For each user: score all unrated train items, rank, compare to held-out hits."""
    precs, ndcgs = [], []
    test_by_user = test.groupby("userId")
    for user, grp in test_by_user:
        if user not in uid:
            continue
        u = uid[user]
        relevant = {iid[m] for m, r in zip(grp.movieId, grp.rating)
                    if m in iid and r >= thresh}
        if not relevant:
            continue
        candidates = np.where(R[u] == 0)[0]             # unseen in training
        scores = np.array([predict(u, int(i)) for i in candidates])
        topk = candidates[np.argsort(-scores)[:k]]
        rels = [1 if int(i) in relevant else 0 for i in topk]
        precs.append(sum(rels) / k)
        # Normalize by the user's full relevant set; normalizing by the hits that
        # happened to land in the top-k would inflate NDCG.
        ndcgs.append(ndcg_at_k(rels, k, n_relevant=len(relevant)))
    return np.mean(precs), np.mean(ndcgs)
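As a quick check of the hand-worked NDCG@3 numbers, the sketch below recomputes them with the standard library only. The helper names carry a `_check` suffix to mark them as illustrative copies, not the functions used in the evaluation above.

```python
import math

def dcg_check(rels):
    # Exponential gain with a log2(p + 1) discount, ranks p starting at 1.
    return sum((2 ** r - 1) / math.log2(p + 1) for p, r in enumerate(rels, start=1))

ranked = [1, 0, 1]                       # relevance in the order we ranked
ideal = sorted(ranked, reverse=True)     # [1, 1, 0]

dcg_val = dcg_check(ranked)
idcg_val = dcg_check(ideal)
ndcg_val = dcg_val / idcg_val

print(round(dcg_val, 3), round(idcg_val, 3), round(ndcg_val, 3))
# -> 1.5 1.631 0.92
```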
5.4 · Results
Representative numbers on the MovieLens-100K time split (your run will vary by seed and split; the ordering is the stable, reportable finding):
| Model | RMSE ↓ | Precision@10 ↑ | NDCG@10 ↑ | Notes |
|---|---|---|---|---|
| Global mean $\mu$ | 1.126 | — | — | trivial floor |
| User-based CF (k=30) | 0.961 | 0.082 | 0.094 | simple, $O(n^2)$ memory |
| Item-based CF (k=30) | 0.945 | 0.090 | 0.103 | precomputable, stabler |
| MF + biases (f=40) | 0.912 | 0.108 | 0.121 | best overall; dense & fast at serve |
RMSE measures rating accuracy; Precision@K / NDCG@K measure list quality. MF wins on all three because the latent factors generalize across the sparse cells that defeat neighbour lookups.
6 · Cold-start fallback + the intent chatbot content-based + NL front
6.1 · Cold-start: a content-based fallback
CF and MF know nothing about a brand-new user or a movie nobody has rated — both are absent from the trained id maps. The fix is to lean on content: describe each movie by its genres (and, more richly, a TF-IDF vector over title/synopsis tokens), then recommend by content similarity. Build the same TF-IDF idea live in the TF-IDF demo.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Genre columns g0..g18 are 0/1; turn each movie into a "genre document".
genre_cols = [c for c in movies.columns if c.startswith("g")]
movies["doc"] = movies[genre_cols].apply(
lambda row: " ".join(g for g, v in zip(genre_cols, row) if v == 1), axis=1)
tfidf = TfidfVectorizer()
C = tfidf.fit_transform(movies["doc"]) # (n_movies x n_genre_terms), sparse
content_sim = cosine_similarity(C) # item x item content similarity
def recommend_cold(seed_titles, k=10):
    """New user gives a few liked titles -> recommend content-similar movies."""
    idx = movies.index[movies.title.isin(seed_titles)]
    if len(idx) == 0:
        return popular_fallback(k)             # truly no signal -> non-personalized
    profile = np.asarray(C[idx.to_numpy()].mean(axis=0))   # average the seed vectors
    scores = cosine_similarity(profile, C).ravel()
    seeds = set(idx)                           # build once, not once per candidate
    order = [j for j in np.argsort(-scores) if j not in seeds][:k]
    return movies.title.iloc[order].tolist()

def popular_fallback(k=10):
    # Bayesian-average popularity: shrink each movie's mean toward the global mean.
    g = train.groupby("movieId").rating
    n, mean = g.count(), g.mean()
    C0, mu0 = 20, train.rating.mean()
    score = (n * mean + C0 * mu0) / (n + C0)
    top = score.sort_values(ascending=False).head(k).index
    return movies.set_index("movieId").loc[top, "title"].tolist()
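The shrinkage in `popular_fallback` is easiest to see on two made-up movies. The sketch below reuses the prior weight of 20 from the code above; the global mean of 3.5 and the two movies are assumptions for illustration (the real code uses `train.rating.mean()`).

```python
def bayesian_average(n, mean, prior_n=20, prior_mean=3.5):
    """Shrink a movie's mean rating toward prior_mean; prior_n acts as pseudo-ratings."""
    return (n * mean + prior_n * prior_mean) / (n + prior_n)

niche = bayesian_average(n=2, mean=5.0)     # two perfect ratings
hit = bayesian_average(n=200, mean=4.5)     # many very good ratings

print(round(niche, 3), round(hit, 3))
# -> 3.636 4.409
```

The raw means would rank the two-rating movie first. After shrinkage the widely rated movie wins, which is the behaviour wanted from a non-personalized fallback.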
6.2 · An intent-based chatbot front-end
A "chatbot" here is a thin natural-language layer over the recommender (Sessions 20–23). We classify the user's intent, pull out entities (a genre, a seed title, a $K$), and dispatch to the right model. A lightweight rule/regex router is enough to demonstrate the contract; in production you would swap in an LLM with function-calling that returns the same structured call.
import re
INTENTS = {
    # Checked in insertion order, so the more specific "similar_to" comes first;
    # otherwise 'recommend something like "X"' would match "recommend_genre".
    "similar_to": re.compile(r"like|similar to|because i (liked|loved)", re.I),
    "recommend_genre": re.compile(r"recommend|suggest|show me|want to watch", re.I),
    "popular": re.compile(r"popular|trending|everyone|best", re.I),
}
GENRES = ["action", "comedy", "drama", "romance", "thriller",
          "sci-fi", "horror", "animation", "documentary"]

def parse(text):
    intent = "popular"
    for name, pat in INTENTS.items():
        if pat.search(text):
            intent = name
            break
    genre = next((g for g in GENRES if g in text.lower()), None)
    title = re.search(r'"([^"]+)"', text)              # a quoted seed movie
    # Read K from outside the quoted title, so a year such as "(1977)" is not taken as K.
    m = re.search(r"\b(\d+)\b", re.sub(r'"[^"]*"', " ", text))
    k = int(m.group(1)) if m else 5
    return {"intent": intent, "genre": genre,
            "title": title.group(1) if title else None, "k": k}
def chatbot(text, user=None):
    q = parse(text)
    if q["intent"] == "similar_to" and q["title"]:
        recs = recommend_cold([q["title"]], k=q["k"])            # content route
    elif user is not None and user in uid:
        recs = top_n_for_user(uid[user], k=q["k"], genre=q["genre"])  # MF route
    else:
        recs = popular_fallback(q["k"])                          # cold / no user
    # Report the number actually returned; a genre filter can leave fewer than k.
    return f"Here are {len(recs)} picks: " + ", ".join(recs)
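# Helpers for the MF route: matrix column -> movieId -> title / genre names.
# Columns 5..23 of u.item are the 19 genre flags, in ML-100K u.genre order.
inv_iid = {j: m for m, j in iid.items()}
title_of = movies.set_index("movieId")["title"]
GENRE_NAMES = ("unknown action adventure animation children's comedy crime documentary drama "
               "fantasy film-noir horror musical mystery romance sci-fi thriller war western").split()
genre_flags = dict(zip(movies.movieId, movies.iloc[:, 5:24].to_numpy()))
def genre_of(movie_id):
    return " ".join(n for n, v in zip(GENRE_NAMES, genre_flags[movie_id]) if v == 1)
def top_n_for_user(u, k=5, genre=None):
    candidates = np.where(R[u] == 0)[0]
    scores = np.array([mf.predict(u, int(i)) for i in candidates])
    movie_ids = [inv_iid[int(i)] for i in candidates[np.argsort(-scores)]]
    if genre:                                            # post-filter by genre
        movie_ids = [m for m in movie_ids if genre in genre_of(m)]
    return [title_of[m] for m in movie_ids[:k]]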
The contract the chatbot fulfils, illustrated:
similar_to, title=Star Wars (1977), k=3 → content route → "Here are 3 picks: Return of the Jedi (1983), Empire Strikes Back (1980), Star Trek: First Contact (1996)"
popular, genre=comedy → Bayesian-popular + genre filter
Same recommender core; the chatbot only parses intent and routes. Upgrading to an LLM means replacing parse() with a function-calling prompt that emits the identical {intent, genre, title, k} structure, or wrapping the catalog in a RAG layer (see §8).
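Whichever front-end produces the {intent, genre, title, k} structure, it can help to normalize it before dispatching. The sketch below is one hypothetical way to do that. The function name `validate_call`, the `max_k` cap, and the fallback choices are assumptions for illustration, not part of the build above.

```python
ALLOWED_INTENTS = {"recommend_genre", "similar_to", "popular"}

def validate_call(call, default_k=5, max_k=50):
    """Coerce a parsed request (regex- or LLM-produced) into the routing contract."""
    intent = call.get("intent")
    if intent not in ALLOWED_INTENTS:
        intent = "popular"                      # unknown intent -> safest route

    genre = call.get("genre")
    genre = genre.lower() if isinstance(genre, str) else None

    title = call.get("title")
    title = title if isinstance(title, str) and title.strip() else None

    try:
        k = int(call.get("k", default_k))
    except (TypeError, ValueError):
        k = default_k
    k = min(max(k, 1), max_k)                   # clamp to a sane list length

    return {"intent": intent, "genre": genre, "title": title, "k": k}

print(validate_call({"intent": "dance", "k": "3"}))
print(validate_call({"intent": "similar_to", "title": "Star Wars (1977)", "k": 1977}))
```

Keeping this step separate from the parser means the regex router and a future LLM front-end can share the same dispatch code unchanged.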
7 · Mapping to learning outcomes
How each part of this build satisfies the syllabus's stated objective — "build an end-to-end recommendation solution using Python … and develop a working chatbot":
| Project step | Course session | Skill demonstrated |
|---|---|---|
| Load, map ids, time split | S3 · Data in RS | Sparse data handling, leakage-safe splits |
| Pick & compare model families | S4 · Algorithms overview | Choosing CF vs MF vs content |
| Cosine + KNN prediction | S11 · Similarity methods | Memory-based CF, mean-centering |
| SVD with SGD + biases | S12 · Matrix factorization | Latent factors, regularized SGD |
| Numpy class, surprise/implicit | S9, S13 · Python & ML | Clean, vectorized Python |
| RMSE, Precision@K, NDCG | S5 · Evaluation | Offline metric design & model selection |
| Switching hybrid / serving routes | S14 · MLOps | Cold-start, fallbacks, precompute |
| Intent parser + dispatch | S20 · Intro to chatbots | Intent/entity extraction, routing |
| LLM function-calling upgrade | S22 · Fundamentals of LLMs | NL front-end over a tool |