Worked Example: A Movie Recommender

One project, end to end. We build a MovieLens-style recommender in Python — memory-based collaborative filtering, matrix factorization trained with SGD, a content-based cold-start fallback, and a small intent chatbot that turns a natural-language request into recommendations. Every formula is worked by hand and matched to runnable code.

4 models built

3 metrics (RMSE, P@K, NDCG)

Python numpy · pandas

1 · Overview goal, sessions exercised, stack

The syllabus states a concrete deliverable: "Students will learn to build an end-to-end recommendation solution using Python at the level that is required in a large company." This page is exactly that build, written so you can copy each block into a notebook and run it.

Goal. Given a sparse table of user × movie ratings, predict the rating a user would give an unseen movie, then return a ranked top-$K$ list. We train and compare four models, evaluate them with one error metric and two ranking metrics, handle users/items with no history, and wrap the whole thing behind a tiny chatbot.

Pipeline. load & split → memory-based CF (user / item) → matrix factorization (SGD with biases) → evaluate (RMSE, Precision@K, NDCG) → cold-start content fallback → intent chatbot front-end.

Sessions exercised (see the full program):

S3 · Data in RS S4 · Algorithms overview S5 · Evaluation & model selection S9 · Python practices S11 · Similarity methods S12 · Matrix factorization S13 · Applying ML to RS S14 · MLOps S20 · Intro to chatbots S22 · LLMs

Stack.

Python 3.11 numpy pandas scikit-learn (metrics, TF-IDF) scipy.sparse surprise / implicit (optional)

Play with the underlying ideas: KNN-CF sandbox Train a tiny SVD NDCG / P@K TF-IDF Cold-start detector

2 · Data & the problem setup MovieLens-style

We use the classic MovieLens 100K shape: a long table of (userId, movieId, rating, timestamp) with explicit 1–5 star ratings, plus a movie table carrying titles and genres. Ratings are sparse — most (user, item) cells are unobserved. That sparsity is the whole challenge.

import numpy as np
import pandas as pd

# MovieLens 100K: u.data holds tab-separated ratings; u.item holds pipe-separated
# movie metadata whose last 19 columns are 0/1 genre flags (named g0..g18 here).
ratings = pd.read_csv("u.data", sep="\t",
                      names=["userId", "movieId", "rating", "ts"])
movies  = pd.read_csv("u.item", sep="|", encoding="latin-1",
                      names=["movieId", "title", "release_date",
                             "video_release_date", "imdb_url"]
                            + [f"g{i}" for i in range(19)],
                      usecols=range(24))

n_users  = ratings.userId.nunique()
n_items  = ratings.movieId.nunique()
density  = len(ratings) / (n_users * n_items)
print(f"{n_users} users x {n_items} items, density={density:.3%}")
# -> 943 users x 1682 items, density=6.305%   (93.7% of cells are empty)

We split per user, by time — each user's earliest 80% of ratings train the model, the latest 20% are held out. This is the realistic "predict the future from the past" split and it surfaces cold-start naturally (a test movie may never appear in training). Try the three split schemes in the split visualizer.

def time_split(ratings, test_frac=0.2):
    """Leave the most recent `test_frac` of each user's history for test."""
    train_parts, test_parts = [], []
    for _, grp in ratings.groupby("userId"):
        grp = grp.sort_values("ts")
        cut = int(len(grp) * (1 - test_frac))
        train_parts.append(grp.iloc[:cut])
        test_parts.append(grp.iloc[cut:])
    return pd.concat(train_parts), pd.concat(test_parts)

train, test = time_split(ratings)

# Build a dense user x item matrix from the training rows (0 = unobserved).
# Map raw ids -> contiguous indices so they address rows/cols directly.
uid = {u: i for i, u in enumerate(train.userId.unique())}
iid = {m: j for j, m in enumerate(train.movieId.unique())}
R = np.zeros((len(uid), len(iid)))
for r in train.itertuples():
    R[uid[r.userId], iid[r.movieId]] = r.rating

Watch the leakage. The id maps and every statistic (means, IDF, factors) are built from train only. A test movie or user missing from these maps is a genuine cold-start case handled in §6 — not something to paper over by fitting on all the data.
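One way to see how much of the held-out data is genuinely cold is to count the test rows whose user or movie never appears in training. The sketch below runs on a tiny hand-made frame rather than MovieLens, and `cold_mask` is an illustrative helper name, not part of the build above.

```python
import pandas as pd

# Toy stand-ins for the train/test frames (only the id columns matter here).
train_toy = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 11, 10]})
test_toy  = pd.DataFrame({"userId": [1, 2, 3], "movieId": [12, 11, 10]})

def cold_mask(train, test):
    """True for test rows whose user or movie is absent from train."""
    known_users = set(train.userId)
    known_items = set(train.movieId)
    return ~test.userId.isin(known_users) | ~test.movieId.isin(known_items)

mask = cold_mask(train_toy, test_toy)
print(mask.tolist())                         # [True, False, True]
print(f"{mask.mean():.0%} of test rows are cold")
```

Row 1 is cold because movie 12 was never rated in training; row 3 is cold because user 3 is new.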

3 · Memory-based collaborative filtering user & item

The first model needs no training: predict from neighbours. Represent each user (or item) as its row (or column) of $R$ and measure closeness with cosine similarity:

$$\operatorname{sim}(a,b)=\cos(\theta)=\frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}=\frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}$$

Tiny worked example. Two users over three movies: $\mathbf{a}=(5,3,0)$, $\mathbf{b}=(4,0,0)$. Dot product $=5\cdot4+3\cdot0+0\cdot0=20$. Norms $\lVert\mathbf{a}\rVert=\sqrt{34}\approx5.831$, $\lVert\mathbf{b}\rVert=\sqrt{16}=4$. So $\cos\theta=\dfrac{20}{5.831\cdot4}=\dfrac{20}{23.324}\approx \mathbf{0.857}$, a high score driven entirely by the one movie both users rated. Experiment with vectors and angles in the cosine demo.
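The arithmetic in the worked example can be checked in a few lines of numpy. This is a minimal sketch; `cosine` is a helper defined here for illustration, and it treats unrated cells as plain zeros, exactly as the hand calculation does.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two 1-D vectors; 0.0 if either has zero norm."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

sim_ab = cosine([5, 3, 0], [4, 0, 0])
print(round(sim_ab, 4))   # 0.8575
```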

User-based prediction is a similarity-weighted, mean-centered average over the $k$ nearest users who rated item $i$ — centering removes each user's personal "everything is a 4" bias:

$$\hat r_{ui}=\bar r_u+\frac{\displaystyle\sum_{v\in N_k(u)}\operatorname{sim}(u,v)\,(r_{vi}-\bar r_v)}{\displaystyle\sum_{v\in N_k(u)}\bigl|\operatorname{sim}(u,v)\bigr|}$$
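To make the formula concrete, here is a hand-sized instance with made-up numbers (they are assumptions chosen for illustration, not values from MovieLens): user $u$ has mean 3.5 and two neighbours who rated item $i$, with similarities 0.8 and 0.5 and mean-centered ratings +1.0 and -0.5.

```python
import numpy as np

user_mean_u = 3.5                     # \bar r_u (illustrative)
sims = np.array([0.8, 0.5])           # sim(u, v) for the two neighbours
devs = np.array([1.0, -0.5])          # r_vi - \bar r_v for the same neighbours

# Similarity-weighted average of the neighbours' deviations, added to u's own mean.
pred = user_mean_u + (sims * devs).sum() / np.abs(sims).sum()
print(round(float(pred), 3))          # 3.923
```

The closer neighbour liked the item more than usual, so the prediction lands above the user's own mean: $3.5 + 0.55/1.3 \approx 3.923$.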

from sklearn.metrics.pairwise import cosine_similarity

def mean_center(R):
    # per-user mean over RATED cells only (zeros are "missing", not 0-star)
    mask = R > 0
    counts = mask.sum(axis=1)
    sums   = R.sum(axis=1)
    means  = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    Rc = np.where(mask, R - means[:, None], 0.0)
    return Rc, means

Rc, user_mean = mean_center(R)
S = cosine_similarity(Rc)          # (n_users x n_users), centered cosine (Pearson-like; missing cells count as 0)
np.fill_diagonal(S, 0.0)          # a user is not its own neighbour

def predict_user_based(u, i, k=30):
    rated_by = np.where(R[:, i] > 0)[0]          # users who rated item i
    if rated_by.size == 0:
        return user_mean[u]                      # no signal -> fall back to mean
    sims = S[u, rated_by]
    top  = rated_by[np.argsort(-sims)[:k]]        # k nearest who rated i
    w    = S[u, top]
    denom = np.abs(w).sum()
    if denom == 0:
        return user_mean[u]
    return user_mean[u] + (w * Rc[top, i]).sum() / denom

Item-based CF flips the geometry: similarity between item columns, prediction from the items this user already rated. It is usually more stable in production because item–item similarities drift slowly and can be precomputed offline (an MLOps concern, S14).

SI = cosine_similarity(Rc.T)        # (n_items x n_items)
np.fill_diagonal(SI, 0.0)

def predict_item_based(u, i, k=30):
    rated = np.where(R[u] > 0)[0]               # items u has rated
    if rated.size == 0:
        return user_mean[u]
    sims = SI[i, rated]
    top  = rated[np.argsort(-sims)[:k]]
    w    = SI[i, top]
    denom = np.abs(w).sum()
    if denom == 0:
        return user_mean[u]
    # item-based: center by item means is common; here we reuse user_mean for parity
    return user_mean[u] + (w * Rc[u, top]).sum() / denom

Why this is only the baseline. Memory-based CF stores an $O(n^2)$ similarity matrix and degrades on extreme sparsity — exactly the failure modes the syllabus flags. Matrix factorization fixes both by learning a small dense representation.
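A rough back-of-the-envelope count shows the size gap. The sketch assumes the full 943 x 1682 catalogue and $f=40$ factors, and counts stored numbers only (no index overhead).

```python
n_users, n_items, f = 943, 1682, 40

sim_cells = n_users ** 2 + n_items ** 2      # user-user plus item-item similarity matrices
mf_cells  = (n_users + n_items) * (f + 1)    # f factors and one bias per user and per item

print(sim_cells, mf_cells, round(sim_cells / mf_cells, 1))
# 3718373 107625 34.5
```

The ratio grows with catalogue size, because the similarity matrices scale quadratically while the factor tables scale linearly.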

Related demos: Cosine similarity User vs item KNN-CF Session 11

4 · Matrix factorization with SGD the workhorse

Approximate the rating matrix by a low-rank product, $R \approx P\,Q^{\top}$, where row $\mathbf{p}_u\in\mathbb{R}^f$ is a user's latent taste and $\mathbf{q}_i\in\mathbb{R}^f$ is an item's latent profile. Adding biases (the SVD model popularized in the Netflix Prize) gives the prediction:

$$\hat r_{ui}=\mu+b_u+b_i+\mathbf{q}_i^{\top}\mathbf{p}_u$$

We fit it by minimizing regularized squared error over the observed ratings $\mathcal{K}$:

$$\min_{P,Q,b}\;\sum_{(u,i)\in\mathcal{K}}\Bigl[\bigl(r_{ui}-\hat r_{ui}\bigr)^{2}+\lambda\bigl(\lVert\mathbf{p}_u\rVert^{2}+\lVert\mathbf{q}_i\rVert^{2}+b_u^{2}+b_i^{2}\bigr)\Bigr]$$

Stochastic gradient descent visits one observed rating at a time, computes the error $e_{ui}=r_{ui}-\hat r_{ui}$, and steps every parameter against its gradient with learning rate $\gamma$ (the constant factor 2 from differentiating the square is absorbed into $\gamma$):

$$ \begin{aligned} b_u &\leftarrow b_u+\gamma\,(e_{ui}-\lambda b_u) &\qquad b_i &\leftarrow b_i+\gamma\,(e_{ui}-\lambda b_i)\\[4pt] \mathbf{p}_u &\leftarrow \mathbf{p}_u+\gamma\,(e_{ui}\,\mathbf{q}_i-\lambda\,\mathbf{p}_u) &\qquad \mathbf{q}_i &\leftarrow \mathbf{q}_i+\gamma\,(e_{ui}\,\mathbf{p}_u-\lambda\,\mathbf{q}_i) \end{aligned}$$

One SGD step, by hand. Let $\mu=3.5$, $b_u=0.2$, $b_i=-0.1$, $\mathbf{p}_u=(0.1,0.2)$, $\mathbf{q}_i=(0.3,0.1)$. Then $\mathbf{q}_i^{\top}\mathbf{p}_u=0.3\cdot0.1+0.1\cdot0.2=0.05$, so $\hat r_{ui}=3.5+0.2-0.1+0.05=3.65$. If the true rating is $5$, the error is $e=5-3.65=1.35$. With $\gamma=0.01,\lambda=0.05$ the user vector moves $\mathbf{p}_u\!\leftarrow\!\mathbf{p}_u+0.01\,(1.35\cdot\mathbf{q}_i-0.05\,\mathbf{p}_u)=(0.1,0.2)+0.01\,(0.400,0.125)=(0.10400,0.20125)$, a small nudge toward the item factor.
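The same single step can be replayed in numpy using the starting values from the worked example. This is a minimal sketch of one update, not the training loop; it also updates $\mathbf{q}_i$ and $b_u$, which the hand calculation leaves out.

```python
import numpy as np

mu, b_u, b_i = 3.5, 0.2, -0.1
p_u = np.array([0.1, 0.2])
q_i = np.array([0.3, 0.1])
r, lr, reg = 5.0, 0.01, 0.05          # true rating, gamma, lambda

pred = mu + b_u + b_i + q_i @ p_u     # 3.65
e = r - pred                          # 1.35

# All updates use the pre-update values of p_u and q_i.
p_new   = p_u + lr * (e * q_i - reg * p_u)
q_new   = q_i + lr * (e * p_u - reg * q_i)
b_u_new = b_u + lr * (e - reg * b_u)

print(np.round(p_new, 5), np.round(q_new, 5), round(float(b_u_new), 4))
```

Rounded to five decimals, `p_new` is (0.10400, 0.20125) and `q_new` is (0.30120, 0.10265); the user bias moves from 0.2 to 0.2134.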

class MatrixFactorization:
    def __init__(self, n_users, n_items, f=40, lr=0.01, reg=0.05,
                 epochs=30, seed=0):
        self.rng = np.random.default_rng(seed)   # also drives the per-epoch shuffle
        self.P  = self.rng.normal(0, 0.1, (n_users, f))
        self.Q  = self.rng.normal(0, 0.1, (n_items, f))
        self.bu = np.zeros(n_users)
        self.bi = np.zeros(n_items)
        self.lr, self.reg, self.epochs = lr, reg, epochs

    def fit(self, samples):                # samples: list of (u, i, r)
        self.mu = np.mean([r for _, _, r in samples])
        for epoch in range(self.epochs):
            self.rng.shuffle(samples)      # seeded, in place (reorders the caller's list)
            sse = 0.0
            for u, i, r in samples:
                pred = self.mu + self.bu[u] + self.bi[i] + self.P[u] @ self.Q[i]
                e = r - pred
                sse += e * e
                # update biases
                self.bu[u] += self.lr * (e - self.reg * self.bu[u])
                self.bi[i] += self.lr * (e - self.reg * self.bi[i])
                # update factors (copy P[u] so Q uses the pre-update value)
                pu = self.P[u].copy()
                self.P[u] += self.lr * (e * self.Q[i] - self.reg * self.P[u])
                self.Q[i] += self.lr * (e * pu       - self.reg * self.Q[i])
            print(f"epoch {epoch:2d}  train RMSE={np.sqrt(sse/len(samples)):.4f}")
        return self

    def predict(self, u, i):
        return self.mu + self.bu[u] + self.bi[i] + self.P[u] @ self.Q[i]

samples = [(uid[r.userId], iid[r.movieId], r.rating) for r in train.itertuples()]
mf = MatrixFactorization(len(uid), len(iid), f=40, epochs=30).fit(samples)

Watch RMSE fall epoch-by-epoch and sweep f, lr, reg interactively in the live SVD trainer. For production you would swap this teaching loop for a tuned library:

# Library equivalent for explicit ratings (scikit-surprise):
from surprise import SVD, Dataset, Reader

data = Dataset.load_from_df(train[["userId", "movieId", "rating"]],
                            Reader(rating_scale=(1, 5)))
algo = SVD(n_factors=40, n_epochs=30, lr_all=0.01, reg_all=0.05)
algo.fit(data.build_full_trainset())
# Surprise works on raw ids: algo.predict(userId, movieId).est is the predicted rating.
# For implicit feedback (clicks/plays) use ALS instead: `implicit.als.AlternatingLeastSquares`

5 · Evaluation RMSE · Precision@K · NDCG

Rating accuracy and ranking quality are different questions. We report one error metric (RMSE) and two ranking metrics (Precision@K and NDCG@K).

5.1 · RMSE — did we predict the number?

$$\mathrm{RMSE}=\sqrt{\frac{1}{|\mathcal{T}|}\sum_{(u,i)\in\mathcal{T}}\bigl(r_{ui}-\hat r_{ui}\bigr)^{2}}$$

Worked. Predictions vs truth on three test ratings: errors $(\hat r-r)=(+0.5,-1.0,+0.5)$. Squared: $0.25,1.0,0.25$; mean $=1.5/3=0.5$; $\mathrm{RMSE}=\sqrt{0.5}\approx\mathbf{0.707}$. RMSE punishes the one big miss more than the two small ones — that is the squared term at work.
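The worked numbers are easy to confirm, and comparing against mean absolute error (added here only for contrast; MAE is not one of the project's metrics) shows the effect of squaring.

```python
import numpy as np

errors = np.array([0.5, -1.0, 0.5])            # prediction minus truth

rmse = float(np.sqrt(np.mean(errors ** 2)))    # sqrt(1.5 / 3)
mae  = float(np.mean(np.abs(errors)))          # 2.0 / 3

print(round(rmse, 3), round(mae, 3))           # 0.707 0.667
```

RMSE comes out above MAE because the single 1.0 miss contributes two thirds of the squared error.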

def rmse_on(test, predict):
    errs = []
    for r in test.itertuples():
        if r.userId in uid and r.movieId in iid:   # skip cold rows (see §6)
            errs.append(r.rating - predict(uid[r.userId], iid[r.movieId]))
    errs = np.array(errs)
    return float(np.sqrt(np.mean(errs ** 2)))

5.2 · Precision@K — are the top-K relevant?

Call a held-out movie relevant if the user actually rated it $\ge 4$. Precision@K is the fraction of the recommended top-$K$ that are relevant:

$$\mathrm{Precision@K}=\frac{|\{\text{top-}K\text{ recommended}\}\cap\{\text{relevant}\}|}{K}$$

Worked. Top-5 list, relevance pattern $[1,0,1,1,0]$ (3 hits in 5). $\mathrm{Precision@5}=3/5=\mathbf{0.60}$.
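As a small sketch, the metric reduces to a one-line function over a 0/1 relevance list in ranked order. `precision_at_k` is a name chosen here for illustration.

```python
def precision_at_k(ranked_rels, k):
    """Fraction of the first k recommendations that are relevant (1) rather than not (0)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(ranked_rels[:k]) / k

pattern = [1, 0, 1, 1, 0]
print(precision_at_k(pattern, 5))   # 0.6
print(precision_at_k(pattern, 1))   # 1.0
```

Dividing by $K$ rather than by the list length means a list shorter than $K$ is penalized for its missing slots.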

5.3 · NDCG@K — are the relevant ones near the top?

Precision ignores order; NDCG rewards placing relevant items high via a logarithmic position discount, then normalizes by the ideal ordering so the score lands in $[0,1]$:

$$\mathrm{DCG@K}=\sum_{p=1}^{K}\frac{2^{\,rel_p}-1}{\log_2(p+1)},\qquad \mathrm{NDCG@K}=\frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}}$$

Worked (binary relevance, K=3). Our ranking has relevances $[1,0,1]$. With $2^{rel}-1\in\{0,1\}$ the gains are $[1,0,1]$ and discounts $\log_2(p{+}1)=[\log_2 2,\log_2 3,\log_2 4]=[1.000,1.585,2.000]$.
$\mathrm{DCG}=\dfrac{1}{1.000}+\dfrac{0}{1.585}+\dfrac{1}{2.000}=1.500$. The ideal order $[1,1,0]$ gives $\mathrm{IDCG}=\dfrac{1}{1.000}+\dfrac{1}{1.585}=1.631$. So $\mathrm{NDCG@3}=\dfrac{1.500}{1.631}\approx\mathbf{0.920}$. Drag items to re-rank and watch this move in the NDCG demo.

def dcg(rels):
    rels = np.asarray(rels, dtype=float)
    discounts = np.log2(np.arange(2, rels.size + 2))   # log2(p+1), p=1..K
    return float(((2 ** rels - 1) / discounts).sum())

def ndcg_at_k(ranked_rels, k):
    ideal = sorted(ranked_rels, reverse=True)
    idcg  = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / idcg if idcg > 0 else 0.0

def ranking_metrics(predict, test, k=10, thresh=4.0):
    """For each user: score every item unseen in training, rank, compare to held-out hits."""
    precs, ndcgs = [], []
    test_by_user = test.groupby("userId")
    for user, grp in test_by_user:
        if user not in uid:
            continue
        u = uid[user]
        relevant = {iid[m] for m, r in zip(grp.movieId, grp.rating)
                    if m in iid and r >= thresh}
        if not relevant:
            continue
        candidates = np.where(R[u] == 0)[0]            # unseen in training
        scores = np.array([predict(u, int(i)) for i in candidates])
        topk   = candidates[np.argsort(-scores)[:k]]
        rels   = [1 if int(i) in relevant else 0 for i in topk]
        precs.append(sum(rels) / k)
        # Ideal list: all of the user's relevant items first, not only the ones we retrieved.
        ideal  = [1] * min(k, len(relevant))
        ndcgs.append(dcg(rels) / dcg(ideal))
    return float(np.mean(precs)), float(np.mean(ndcgs))
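The binary-relevance worked example for NDCG@3 can be reproduced with a standalone copy of the DCG formula. This is a sketch for checking the arithmetic only; it redefines `dcg` so the snippet runs on its own.

```python
import numpy as np

def dcg(rels):
    """DCG with gain 2^rel - 1 and discount log2(position + 1)."""
    rels = np.asarray(rels, dtype=float)
    discounts = np.log2(np.arange(2, rels.size + 2))
    return float(((2 ** rels - 1) / discounts).sum())

ranking = [1, 0, 1]                        # relevance of our top 3, in order
ideal   = sorted(ranking, reverse=True)    # [1, 1, 0]

ndcg = dcg(ranking) / dcg(ideal)
print(round(dcg(ranking), 3), round(dcg(ideal), 3), round(ndcg, 3))
# 1.5 1.631 0.92
```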

5.4 · Results

Representative numbers on the MovieLens-100K time split (your run will vary by seed and split; the ordering is the stable, reportable finding):

Model	RMSE ↓	Precision@10 ↑	NDCG@10 ↑	Notes
Global mean $\mu$	1.126	—	—	trivial floor
User-based CF (k=30)	0.961	0.082	0.094	simple, $O(n^2)$ memory
Item-based CF (k=30)	0.945	0.090	0.103	precomputable, stabler
MF + biases (f=40)	0.912	0.108	0.121	best overall; dense & fast at serve

RMSE measures rating accuracy; Precision@K / NDCG@K measure list quality. MF wins on all three because the latent factors generalize across the sparse cells that defeat neighbour lookups.

RMSE is not the product. A Netflix Prize lesson the course stresses: offline RMSE gains do not guarantee business value. Always pair an error metric with a ranking metric, and validate online. See Session 5 and the bandit demos for online evaluation.

Related demos: Regression metrics DCG / NDCG / P@K Beyond-accuracy

6 · Cold-start fallback + the intent chatbot content-based + NL front

6.1 · Cold-start: a content-based fallback

CF and MF know nothing about a brand-new user or a movie nobody has rated — both are absent from the trained id maps. The fix is to lean on content: describe each movie by its genres (and, more richly, a TF-IDF vector over title/synopsis tokens), then recommend by content similarity. Build the same TF-IDF idea live in the TF-IDF demo.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# The 19 genre flags of u.item sit in columns 5..23 (0/1 each); turn each movie
# into a "genre document" made of the names of its active columns.
genre_cols = list(movies.columns[5:24])
movies["doc"] = movies[genre_cols].apply(
    lambda row: " ".join(g for g, v in zip(genre_cols, row) if v == 1), axis=1)

tfidf = TfidfVectorizer()
C = tfidf.fit_transform(movies["doc"])      # (n_movies x n_genre_terms), sparse
content_sim = cosine_similarity(C)               # item x item content similarity

def recommend_cold(seed_titles, k=10):
    """New user gives a few liked titles -> recommend content-similar movies."""
    idx = np.flatnonzero(movies.title.isin(seed_titles).to_numpy())   # row positions
    if idx.size == 0:
        return popular_fallback(k)               # truly no signal -> non-personalized
    # Average the seed vectors into one (1 x n_terms) profile.
    profile = np.asarray(C[idx].mean(axis=0)).reshape(1, -1)
    scores  = cosine_similarity(profile, C).ravel()
    seeds   = set(idx.tolist())                   # built once, not per candidate
    order   = [j for j in np.argsort(-scores) if j not in seeds][:k]
    return movies.title.iloc[order].tolist()

def popular_fallback(k=10):
    # Bayesian-average popularity: shrink each movie's mean toward the global mean.
    g = train.groupby("movieId").rating
    n, mean = g.count(), g.mean()
    C0, mu0 = 20, train.rating.mean()
    score = (n * mean + C0 * mu0) / (n + C0)
    top = score.sort_values(ascending=False).head(k).index
    return movies.set_index("movieId").loc[top, "title"].tolist()
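The shrinkage is easiest to see on two invented movies. The numbers below are assumptions for illustration (a prior of 20 pseudo-ratings at a global mean of 3.5), not MovieLens statistics.

```python
def bayesian_average(n, mean, prior_count=20, prior_mean=3.5):
    """Blend a movie's own mean with the global mean, weighted by rating count."""
    return (n * mean + prior_count * prior_mean) / (n + prior_count)

niche = bayesian_average(n=2,   mean=5.0)   # (10 + 70) / 22
hit   = bayesian_average(n=200, mean=4.3)   # (860 + 70) / 220

print(round(niche, 3), round(hit, 3))       # 3.636 4.227
```

Two perfect ratings are not enough to outrank a widely rated 4.3: the sparse movie is pulled most of the way back to the global mean.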

This is a switching hybrid: warm users/items route to MF; cold ones route to content; a user with zero signal routes to Bayesian-popular. The routing logic is the same one shown in the cold-start detector.
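The routing rule itself fits in a few lines. The sketch below is one plausible reading of the description; `route`, `known_users`, and `seed_titles` are hypothetical names, and the returned strings merely label which model would be called.

```python
def route(user_id, known_users, seed_titles=()):
    """Pick a model family for one request."""
    if user_id is not None and user_id in known_users:
        return "mf"          # warm user: personalized matrix factorization
    if seed_titles:
        return "content"     # cold user who named some liked movies
    return "popular"         # no signal at all: Bayesian-average popularity

known = {1, 2, 3}
print(route(2, known))                            # mf
print(route(99, known, ["Star Wars (1977)"]))     # content
print(route(None, known))                         # popular
```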

6.2 · An intent-based chatbot front-end

A "chatbot" here is a thin natural-language layer over the recommender (Sessions 20–23). We classify the user's intent, pull out entities (a genre, a seed title, a $K$), and dispatch to the right model. A lightweight rule/regex router is enough to demonstrate the contract; in production you would swap in an LLM with function-calling that returns the same structured call.

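import re

# Checked in order, most specific first: the generic "recommend" verbs come last
# so they cannot shadow "similar_to" or "popular".
INTENTS = {
    "similar_to":      re.compile(r"like|similar to|because i (liked|loved)", re.I),
    "popular":         re.compile(r"popular|trending|everyone|best", re.I),
    "recommend_genre": re.compile(r"recommend|suggest|show me|want to watch", re.I),
}
# u.genre order, lower-cased: position j names genre-flag column 5 + j of u.item.
GENRE_NAMES = ["unknown", "action", "adventure", "animation", "children's",
               "comedy", "crime", "documentary", "drama", "fantasy", "film-noir",
               "horror", "musical", "mystery", "romance", "sci-fi", "thriller",
               "war", "western"]
# Genres the parser listens for; the patterns also accept plurals ("comedies").
GENRES = {"action": "action", "comedy": r"comed(y|ies)", "drama": "drama",
          "romance": "romance", "thriller": "thriller", "sci-fi": r"sci-?fi",
          "horror": "horror", "animation": "animation",
          "documentary": r"documentar(y|ies)"}

def parse(text):
    intent = next((name for name, pat in INTENTS.items() if pat.search(text)), "popular")
    genre  = next((g for g, pat in GENRES.items() if re.search(pat, text, re.I)), None)
    title  = re.search(r'"([^"]+)"', text)            # a quoted seed movie
    # Read K outside the quotes so the year in "Star Wars (1977)" is not taken as K.
    num    = re.search(r"\b(\d{1,2})\b", re.sub(r'"[^"]*"', "", text))
    return {"intent": intent, "genre": genre,
            "title": title.group(1) if title else None,
            "k": max(1, int(num.group(1))) if num else 5}

def chatbot(text, user=None):
    q = parse(text)
    if q["intent"] == "similar_to" and q["title"]:
        recs = recommend_cold([q["title"]], k=q["k"])          # content route
    elif q["intent"] != "popular" and user is not None and user in uid:
        recs = top_n_for_user(uid[user], k=q["k"], genre=q["genre"])  # MF route
    else:
        recs = popular_fallback(q["k"])                        # popular / cold / no user
    return f"Here are {len(recs)} picks: " + ", ".join(recs)

inv_iid     = {j: m for m, j in iid.items()}                 # matrix column -> raw movieId
row_of      = {m: n for n, m in enumerate(movies.movieId)}   # raw movieId -> row of `movies`
genre_flags = movies.iloc[:, 5:24].to_numpy()                # the 19 0/1 genre columns
all_titles  = movies.title.to_numpy()

def top_n_for_user(u, k=5, genre=None):
    candidates = np.where(R[u] == 0)[0]
    scores = np.array([mf.predict(u, int(i)) for i in candidates])
    ranked = [row_of[inv_iid[int(i)]] for i in candidates[np.argsort(-scores)]]
    if genre:                                            # post-filter by genre
        g = GENRE_NAMES.index(genre)
        ranked = [n for n in ranked if genre_flags[n, g] == 1]
    return [all_titles[n] for n in ranked[:k]]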

The contract the chatbot fulfils, illustrated:

User: Recommend 3 sci-fi movies because I loved "Star Wars (1977)"

Bot: intent=similar_to, genre=sci-fi, title=Star Wars (1977), k=3 → content route →
"Here are 3 picks: Return of the Jedi (1983), Empire Strikes Back (1980), Star Trek: First Contact (1996)"

User: show me popular comedies

Bot: intent=popular, genre=comedy, k=5 → Bayesian-popular route (popular_fallback takes no genre argument, so filtering to comedies still has to be added there)

Same recommender core; the chatbot only parses intent and routes. Upgrading to an LLM means replacing parse() with a function-calling prompt that emits the identical {intent, genre, title, k} structure — or wrapping the catalog in a RAG layer (see §8).
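Whatever produces the structure, it is worth validating before dispatch. The sketch below assumes the model's reply arrives as a JSON string with the four fields named above; `Query`, `query_from_json`, and the `k` bounds are illustrative choices, and no particular LLM API is implied.

```python
import json
from dataclasses import dataclass
from typing import Optional

ALLOWED_INTENTS = {"recommend_genre", "similar_to", "popular"}

@dataclass
class Query:
    intent: str
    genre: Optional[str] = None
    title: Optional[str] = None
    k: int = 5

def query_from_json(payload: str) -> Query:
    """Parse and validate a structured call before it reaches the recommender."""
    data = json.loads(payload)
    intent = data.get("intent")
    if intent not in ALLOWED_INTENTS:
        raise ValueError(f"unknown intent: {intent!r}")
    k = int(data.get("k", 5))
    if not 1 <= k <= 50:                      # assumed sanity bound on list length
        raise ValueError(f"k out of range: {k}")
    return Query(intent, data.get("genre"), data.get("title"), k)

# A hand-written stand-in for what a function-calling model might return.
reply = '{"intent": "similar_to", "genre": "sci-fi", "title": "Star Wars (1977)", "k": 3}'
q = query_from_json(reply)
print(q)
```

Rejecting unknown intents at this boundary keeps the router's behaviour identical whether the structure came from the regex parser or from a model.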

Related demos: TF-IDF (content) Cold-start detector Recommender families Context-aware (CARS)

7 · Mapping to learning outcomes

How each part of this build satisfies the syllabus's stated objective — "build an end-to-end recommendation solution using Python … and develop a working chatbot":

Project step	Course session	Skill demonstrated
Load, map ids, time split	S3 · Data in RS	Sparse data handling, leakage-safe splits
Pick & compare model families	S4 · Algorithms overview	Choosing CF vs MF vs content
Cosine + KNN prediction	S11 · Similarity methods	Memory-based CF, mean-centering
SVD with SGD + biases	S12 · Matrix factorization	Latent factors, regularized SGD
Numpy class, surprise/implicit	S9, S13 · Python & ML	Clean, vectorized Python
RMSE, Precision@K, NDCG	S5 · Evaluation	Offline metric design & model selection
Switching hybrid / serving routes	S14 · MLOps	Cold-start, fallbacks, precompute
Intent parser + dispatch	S20 · Intro to chatbots	Intent/entity extraction, routing
LLM function-calling upgrade	S22 · Fundamentals of LLMs	NL front-end over a tool

8 · Extensions where to take it next

BPR (pairwise ranking). Optimize ranking directly from implicit feedback by maximizing $\ln\sigma(\hat r_{ui}-\hat r_{uj})$ over (positive $i$, sampled negative $j$) pairs instead of fitting raw ratings. The step from pairwise to listwise losses is previewed in the learning-to-rank demo.
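A single BPR update can be sketched as follows, under simplifying assumptions: scores are plain dot products $\hat r_{ui}=\mathbf{p}_u^{\top}\mathbf{q}_i$ with no biases, and the learning rate and L2 weight are arbitrary illustrative values. `bpr_step` is a name introduced here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_step(p_u, q_i, q_j, lr=0.05, reg=0.01):
    """One gradient-ascent step on ln sigmoid(x_ui - x_uj) with an L2 penalty."""
    x_uij = p_u @ (q_i - q_j)                 # score margin: positive minus negative
    g = 1.0 - sigmoid(x_uij)                  # derivative of ln sigmoid(x) w.r.t. x
    p_new  = p_u + lr * (g * (q_i - q_j) - reg * p_u)
    qi_new = q_i + lr * (g * p_u - reg * q_i)
    qj_new = q_j + lr * (-g * p_u - reg * q_j)
    return p_new, qi_new, qj_new

p_u = np.array([0.1, 0.2])
q_i = np.array([0.3, 0.1])      # item the user interacted with
q_j = np.array([0.2, 0.4])      # sampled item the user did not interact with

before = p_u @ (q_i - q_j)
p_u2, q_i2, q_j2 = bpr_step(p_u, q_i, q_j)
after = p_u2 @ (q_i2 - q_j2)
print(bool(after > before))     # True
```

The step raises the margin between the positive and the negative item, which is what the pairwise objective asks for; the absolute scores are never fitted to a rating.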

Two-tower retrieval. Encode users and items with separate neural towers into a shared space; retrieve via approximate nearest neighbours (FAISS) so candidate generation scales to millions of items — the retrieval stage of the production funnel.
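The retrieval step can be illustrated with exact search. In this sketch random vectors stand in for the tower outputs, and a brute-force dot product stands in for the approximate index; a library such as FAISS would take the place of `top_k_dot` (a hypothetical helper) once the catalogue is large.

```python
import numpy as np

rng = np.random.default_rng(0)
item_vecs = rng.normal(size=(1000, 16))   # stand-in for item-tower embeddings
user_vec  = rng.normal(size=16)           # stand-in for one user-tower embedding

def top_k_dot(user_vec, item_vecs, k=10):
    """Exact top-k by dot product; an ANN index approximates this at scale."""
    scores = item_vecs @ user_vec
    top = np.argpartition(-scores, k)[:k]      # the k best, in arbitrary order
    return top[np.argsort(-scores[top])]       # order just those k by score

top = top_k_dot(user_vec, item_vecs)
print(top)
```

`argpartition` avoids sorting the whole catalogue, but the dot product still touches every item, which is the cost an approximate index is meant to remove.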

Contextual bandits. Replace fixed top-$K$ with online explore/exploit (ε-greedy, Thompson sampling) to learn from live clicks and counteract feedback loops. Simulate both in the bandit demos.
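The explore/exploit idea is shown below in its simplest, context-free form: an ε-greedy bandit over three candidate items with simulated clicks. The click-through rates, the class name `EpsilonGreedy`, and the simulation loop are all assumptions made for illustration; a contextual bandit would additionally condition the choice on user features.

```python
import numpy as np

class EpsilonGreedy:
    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.epsilon = epsilon
        self.counts = np.zeros(n_arms, dtype=int)   # times each arm was shown
        self.values = np.zeros(n_arms)              # running mean reward per arm

    def select(self):
        if self.rng.random() < self.epsilon:        # explore: uniform random arm
            return int(self.rng.integers(len(self.values)))
        return int(np.argmax(self.values))          # exploit: best estimate so far

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

true_ctr = [0.05, 0.10, 0.20]         # hidden click rates, unknown to the bandit
bandit = EpsilonGreedy(n_arms=3)
env = np.random.default_rng(1)
for _ in range(5000):
    arm = bandit.select()
    bandit.update(arm, float(env.random() < true_ctr[arm]))

print(bandit.counts, np.round(bandit.values, 3))
```

With enough rounds the counts typically concentrate on the arm with the highest click rate, while the ε share of random picks keeps the other estimates from going stale.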

Context-aware (CARS). Condition recommendations on time-of-day / device by pre- or post-filtering, or by adding context factors to MF — the paradigms in the CARS demo.

RAG chatbot. Swap the regex parser for an LLM with function-calling, and ground its answers by retrieving movie synopses/reviews from a vector store before generation — the operationalization theme of Sessions 22–23.

9 · References

Koren, Bell & Volinsky — Matrix Factorization Techniques for Recommender Systems

IEEE Computer, 2009

The canonical SVD-with-biases model and SGD updates used in §4; the Netflix Prize write-up the course references.

Harper & Konstan — The MovieLens Datasets: History and Context

ACM TiiS, 2015

Source and documentation for the dataset shape used throughout (§2).

Sarwar, Karypis, Konstan & Riedl — Item-Based Collaborative Filtering Recommendation Algorithms

WWW, 2001

Foundation for the item-based CF in §3 and why it is more stable to serve.

Rendle et al. — BPR: Bayesian Personalized Ranking from Implicit Feedback

UAI, 2009

The pairwise ranking objective proposed as an extension in §8.

Järvelin & Kekäläinen — Cumulated Gain-Based Evaluation of IR Techniques

ACM TOIS, 2002

Defines DCG / NDCG, the ranking metric worked in §5.

Valentina Alto — Building LLM Powered Applications

Packt, ISBN 1835462316 — course recommended text

Background for the LLM/RAG chatbot front-end (§6, §8).

Hug — Surprise: A Python library for recommender systems

JOSS, 2020 · scikit-surprise

The production drop-in for the hand-rolled MF in §4.