Sentiment analysis pipeline: classic → transformer.
One end-to-end project, fully worked. We take a corpus of movie reviews, build a transparent TF-IDF + logistic-regression baseline, then fine-tune a small DistilBERT transformer and compare them head-to-head — with real, runnable Python, the underlying math in KaTeX, and an honest error analysis. It threads together roughly a dozen sessions from the syllabus into a single deliverable.
The goal
Given the raw text of a movie review, predict whether the writer felt positive or negative about the film. This is exactly the practice exercise the syllabus sets out in Session 10 ("build a sentiment analysis tool … evaluate movie reviews") and revisits in Session 11. We treat it as a chance to contrast the two paradigms the course covers: hand-engineered sparse features fed to a linear model, versus a pre-trained contextual encoder we fine-tune.
Sessions exercised
Preprocessing & tokenization
Session 3 — clean, normalize, tokenize text into features.
TF-IDF & retrieval
Sessions 7–8 — weight terms by distinctiveness; build sparse vectors.
Logistic regression
Sessions 10–11 — a calibrated linear classifier with the sigmoid.
Embeddings
Session 13 — from sparse one-hot to dense contextual vectors.
Transformers
Sessions 19 & 22 — fine-tune a self-attention encoder.
Metrics & errors
Accuracy, F1, confusion matrix, and a qualitative error analysis.
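Setup: pip install scikit-learn transformers datasets torch evaluate accelerate (the fine-tuning code also imports evaluate, and recent transformers releases need accelerate for Trainer). The reported numbers come from a run on the standard IMDb sentiment corpus; your exact figures will wobble by a point or two with different seeds and splits.
The dataset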
We use IMDb Large Movie Review (Maas et al., 2011): 50,000 reviews split evenly into 25k train / 25k test, with perfectly balanced positive/negative labels. Balanced classes mean accuracy is a meaningful headline metric — but we still report F1, because in deployment the class balance rarely stays 50/50.

    from collections import Counter
    from datasets import load_dataset

    # 25k train / 25k test, label 0 = negative, 1 = positive
    ds = load_dataset("imdb")
    train, test = ds["train"], ds["test"]
    print(len(train), len(test))      # 25000 25000
    print(train[0]["label"])          # 0
    print(train[0]["text"][:120])     # "I rented I AM CURIOUS-YELLOW ..."

    # class balance sanity check
    print(Counter(train["label"]))    # Counter({0: 12500, 1: 12500})

Why a held-out test set matters: every number we quote later is computed on reviews the model
never saw during training. Tuning anything against the test set — even by hand — silently inflates your results.
For the baseline we further carve a small validation slice out of train for hyper-parameter choices.
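One way to carve out that validation slice, sketched with the standard library only. The helper name `stratified_split`, the 10% fraction, and the seed are illustrative assumptions, not the settings behind the reported numbers; the toy label list stands in for `train["label"]`.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, val_fraction=0.1, seed=0):
    """Return (train_idx, val_idx) with the same class balance in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, val_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                          # shuffle within each class
        n_val = round(len(idx) * val_fraction)    # same fraction from every class
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy stand-in for train["label"]: 50 negative and 50 positive reviews.
labels = [0] * 50 + [1] * 50
train_idx, val_idx = stratified_split(labels, val_fraction=0.1, seed=0)

print(len(train_idx), len(val_idx))           # 90 10
print(Counter(labels[i] for i in val_idx))    # 5 of each class
```

In a scikit-learn workflow, `train_test_split(..., stratify=labels)` does the same job; the point is only that hyper-parameters are chosen on the validation indices and the test set stays untouched.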
From raw text to tokens
The two paradigms want different preprocessing. The linear baseline benefits from aggressive
normalization (lowercasing, stripping HTML, collapsing inflections) so that Loved, loved
and loves collapse toward one feature. The transformer wants almost none of it — its subword
tokenizer and pretraining already handle case and morphology, and over-cleaning throws away signal it can use.

    import re

    def clean(text):
        text = text.lower()
        text = re.sub(r"<br\s*/?>", " ", text)    # IMDb is full of <br/> tags
        text = re.sub(r"[^a-z\s']", " ", text)    # keep letters + apostrophes
        text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
        return text

    print(clean("This movie was GREAT!!! <br/>Loved it."))
    # -> "this movie was great loved it"

We let scikit-learn's vectorizer handle tokenization and stopword removal in the next step, so the
clean function only does what regex does best: normalization. This is the same regex toolkit from
Session 6. Note we keep apostrophes so don't and wasn't survive — negation is the
single most important cue in sentiment, and crushing it is a classic beginner mistake.
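Keeping apostrophes in the cleaned text is only half the story: whether don't reaches the model as one token depends on the vectorizer's token pattern. The sketch below emulates this with plain `re`. `DEFAULT_PATTERN` is scikit-learn's documented default `token_pattern`; `APOSTROPHE_PATTERN` is a hypothetical alternative introduced here for illustration.

```python
import re

# scikit-learn's documented default token_pattern: runs of 2+ word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical alternative that lets apostrophes stay inside a token.
APOSTROPHE_PATTERN = r"(?u)\b\w[\w']+\b"

def tokens(text, pattern):
    """Split text into tokens the way a regex-based vectorizer would."""
    return re.findall(pattern, text)

cleaned = "i don't like it"   # the kind of string a cleaner that keeps apostrophes emits

default_tokens = tokens(cleaned, DEFAULT_PATTERN)
apostrophe_tokens = tokens(cleaned, APOSTROPHE_PATTERN)
print(default_tokens)      # ['don', 'like', 'it']
print(apostrophe_tokens)   # ["don't", 'like', 'it']
```

With the default pattern the contraction is split at the apostrophe and the one-letter remainder is dropped, so the negation is lost. If contractions should survive, a pattern like the second one can be passed as `token_pattern=` to `TfidfVectorizer`.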
With a subword tokenizer, a rare word such as doomscrolling becomes doom ##scroll ##ing, so no word is ever "unknown". See the BPE merges demo for how subword vocabularies are learned.
TF-IDF + logistic regression
The workhorse baseline that every NLP practitioner reaches for first. It is fast, fully interpretable, runs on a CPU in seconds, and is shockingly hard to beat on long-document sentiment. If a deep model can't clear this bar, the deep model is the problem.
The math — TF-IDF
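Each review becomes a sparse vector over the vocabulary. The weight of term $t$ in document $d$ is the term frequency times the inverse document frequency:
$$\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t), \qquad \text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$$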
where $n$ is the number of documents and $\text{df}(t)$ is how many contain $t$ (scikit-learn's smoothed,
always-positive form). Words that appear everywhere (the, movie) get
$\text{idf}\approx 1$ and are down-weighted; rare, distinctive words (masterpiece,
unwatchable) get large weights. Vectors are then L2-normalized so review length doesn't dominate.
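The weighting can be checked by hand on a toy corpus. This is a minimal sketch of the smoothed formula described above, using raw term counts for $\text{tf}$ (no sublinear scaling) and whitespace tokenization; the three documents are invented for illustration.

```python
import math
from collections import Counter

docs = [
    "the movie was a masterpiece",
    "the movie was unwatchable",
    "the plot of the movie was thin",
]
tokenized = [d.split() for d in docs]
n = len(tokenized)

# df(t): number of documents that contain term t at least once.
df = Counter(t for doc in tokenized for t in set(doc))

def idf(term):
    # Smoothed form: ln((1 + n) / (1 + df)) + 1, which is always >= 1.
    return math.log((1 + n) / (1 + df[term])) + 1

def tfidf_vector(doc):
    """Sparse TF-IDF vector (term -> weight), L2-normalized."""
    tf = Counter(doc)
    raw = {t: count * idf(t) for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

vec = tfidf_vector(tokenized[0])
print(round(idf("movie"), 3))         # 1.0   (appears in every document)
print(round(idf("masterpiece"), 3))   # 1.693 (appears in one document)
print(vec["masterpiece"] > vec["movie"])   # True
```

After normalization the vector has unit length, so a long review and a short one with the same word proportions end up with the same representation.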
The math — logistic regression
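We score the TF-IDF vector $\mathbf{x}$ with a weight vector $\mathbf{w}$ (plus a bias $b$) and squash to a probability with the sigmoid:
$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$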
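Training minimizes the binary cross-entropy (log loss) over the $N$ training examples, with an L2 penalty of strength $1/C$ to curb overfitting on the huge sparse feature space:
$$\mathcal{L}(\mathbf{w}, b) = -\sum_{i=1}^{N} \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big] + \frac{1}{2C} \lVert \mathbf{w} \rVert_2^2, \qquad p_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b)$$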
Each fitted weight $w_t$ is directly readable: a large positive $w_t$ means term $t$ pushes toward "positive". That interpretability is the baseline's superpower.
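To make the scoring and the objective concrete, here is a small standard-library sketch. The weights and feature values are hypothetical, chosen by hand rather than fitted, and the function names (`predict_proba`, `objective`) are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """P(y = 1 | x) for a sparse feature dict x and a weight dict w."""
    z = b + sum(value * w.get(term, 0.0) for term, value in x.items())
    return sigmoid(z)

def objective(X, y, w, b, C=10.0):
    """Summed log loss plus the L2 penalty ||w||^2 / (2C)."""
    eps = 1e-12  # guards against log(0)
    loss = 0.0
    for x_i, y_i in zip(X, y):
        p = predict_proba(x_i, w, b)
        loss -= y_i * math.log(p + eps) + (1 - y_i) * math.log(1 - p + eps)
    penalty = sum(v * v for v in w.values()) / (2 * C)
    return loss + penalty

# Hypothetical weights and TF-IDF values, for illustration only.
w = {"masterpiece": 2.0, "unwatchable": -2.5}
X = [{"masterpiece": 0.8, "movie": 0.3}, {"unwatchable": 0.9, "movie": 0.3}]
y = [1, 0]

p_pos = predict_proba(X[0], w, b=0.0)
p_neg = predict_proba(X[1], w, b=0.0)
print(round(p_pos, 3))   # 0.832
print(round(p_neg, 3))   # 0.095
```

The word movie has no weight here, so it contributes nothing to either score; only the distinctive terms move the probability away from 0.5.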

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.metrics import accuracy_score, f1_score, classification_report

    X_train = [clean(t) for t in train["text"]]
    X_test = [clean(t) for t in test["text"]]
    y_train, y_test = train["label"], test["label"]

    clf = Pipeline([
        ("tfidf", TfidfVectorizer(
            ngram_range=(1, 2),     # unigrams + bigrams catch "not good"
            min_df=5,               # drop terms in < 5 docs (noise)
            max_features=50_000,
            sublinear_tf=True,      # 1 + log(tf), dampens repeats
            stop_words="english")),
        ("lr", LogisticRegression(C=10.0, max_iter=1000, n_jobs=-1)),
    ])
    clf.fit(X_train, y_train)

    pred = clf.predict(X_test)
    print(f"accuracy: {accuracy_score(y_test, pred):.4f}")               # 0.8989
    print(f"macro-F1: {f1_score(y_test, pred, average='macro'):.4f}")   # 0.8988

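With unigrams alone, not good and good share the feature good, so the model can't see the negation. Adding the bigram not good as its own feature lets the linear model assign it a negative weight. This single change typically buys ~1.5 accuracy points on sentiment. One caveat: scikit-learn removes stop words before it builds n-grams, and its built-in English list contains not, so with stop_words="english" as configured above the not good bigram never reaches the model; drop that argument or pass a custom list if you want negation bigrams.
Because the model is linear, the fitted weights can be read off directly:

    import numpy as np

    vocab = np.array(clf.named_steps["tfidf"].get_feature_names_out())
    weights = clf.named_steps["lr"].coef_[0]
    order = np.argsort(weights)

    print("most POSITIVE:", vocab[order[-8:]][::-1])
    # ['excellent' 'perfect' 'wonderful' 'great' 'best' 'amazing' 'favorite' 'superb']
    print("most NEGATIVE:", vocab[order[:8]])
    # ['worst' 'waste' 'awful' 'boring' 'poorly' 'terrible' 'bad' 'disappointment']
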
Those word lists are the entire model — no black box. This is the same logistic-regression machinery you can drag around interactively in the decision-boundary demo, and the TF-IDF weighting is exactly what powers the TF-IDF search demo.
Fine-tuning DistilBERT
Now the modern approach the syllabus reaches in Session 19 ("fine-tune a transformer for a text-classification assignment"). Instead of hand-built features, we start from DistilBERT — a 66M-parameter encoder pretrained on a large corpus — and adapt it to our task. It reads the whole review with self-attention, so word order and long-range context come for free.
The math — scaled dot-product attention
Every token is projected into a query $\mathbf{q}$, a key $\mathbf{k}$ and a value $\mathbf{v}$. A token's new representation is a weighted average of all values, where the weights come from how well its query matches each key. Stacked into matrices $Q, K, V$:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
The $\sqrt{d_k}$ scaling keeps the dot products from growing with dimension and pushing the softmax into saturated, low-gradient regions. The row-wise softmax turns scores into a distribution:
$$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$
Multi-head attention runs $h$ of these in parallel on different learned projections and concatenates them, so different heads can specialize (one tracks negation, another tracks the subject, and so on).
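A single attention head is small enough to write out in plain Python. This is a sketch of the scaled dot-product step only (no learned projections, no masking, one head), and the three toy vectors are arbitrary numbers chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors.

    Returns the new representations and the attention weights per query."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)                 # one distribution per query
        all_weights.append(weights)
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs, all_weights

# Three tokens with d_k = 2; the numbers are arbitrary.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row sums to 1, and a query puts the most weight on the keys it aligns with best; the output for that token is the corresponding weighted average of the value vectors.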
The math — the same loss, a deeper model
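DistilBERT prepends a special [CLS] token; its final hidden vector $\mathbf{h}_{\text{CLS}}$ is fed to a tiny linear head to produce two logits, and we again minimize cross-entropy, the very same objective as the baseline, just over a far richer representation:
$$\mathbf{z}_i = W \mathbf{h}_{\text{CLS}}^{(i)} + \mathbf{b}, \qquad \mathcal{L} = -\sum_{i=1}^{N} \log \text{softmax}(\mathbf{z}_i)_{y_i}$$
where $\mathbf{z}_i$ holds the two logits for review $i$ and $y_i$ is its true label.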
Fine-tuning means we backpropagate this loss through all the pretrained weights, nudging them toward the sentiment task rather than learning from scratch — the transfer-learning idea from Session 22.
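The link to the baseline's loss can be checked numerically: with two logits, the softmax probability of the positive class equals a sigmoid of the logit difference. The sketch below computes the summed cross-entropy from logits; the logit values are hypothetical, and libraries typically average over the batch rather than sum, which only rescales the loss.

```python
import math

def log_softmax(logits):
    """Log-probabilities from logits, using the log-sum-exp trick for stability."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def cross_entropy(batch_logits, labels):
    """Summed negative log-likelihood of the true class."""
    return -sum(log_softmax(z)[y] for z, y in zip(batch_logits, labels))

# Hypothetical logits for three reviews, ordered [negative, positive].
batch_logits = [[-1.2, 2.3], [0.0, 0.0], [3.0, -2.0]]
labels = [1, 1, 0]

loss = cross_entropy(batch_logits, labels)

# Two-class softmax reduces to a sigmoid of the logit difference.
z_neg, z_pos = batch_logits[0]
p_softmax = math.exp(log_softmax([z_neg, z_pos])[1])
p_sigmoid = 1.0 / (1.0 + math.exp(-(z_pos - z_neg)))
print(abs(p_softmax - p_sigmoid) < 1e-12)   # True
```

The undecided middle example, with logits [0, 0], contributes exactly ln 2 to the loss, the same value a logistic-regression model pays when it outputs 0.5.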

    from transformers import AutoTokenizer

    ckpt = "distilbert-base-uncased"
    tok = AutoTokenizer.from_pretrained(ckpt)

    def tokenize(batch):
        # truncate to 256 tokens: most sentiment cues are early, and it
        # keeps fine-tuning fast. raise to 512 for the last accuracy point.
        return tok(batch["text"], truncation=True, max_length=256)

    ds_tok = ds.map(tokenize, batched=True)

    print(tok.tokenize("an unwatchable mess"))
    # ['an', 'un', '##watch', '##able', 'mess']  <- subwords, no UNKs


    import numpy as np
    import evaluate
    from transformers import (AutoModelForSequenceClassification,
                              TrainingArguments, Trainer, DataCollatorWithPadding)

    model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)
    collator = DataCollatorWithPadding(tok)    # dynamic padding per batch
    acc, f1 = evaluate.load("accuracy"), evaluate.load("f1")

    def metrics(eval_pred):
        logits, labels = eval_pred
        preds = np.argmax(logits, axis=-1)
        return {
            "accuracy": acc.compute(predictions=preds, references=labels)["accuracy"],
            "f1": f1.compute(predictions=preds, references=labels, average="macro")["f1"],
        }

    args = TrainingArguments(
        output_dir="distilbert-imdb",
        learning_rate=2e-5,                # small LR: we're nudging, not retraining
        per_device_train_batch_size=16,
        num_train_epochs=2,                # 1-2 epochs is plenty for fine-tuning
        weight_decay=0.01,
        eval_strategy="epoch",
        fp16=True)                         # mixed precision on a GPU

    trainer = Trainer(model=model, args=args,
                      train_dataset=ds_tok["train"], eval_dataset=ds_tok["test"],
                      tokenizer=tok,       # recent releases prefer processing_class=tok
                      data_collator=collator,
                      compute_metrics=metrics)
    trainer.train()
    print(trainer.evaluate())
    # {'eval_accuracy': 0.9332, 'eval_f1': 0.9331, ...}

In a dense embedding space superb and magnificent sit close together because they are near-synonyms; the baseline treats them as unrelated columns.
Cost honesty: the baseline trains in ~10 seconds on a laptop CPU; DistilBERT wants a GPU and a few minutes per epoch. For a +3 to +4 point gain you pay roughly 100× the compute. Whether that trade is worth it is a real engineering decision, exactly the kind Session 25 asks you to weigh, including the energy footprint.
How well — and how do they fail?
A single accuracy number hides everything interesting. We report accuracy and macro-F1 side by side, draw the confusion matrix, and then read actual mistakes to understand why each model errs.
The math — precision, recall, F1
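From the counts of true positives (TP), false positives (FP), and false negatives (FN):
$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}$$
The F1 score is their harmonic mean, which punishes a model that wins on one at the expense of the other:
$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$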
Macro-F1 averages the per-class F1 equally, so the rare class counts as much as the common one — the metric you'd trust if the deployment data were imbalanced.
Headline comparison
| Model | Accuracy | Macro-F1 | Train time | Params |
|---|---|---|---|---|
| TF-IDF + Logistic Regression | 0.8989 | 0.8988 | ~10 s (CPU) | ~50k |
| DistilBERT (fine-tuned, 2 ep.) | 0.9332 | 0.9331 | ~6 min (GPU) | 66M |
| Δ improvement | +3.43 pts | +3.43 pts | ≈100× cost | n/a |
A strong, transparent baseline lands at ~90% accuracy; fine-tuning lifts it to ~93%. The relative error reduction is meaningful (roughly a third of the baseline's mistakes erased), but the absolute jump is modest — a reminder that on long, opinion-rich documents, classical methods remain genuinely competitive.
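The difference between the absolute gain and the relative error reduction is easy to blur, so here is the arithmetic behind both figures, using the two accuracies from the table.

```python
baseline_acc, bert_acc = 0.8989, 0.9332

baseline_err = 1 - baseline_acc    # share of test reviews the baseline gets wrong
bert_err = 1 - bert_acc            # share DistilBERT gets wrong

absolute_gain_pts = (bert_acc - baseline_acc) * 100
relative_error_reduction = (baseline_err - bert_err) / baseline_err

print(f"{absolute_gain_pts:.2f} points")             # 3.43 points
print(f"{relative_error_reduction:.1%} of errors")   # 33.9% of errors
```

The same 3.43 points would be a far smaller relative improvement on a baseline sitting at 70% accuracy, which is why both numbers are worth quoting.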
Confusion matrix — DistilBERT on the 25k test set
Errors are roughly symmetric (917 vs 753), so the model isn't systematically biased toward one class. Compare this with the interactive confusion-matrix demo, which recomputes precision/recall/F1 live as you relabel data.

    from sklearn.metrics import confusion_matrix, classification_report

    logits = trainer.predict(ds_tok["test"]).predictions
    bert_pred = logits.argmax(axis=-1)

    print(confusion_matrix(y_test, bert_pred))
    # [[11583   917]
    #  [  753 11747]]
    print(classification_report(y_test, bert_pred,
                                target_names=["neg", "pos"], digits=3))

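The four counts in the confusion matrix are enough to recompute the headline metrics by hand. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predictions) and uses the counts quoted in this section; the helper name `per_class_metrics` is illustrative.

```python
# Rows are true classes, columns are predictions: [[TN, FP], [FN, TP]].
cm = [[11583, 917],
      [753, 11747]]

def per_class_metrics(cm, cls):
    """Precision, recall and F1 for one class, treating it as the positive class."""
    tp = cm[cls][cls]
    fp = sum(cm[r][cls] for r in range(len(cm))) - tp   # predicted cls, truly other
    fn = sum(cm[cls]) - tp                              # truly cls, predicted other
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total
macro_f1 = sum(per_class_metrics(cm, c)[2] for c in range(2)) / 2

print(f"accuracy: {accuracy:.4f}")   # accuracy: 0.9332
for name, c in (("neg", 0), ("pos", 1)):
    p, r, f = per_class_metrics(cm, c)
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
print(f"macro-F1: {macro_f1:.4f}")
```

Because the classes are balanced and the two error counts are close, precision and recall land within about a point of each other for both classes, which is what "roughly symmetric" means in practice.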
Error analysis — where both models stumble
Reading the disagreements is the most useful 20 minutes of any project. The patterns below are typical of IMDb sentiment.
"The trailer was amazing and the cast is fantastic on paper, so I had huge hopes — what a letdown."
Why: the review is dense with positive vocabulary (amazing, fantastic,
huge hopes) describing expectations that get subverted. The baseline counts the positive words;
DistilBERT usually catches the what a letdown turn but not always.
"It shouldn't work. It's slow, the plot is thin, the budget was clearly nothing. And yet I loved every minute."
Why: a pile of negative cues followed by a late, understated reversal. Sarcasm and concessive structure ("and yet") are the hardest cases for both models; the baseline has no chance, the transformer wins this one only sometimes.
"Oh sure, a 'masterpiece'. Best comedy of the year — if you enjoy watching paint dry."
Why: pure sarcasm. Surface tokens are glowing; the meaning is the opposite. This is the canonical failure mode of bag-of-words sentiment and a known hard case even for large models.
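A toy additive scorer shows why the first example trips a bag-of-words model. The per-word weights below are invented for illustration, not read from the trained baseline, and TF-IDF scaling is left out; only the mechanism matters.

```python
# Hypothetical per-word weights in the spirit of the fitted baseline.
weights = {"amazing": 1.2, "fantastic": 1.1, "huge": 0.2, "letdown": -1.5}

def bow_score(text, weights):
    """Sum the weights of known words, ignoring order and context."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return sum(weights.get(w, 0.0) for w in words)

review = ("The trailer was amazing and the cast is fantastic on paper, "
          "so I had huge hopes, what a letdown.")

score = bow_score(review, weights)
label = "positive" if score > 0 else "negative"
print(label)   # positive, although the review is negative
```

Shuffling the words of the review would give exactly the same score, which is the sense in which the baseline cannot see that the praise describes expectations and the final phrase overturns them.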
What this project demonstrates
Each stage exercises specific sessions from the course. This is the table to put in your group-project report to show coverage.
Where to take it next
Strong directions for the group project, each building directly on the pipeline above.
Aspect-based sentiment
Move from "is this review positive?" to "how does it feel about the acting vs the plot vs the soundtrack?" Extract aspect terms, then classify sentiment per aspect — far more useful for product feedback.
Named-entity grounding
Run NER to attach opinions to the actors, directors and studios mentioned. "Negative toward the director, positive toward the lead" is richer than one global label.
LLM zero-shot baseline
Prompt a chat LLM ("Classify the sentiment: …") with no training at all. Compare its zero-shot accuracy and cost against your fine-tuned DistilBERT — often competitive, sometimes pricier per call.
Calibration & thresholds
Plot a reliability diagram and tune the decision threshold. For a moderation queue you may want high precision; for recall-sensitive triage, the opposite.
Robustness probing
Test against negation, typos, and adversarial edits. Where does each model break? This connects to the bias and robustness themes of Module 5.
Multilingual transfer
Swap DistilBERT for a multilingual checkpoint and test cross-lingual zero-shot: fine-tune on English, evaluate on Spanish reviews. How much sentiment signal transfers?
References & further reading
Course bibliography plus the primary sources behind the methods used here.