Worked example · Group-project blueprint

Sentiment analysis pipeline: classic → transformer

One end-to-end project, fully worked. We take a corpus of movie reviews, build a transparent TF-IDF + logistic-regression baseline, then fine-tune a small DistilBERT transformer and compare the two head-to-head, with runnable Python, the underlying math in KaTeX, and an honest error analysis. It threads the eight syllabus sessions listed below into a single deliverable.

Task: Binary sentiment
Stack: Python
Libraries: scikit-learn · 🤗 transformers · datasets
Maps to: Sessions 3 · 7 · 8 · 10 · 11 · 13 · 19 · 22

The goal

Given the raw text of a movie review, predict whether the writer felt positive or negative about the film. This is exactly the practice exercise the syllabus sets out in Session 10 ("build a sentiment analysis tool … evaluate movie reviews") and revisits in Session 11. We treat it as a chance to contrast the two paradigms the course covers: hand-engineered sparse features fed to a linear model, versus a pre-trained contextual encoder we fine-tune.

Sessions exercised

Foundations

Preprocessing & tokenization

Session 3 — clean, normalize, tokenize text into features.

Statistical NLP

TF-IDF & retrieval

Sessions 7–8 — weight terms by distinctiveness; build sparse vectors.

Classical ML

Logistic regression

Sessions 10–11 — a calibrated linear classifier with the sigmoid.

Representations

Embeddings

Session 13 — from sparse one-hot to dense contextual vectors.

Deep learning

Transformers

Sessions 19 & 22 — fine-tune a self-attention encoder.

Evaluation

Metrics & errors

Accuracy, F1, confusion matrix, and a qualitative error analysis.

How to use this page
Read it top to bottom as a recipe. The code blocks are meant to run in order in one notebook (Colab or local) once you pip install scikit-learn transformers datasets evaluate accelerate torch; later blocks reuse names defined in earlier ones. The reported numbers come from a run on the standard IMDb sentiment corpus; your exact figures will wobble by a point or two with different seeds and splits.

1 Data & task

The dataset

We use IMDb Large Movie Review (Maas et al., 2011): 50,000 reviews split evenly into 25k train / 25k test, with perfectly balanced positive/negative labels. Balanced classes mean accuracy is a meaningful headline metric — but we still report F1, because in deployment the class balance rarely stays 50/50.

Load the corpus with 🤗 datasets
from datasets import load_dataset

# 25k train / 25k test, label 0 = negative, 1 = positive
ds = load_dataset("imdb")
train, test = ds["train"], ds["test"]

print(len(train), len(test))          # 25000 25000
print(train[0]["label"])             # 0
print(train[0]["text"][:120])        # "I rented I AM CURIOUS-YELLOW ..."

# class balance sanity check
from collections import Counter
print(Counter(train["label"]))     # Counter({0: 12500, 1: 12500})

Why a held-out test set matters: every number we quote later is computed on reviews the model never saw during training. Tuning anything against the test set — even by hand — silently inflates your results. For the baseline we further carve a small validation slice out of train for hyper-parameter choices.
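
The validation slice mentioned above is not built anywhere on this page, so here is one minimal way to do it. This is a sketch on toy labels using only the standard library; the helper name stratified_split and the 20% fraction are illustrative assumptions, not the course's prescribed method.

```python
import random
from collections import Counter

def stratified_split(labels, val_fraction=0.1, seed=42):
    """Return (train_idx, val_idx) with the class ratio preserved in both."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, val_idx = [], []
    for idxs in by_class.values():
        idxs = idxs[:]                      # copy before shuffling
        rng.shuffle(idxs)
        n_val = int(round(len(idxs) * val_fraction))
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return sorted(train_idx), sorted(val_idx)

# toy stand-in for train["label"]: 10 negatives, 10 positives
toy_labels = [0] * 10 + [1] * 10
tr, va = stratified_split(toy_labels, val_fraction=0.2)
print(len(tr), len(va))                       # 16 4
print(Counter(toy_labels[i] for i in va))     # two of each class
```

In a real run you would more likely reach for Dataset.train_test_split from 🤗 datasets or scikit-learn's train_test_split with stratify=; the point is only that hyper-parameters get chosen on this slice and the test set stays untouched.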

2 Preprocessing

From raw text to tokens

The two paradigms want different preprocessing. The linear baseline benefits from normalization (lowercasing, stripping HTML) so that Loved and loved map to one feature; stemming or lemmatizing would go further and merge loves as well, though the cleaning function here stops short of that. The transformer wants almost none of it: its subword tokenizer and pretraining already handle case and morphology, and over-cleaning throws away signal it can use.

Light, classical-style cleaning for the baseline
import re

def clean(text):
    text = text.lower()
    text = re.sub(r"<br\s*/?>", " ", text)      # IMDb is full of <br/> tags
    text = re.sub(r"[^a-z\s']", " ", text)         # keep letters + apostrophes
    text = re.sub(r"\s+", " ", text).strip()        # collapse whitespace
    return text

print(clean("This movie was GREAT!!! <br/>Loved it."))
# -> "this movie was great loved it"

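We let scikit-learn's vectorizer handle tokenization in the next step, so the clean function only does what regex does best: normalization. This is the same regex toolkit from Session 6. We keep apostrophes so that don't and wasn't can survive as single tokens, because negation is the single most important cue in sentiment and crushing it is a classic beginner mistake. Two vectorizer defaults can undo that work and need overriding: the default token_pattern splits on apostrophes (don't becomes don), and the built-in English stop-word list contains not, no and never.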

Tokenization differs by model
The baseline tokenizes into words (and word bigrams) with the vectorizer's regex. DistilBERT tokenizes into subwords with WordPiece, so doomscrolling becomes doom ##scroll ##ing: an unseen word is broken into known pieces instead of collapsing to an unknown token. See the BPE merges demo for how a closely related kind of subword vocabulary is learned.
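
To make the subword split concrete, here is a sketch of the greedy longest-match-first rule that WordPiece applies at tokenization time. The vocabulary toy_vocab is a hypothetical handful of pieces chosen for the example; the real DistilBERT vocabulary has roughly 30k entries and may split these words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # continuation marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]                  # no piece fits: whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# hypothetical toy vocabulary, not the real checkpoint's
toy_vocab = {"doom", "##scroll", "##ing", "un", "##watch", "##able", "mess"}
print(wordpiece("doomscrolling", toy_vocab))  # ['doom', '##scroll', '##ing']
print(wordpiece("unwatchable", toy_vocab))    # ['un', '##watch', '##able']
print(wordpiece("zzz", toy_vocab))            # ['[UNK]']
```

The last call shows the one way an unknown token can still appear: when not even a single character of the word is in the vocabulary.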

3 Baseline

TF-IDF + logistic regression

The workhorse baseline that every NLP practitioner reaches for first. It is fast, fully interpretable, runs on a CPU in seconds, and is shockingly hard to beat on long-document sentiment. If a deep model can't clear this bar, the deep model is the problem.

The math — TF-IDF

Term weighting

Each review becomes a sparse vector over the vocabulary. The weight of term $t$ in document $d$ is the term frequency times the inverse document frequency:

$\text{tfidf}(t,d) = \text{tf}(t,d)\cdot \text{idf}(t), \qquad \text{idf}(t) = \log\frac{1+n}{1+\text{df}(t)} + 1$

where $n$ is the number of documents and $\text{df}(t)$ is how many contain $t$ (scikit-learn's smoothed, always-positive form). Words that appear everywhere (the, movie) get $\text{idf}\approx 1$ and are down-weighted; rare, distinctive words (masterpiece, unwatchable) get large weights. Vectors are then L2-normalized so review length doesn't dominate.
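
A small worked example may help check the formula by hand. The sketch below applies the smoothed idf and the L2 normalization to three made-up documents, using raw term counts for tf (the pipeline later on this page additionally sets sublinear_tf=True, which this sketch leaves out).

```python
import math
from collections import Counter

docs = [
    "great movie great cast",
    "terrible movie",
    "the movie was a masterpiece",
]
tokenized = [d.split() for d in docs]
n = len(tokenized)

# document frequency: in how many documents does each term occur?
df = Counter(t for doc in tokenized for t in set(doc))

def idf(term):
    return math.log((1 + n) / (1 + df[term])) + 1

def tfidf_vector(doc):
    tf = Counter(doc)
    raw = {t: tf[t] * idf(t) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}    # L2-normalized

print(round(idf("movie"), 3))         # 1.0   (appears in every document)
print(round(idf("masterpiece"), 3))   # 1.693 (appears in one document)
vec = tfidf_vector(tokenized[0])
print({t: round(w, 3) for t, w in vec.items()})
```

In the first document, great outweighs cast because it occurs twice, and both outweigh movie because movie occurs in every document.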

The math — logistic regression

Linear model + sigmoid

We score the TF-IDF vector $\mathbf{x}$ with a weight vector $\mathbf{w}$ and squash to a probability with the sigmoid:

$P(y=1\mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b), \qquad \sigma(z) = \dfrac{1}{1+e^{-z}}$

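Training minimizes the binary cross-entropy (log loss) averaged over the $N$ training examples, plus an L2 penalty whose strength is inversely proportional to $C$, which curbs overfitting on the huge sparse feature space:

$\mathcal{L}(\mathbf{w},b) = -\dfrac{1}{N}\sum_{i=1}^{N}\Big[y_i\log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\Big] + \dfrac{1}{2CN}\lVert \mathbf{w}\rVert_2^2, \qquad \hat{p}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b)$

This is scikit-learn's objective (the summed log loss times $C$, plus $\tfrac{1}{2}\lVert \mathbf{w}\rVert_2^2$) divided through by $CN$, so a smaller $C$ means stronger regularization.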

Each fitted weight $w_t$ is directly readable: a large positive $w_t$ means term $t$ pushes toward "positive". That interpretability is the baseline's superpower.
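
The scoring and the data term of the loss are short enough to compute by hand. The sketch below does so for two toy reviews; the weights and feature values are invented for illustration and are not taken from the fitted model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """P(y=1 | x) for a sparse feature dict x and weight dict w."""
    z = b + sum(w.get(t, 0.0) * v for t, v in x.items())
    return sigmoid(z)

def mean_log_loss(X, y, w, b):
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = predict_proba(x_i, w, b)
        total += -(y_i * math.log(p) + (1 - y_i) * math.log(1 - p))
    return total / len(X)

# made-up weights and TF-IDF-like feature values
w = {"excellent": 2.0, "awful": -2.5, "not good": -1.5}
b = 0.0
X = [{"excellent": 0.8}, {"awful": 0.6, "not good": 0.5}]
y = [1, 0]

for x_i in X:
    print(round(predict_proba(x_i, w, b), 3))   # 0.832, then 0.095
print(round(mean_log_loss(X, y, w, b), 3))      # 0.142
l2 = 0.5 * sum(v * v for v in w.values())       # the quantity the penalty scales
print(l2)                                       # 6.25
```

Reading the weight dict is the interpretability argument in miniature: each entry says which way, and how hard, a term pushes the score.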

Fit the baseline in a scikit-learn pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, f1_score, classification_report

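X_train = [clean(t) for t in train["text"]]
X_test  = [clean(t) for t in test["text"]]
# plain lists keep scikit-learn happy across datasets versions
y_train, y_test = list(train["label"]), list(test["label"])

clf = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),   # unigrams + bigrams catch "not good"
        min_df=5,             # drop terms in < 5 docs (noise)
        max_features=50_000,
        sublinear_tf=True,    # 1 + log(tf), dampens repeats
        # keep apostrophes inside tokens so "don't" stays one word
        token_pattern=r"(?u)\b\w[\w']+\b",
        # no stop_words="english": that list contains "not", "no" and
        # "never", and would delete negation before bigrams are built
        stop_words=None)),
    ("lr", LogisticRegression(C=10.0, max_iter=1000)),
])

clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# figures in the comments are the page's reported run; expect small differences
print(f"accuracy: {accuracy_score(y_test, pred):.4f}")   # 0.8989
print(f"macro-F1: {f1_score(y_test, pred, average='macro'):.4f}")  # 0.8988
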
Why bigrams?
With unigrams alone, not good and good share the feature good, so the model can't see the negation. Adding the bigram not good as its own feature lets the linear model assign it a negative weight, provided not has not already been filtered out as a stop word. This single change typically buys ~1.5 accuracy points on sentiment.

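
To see what the bigram feature buys, and how stop-word filtering can silently take it away, here is a small sketch. The ngrams helper and the three-word stop list are illustrative assumptions; scikit-learn likewise removes stop words before it builds n-grams.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """All contiguous word n-grams with n_min <= n <= n_max."""
    feats = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats

tokens = "the plot was not good".split()
print(ngrams(tokens, 1, 1))
# ['the', 'plot', 'was', 'not', 'good']
print(ngrams(tokens, 1, 2)[5:])
# ['the plot', 'plot was', 'was not', 'not good']

# filtering stop words first removes "not" before bigrams are formed
stop = {"the", "was", "not"}          # tiny illustrative stop list
kept = [t for t in tokens if t not in stop]
print(ngrams(kept, 1, 2))
# ['plot', 'good', 'plot good']
```

After filtering, the review is indistinguishable from "the plot was good", which is exactly the failure the bigrams were meant to prevent.
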
Read the model's mind — most predictive terms
import numpy as np

vocab = np.array(clf.named_steps["tfidf"].get_feature_names_out())
weights = clf.named_steps["lr"].coef_[0]
order = np.argsort(weights)

print("most POSITIVE:", vocab[order[-8:]][::-1])
# ['excellent' 'perfect' 'wonderful' 'great' 'best' 'amazing' 'favorite' 'superb']
print("most NEGATIVE:", vocab[order[:8]])
# ['worst' 'waste' 'awful' 'boring' 'poorly' 'terrible' 'bad' 'disappointment']

The full set of weights, of which those lists are the two extremes, is the entire model, with no black box. This is the same logistic-regression machinery you can drag around interactively in the decision-boundary demo, and the TF-IDF weighting is exactly what powers the TF-IDF search demo.

4 Transformer

Fine-tuning DistilBERT

Now the modern approach the syllabus reaches in Session 19 ("fine-tune a transformer for a text-classification assignment"). Instead of hand-built features, we start from DistilBERT — a 66M-parameter encoder pretrained on a large corpus — and adapt it to our task. It reads the whole review with self-attention, so word order and long-range context come for free.

The math — scaled dot-product attention

The mechanism that replaced recurrence

Every token is projected into a query $\mathbf{q}$, a key $\mathbf{k}$ and a value $\mathbf{v}$. A token's new representation is a weighted average of all values, where the weights come from how well its query matches each key. Stacked into matrices $Q, K, V$:

$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\dfrac{QK^\top}{\sqrt{d_k}}\right)V$

The $\sqrt{d_k}$ scaling keeps the dot products from growing with dimension and pushing the softmax into saturated, low-gradient regions. The row-wise softmax turns scores into a distribution:

$\text{softmax}(\mathbf{z})_i = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$

Multi-head attention runs $h$ of these in parallel on different learned projections and concatenates them, so different heads can specialize (for example, one head may come to track negation while another tracks the subject).
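
The attention formula is compact enough to run on paper-sized inputs. The following is a dependency-free sketch of a single head on three made-up token vectors; a real model works on batched tensors and learns the projections that produce Q, K and V.

```python
import math

def softmax(z):
    m = max(z)                              # subtract max for stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    out, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)                 # one distribution per query
        weights.append(w)
        out.append([sum(w_j * v[c] for w_j, v in zip(w, V))
                    for c in range(len(V[0]))])
    return out, weights

# three tokens, d_k = 2; values are made up
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row], "sum =", round(sum(row), 3))
```

Each printed row is one token's attention distribution over all three tokens, and each output vector is the corresponding weighted average of the rows of V.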

The math — the same loss, a deeper model

Classification head

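The tokenizer prepends a special [CLS] token; DistilBERT's final hidden vector at that position, $\mathbf{h}_{\text{CLS}}$, is fed to a small classification head that produces two logits, and we again minimize cross-entropy. It is the very same objective as the baseline, just over a far richer representation. (The Hugging Face head inserts one extra hidden layer with a ReLU before the final linear layer; the formula shows the final layer only.)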

$\hat{\mathbf{p}} = \text{softmax}(W\,\mathbf{h}_{\text{CLS}} + \mathbf{b}), \qquad \mathcal{L} = -\dfrac{1}{N}\sum_{i=1}^{N}\log \hat{p}_{i,\,y_i}$

Fine-tuning means we backpropagate this loss through all the pretrained weights, nudging them toward the sentiment task rather than learning from scratch — the transfer-learning idea from Session 22.
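
The claim that this is "the same loss" as the baseline can be checked numerically: a two-way softmax is a sigmoid applied to the difference of the two logits. The sketch below uses made-up logits, not outputs from the fine-tuned model.

```python
import math

def softmax(z):
    m = max(z)                              # subtract max for stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(logits, label):
    """-log of the probability assigned to the true class."""
    return -math.log(softmax(logits)[label])

logits = [-0.4, 1.2]          # made-up [negative, positive] logits
p_pos = softmax(logits)[1]
print(round(p_pos, 4), round(sigmoid(logits[1] - logits[0]), 4))  # equal
print(round(cross_entropy(logits, 1), 4))   # loss if the review is positive
print(round(cross_entropy(logits, 0), 4))   # loss if it is negative
```

The loss is small when the true label matches the larger logit and large otherwise, which is the signal backpropagated through the pretrained weights.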

Tokenize with the matching subword tokenizer
from transformers import AutoTokenizer

ckpt = "distilbert-base-uncased"
tok = AutoTokenizer.from_pretrained(ckpt)

def tokenize(batch):
    # truncate to 256 tokens: most sentiment cues are early, and it
    # keeps fine-tuning fast. raise to 512 for the last accuracy point.
    return tok(batch["text"], truncation=True, max_length=256)

ds_tok = ds.map(tokenize, batched=True)
print(tok.tokenize("an unwatchable mess"))
# ['an', 'un', '##watch', '##able', 'mess']  <- subwords, no UNKs
Fine-tune with the 🤗 Trainer
import numpy as np
import torch
import evaluate
from transformers import (AutoModelForSequenceClassification,
                          TrainingArguments, Trainer,
                          DataCollatorWithPadding)

model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)
collator = DataCollatorWithPadding(tok)        # dynamic padding per batch
acc, f1 = evaluate.load("accuracy"), evaluate.load("f1")

def metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": acc.compute(predictions=preds, references=labels)["accuracy"],
            "f1": f1.compute(predictions=preds, references=labels, average="macro")["f1"]}

args = TrainingArguments(
    output_dir="distilbert-imdb",
    learning_rate=2e-5,            # small LR: we're nudging, not retraining
    per_device_train_batch_size=16,
    num_train_epochs=2,            # 1-2 epochs is plenty for fine-tuning
    weight_decay=0.01,
    eval_strategy="epoch",         # named evaluation_strategy before transformers 4.41
    fp16=torch.cuda.is_available())  # mixed precision only when a GPU is present

trainer = Trainer(model=model, args=args,
                  train_dataset=ds_tok["train"],
                  # monitored each epoch; nothing is tuned or selected on it
                  eval_dataset=ds_tok["test"],
                  processing_class=tok,   # older transformers (< 4.46): tokenizer=tok
                  data_collator=collator,
                  compute_metrics=metrics)

trainer.train()
print(trainer.evaluate())
# {'eval_accuracy': 0.9332, 'eval_f1': 0.9331, ...}
Why fine-tuning beats the baseline here
The transformer's attention can compose meaning: "I expected to hate this but it was brilliant" flips on a single contrast that a bag of bigrams largely misses. Pretraining also means the model already "knows" that superb and magnificent are near-synonyms, whereas the baseline treats them as unrelated columns.

Cost honesty: the baseline trains in ~10 seconds on a laptop CPU; DistilBERT wants a GPU and a few minutes per epoch (about 6 minutes for the two epochs here, roughly 36× the wall-clock time on much faster hardware). For a +3 to +4 point gain you therefore pay on the order of 100× the compute. Whether that trade is worth it is a real engineering decision, exactly the kind Session 25 asks you to weigh, including the energy footprint.

5 Evaluation & comparison

How well — and how do they fail?

A single accuracy number hides everything interesting. We report accuracy and macro-F1 side by side, draw the confusion matrix, and then read actual mistakes to understand why each model errs.

The math — precision, recall, F1

Beyond accuracy

From the counts of true/false positives and negatives:

$\text{precision} = \dfrac{TP}{TP+FP}, \qquad \text{recall} = \dfrac{TP}{TP+FN}$

The F1 score is their harmonic mean — it punishes a model that wins on one at the expense of the other:

$F_1 = 2\cdot\dfrac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}$

Macro-F1 averages the per-class F1 equally, so the rare class counts as much as the common one — the metric you'd trust if the deployment data were imbalanced.

Headline comparison

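| Model | Accuracy | Macro-F1 | Train time | Params |
| --- | --- | --- | --- | --- |
| TF-IDF + Logistic Regression | 0.8989 | 0.8988 | ~10 s (CPU) | ~50k |
| DistilBERT (fine-tuned, 2 ep.) | 0.9332 | 0.9331 | ~6 min (GPU) | 66M |
| Δ improvement | +3.43 pts | +3.43 pts | ≈100× cost | |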

A strong, transparent baseline lands at ~90% accuracy; fine-tuning lifts it to ~93%. The relative error reduction is meaningful (roughly a third of the baseline's mistakes erased), but the absolute jump is modest — a reminder that on long, opinion-rich documents, classical methods remain genuinely competitive.
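
The "roughly a third" figure follows directly from the two accuracies in the table. The arithmetic is spelled out below; the timing ratio uses the table's approximate wall-clock figures, which come from different hardware and so understate the difference in raw compute.

```python
baseline_acc, bert_acc = 0.8989, 0.9332      # accuracies from the table
baseline_err = 1 - baseline_acc
bert_err = 1 - bert_acc

abs_gain_pts = (bert_acc - baseline_acc) * 100
rel_err_reduction = (baseline_err - bert_err) / baseline_err

print(f"absolute gain: {abs_gain_pts:.2f} points")           # 3.43 points
print(f"relative error reduction: {rel_err_reduction:.1%}")  # 33.9%

# wall-clock ratio from the table's rough timings (different hardware)
print(f"train-time ratio: {6 * 60 / 10:.0f}x")               # 36x
```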

Confusion matrix — DistilBERT on the 25k test set

| | pred neg | pred pos |
| --- | --- | --- |
| actual neg | 11583 (true neg) | 917 (false pos) |
| actual pos | 753 (false neg) | 11747 (true pos) |

Errors are roughly symmetric (917 vs 753), so the model isn't systematically biased toward one class. Compare this with the interactive confusion-matrix demo, which recomputes precision/recall/F1 live as you relabel data.
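
As a cross-check, the precision, recall and F1 formulas can be applied to the four counts in the matrix by hand. This is a sketch with a hypothetical helper prf; it needs no model or dataset.

```python
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

tn, fp, fn, tp = 11583, 917, 753, 11747     # counts from the matrix above

pos = prf(tp, fp, fn)            # "positive" as the class of interest
neg = prf(tn, fn, fp)            # roles swap when "negative" is the target
accuracy = (tp + tn) / (tp + tn + fp + fn)
macro_f1 = (pos[2] + neg[2]) / 2

print("pos P/R/F1:", [round(v, 3) for v in pos])   # [0.928, 0.94, 0.934]
print("neg P/R/F1:", [round(v, 3) for v in neg])   # [0.939, 0.927, 0.933]
print("accuracy:", round(accuracy, 4))             # 0.9332
print("macro-F1:", round(macro_f1, 3))             # 0.933
```

The per-class numbers sit within about a point of each other, which is the quantitative form of "errors are roughly symmetric".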

Reproduce the matrix and the per-class report
from sklearn.metrics import confusion_matrix, classification_report

logits = trainer.predict(ds_tok["test"]).predictions
bert_pred = logits.argmax(axis=-1)

print(confusion_matrix(y_test, bert_pred))
# [[11583   917]
#  [  753 11747]]
print(classification_report(y_test, bert_pred,
                            target_names=["neg", "pos"], digits=3))

Error analysis — where both models stumble

Reading the disagreements is the most useful 20 minutes of any project. The patterns below are typical of IMDb sentiment.

FALSE POSITIVE · predicted positive, truly negative

"The trailer was amazing and the cast is fantastic on paper, so I had huge hopes — what a letdown."

Why: the review is dense with positive vocabulary (amazing, fantastic, huge hopes) describing expectations that get subverted. The baseline counts the positive words; DistilBERT usually catches the what a letdown turn but not always.

FALSE NEGATIVE · predicted negative, truly positive

"It shouldn't work. It's slow, the plot is thin, the budget was clearly nothing. And yet I loved every minute."

Why: a pile of negative cues followed by a late, understated reversal. Sarcasm and concessive structure ("and yet") are the hardest cases for both models; the baseline has little chance, and the transformer wins this one only sometimes.

FALSE POSITIVE · predicted positive, truly negative

"Oh sure, a 'masterpiece'. Best comedy of the year — if you enjoy watching paint dry."

Why: pure sarcasm. Surface tokens are glowing; the meaning is the opposite. This is the canonical failure mode of bag-of-words sentiment and a known hard case even for large models.

Takeaway from the errors
Nearly every shared mistake involves contrast, negation across a clause boundary, or sarcasm: phenomena that need composition, not keyword counting. That diagnosis is what should drive your next iteration (longer context, sentence-level features, or an aspect-based reframing), not blind hyper-parameter tweaking.

6 Mapping to learning outcomes

What this project demonstrates

Each stage exercises specific sessions from the course. This is the table to put in your group-project report to show coverage.

Fundamental concepts & preprocessing. Cleaning, normalization, tokenization; the difference between word and subword tokenization.
Text statistics & IR. TF-IDF weighting and sparse vector representations of documents — the same scoring used in retrieval.
Logistic regression in text analysis. The exact practice exercise: a sentiment tool over movie reviews using a calibrated linear classifier.
Sentiment analysis. Framing the task, choosing metrics for opinion text, and interpreting positive/negative signal.
Word embeddings. Motivating the move from sparse one-hot features to dense contextual representations.
Transformers. Self-attention math and fine-tuning a pretrained encoder for text classification.
Advanced PLMs & transfer learning. Adapting a foundation model to a downstream task without training from scratch.
Ethics & real-world trade-offs. Weighing the accuracy gain against compute cost, energy, and interpretability loss.
7 Extensions

Where to take it next

Strong directions for the group project, each building directly on the pipeline above.

Aspect-based sentiment

Move from "is this review positive?" to "how does it feel about the acting vs the plot vs the soundtrack?" Extract aspect terms, then classify sentiment per aspect — far more useful for product feedback.

Named-entity grounding

Run NER to attach opinions to the actors, directors and studios mentioned. "Negative toward the director, positive toward the lead" is richer than one global label.

LLM zero-shot baseline

Prompt a chat LLM ("Classify the sentiment: …") with no training at all. Compare its zero-shot accuracy and cost against your fine-tuned DistilBERT — often competitive, sometimes pricier per call.

Calibration & thresholds

Plot a reliability diagram and tune the decision threshold. For a moderation queue you may want high precision; for recall-sensitive triage, the opposite.
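
A threshold sweep is the simplest place to start. The sketch below uses eight made-up probability scores; in the real project they could come from clf.predict_proba(X_val)[:, 1] on a validation slice, and the helper name precision_recall_at is an illustrative choice.

```python
def precision_recall_at(probs, labels, threshold):
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0   # no positive calls made
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# made-up P(positive) scores and true labels for eight reviews
probs  = [0.95, 0.90, 0.80, 0.65, 0.55, 0.40, 0.30, 0.10]
labels = [1,    1,    1,    0,    1,    0,    1,    0]

for t in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(probs, labels, t)
    print(f"threshold {t}: precision {p:.2f}, recall {r:.2f}")
# threshold 0.3: precision 0.71, recall 1.00
# threshold 0.5: precision 0.80, recall 0.80
# threshold 0.7: precision 1.00, recall 0.60
```

Raising the threshold trades recall for precision, so the right value depends on which kind of error is more expensive in the application.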

Robustness probing

Test against negation, typos, and adversarial edits. Where does each model break? This connects to the bias and robustness themes of Module 5.

Multilingual transfer

Swap DistilBERT for a multilingual checkpoint and test cross-lingual zero-shot: fine-tune on English, evaluate on Spanish reviews. How much sentiment signal transfers?

8 References

References & further reading

Course bibliography plus the primary sources behind the methods used here.

Compulsory
Jurafsky, D. & Martin, J. H. (2008/2024). Speech and Language Processing. Ch. 4 (Naive Bayes & sentiment), Ch. 5 (logistic regression), Ch. 6 (vector semantics & TF-IDF), Ch. 10 (transformers). The course's core text.
Recommended
Manning, C., Raghavan, P. & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. The definitive treatment of TF-IDF weighting and cosine ranking.
Recommended
Manning, C. & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press. Background for the statistical baseline.
Recommended
Bird, S., Klein, E. & Loper, E. (2009). Natural Language Processing with Python (NLTK). O'Reilly. Practical preprocessing recipes.
Primary
Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS. arxiv.org/abs/1706.03762 — origin of scaled dot-product attention.
Primary
Sanh, V. et al. (2019). DistilBERT, a distilled version of BERT. EMC²/NeurIPS workshop. arxiv.org/abs/1910.01108 — the model fine-tuned here.
Primary
Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers. NAACL. arxiv.org/abs/1810.04805
Data
Maas, A. et al. (2011). Learning Word Vectors for Sentiment Analysis (IMDb dataset). ACL. ai.stanford.edu/~amaas/data/sentiment
Tools
scikit-learn (scikit-learn.org) & Hugging Face Transformers / Datasets (huggingface.co/docs). The libraries used throughout — both featured in Session 5.