Worked example · Group-project blueprint

Sentiment analysis pipeline: classic → transformer.

One end-to-end project, fully worked. We take a corpus of movie reviews, build a transparent TF-IDF + logistic-regression baseline, then fine-tune a small DistilBERT transformer and compare them head-to-head — with real, runnable Python, the underlying math in KaTeX, and an honest error analysis. It threads together roughly a dozen sessions from the syllabus into a single deliverable.

Task: Binary sentiment
Stack: Python
Libraries: scikit-learn · 🤗 transformers · datasets
Maps to: Sessions 3 · 7 · 8 · 10 · 11 · 13 · 19 · 22

The goal

Given the raw text of a movie review, predict whether the writer felt positive or negative about the film. This is exactly the practice exercise the syllabus sets out in Session 10 ("build a sentiment analysis tool … evaluate movie reviews") and revisits in Session 11. We treat it as a chance to contrast the two paradigms the course covers: hand-engineered sparse features fed to a linear model, versus a pre-trained contextual encoder we fine-tune.

Sessions exercised

Foundations

Preprocessing & tokenization

Session 3 — clean, normalize, tokenize text into features.

Statistical NLP

TF-IDF & retrieval

Sessions 7–8 — weight terms by distinctiveness; build sparse vectors.

Classical ML

Logistic regression

Sessions 10–11 — a calibrated linear classifier with the sigmoid.

Representations

Embeddings

Session 13 — from sparse one-hot to dense contextual vectors.

Deep learning

Transformers

Sessions 19 & 22 — fine-tune a self-attention encoder.

Evaluation

Metrics & errors

Accuracy, F1, confusion matrix, and a qualitative error analysis.

How to use this page: read it top to bottom as a recipe. Every code block runs as-is in a notebook (Colab or local) once you pip install scikit-learn transformers datasets evaluate accelerate torch (evaluate supplies the metric loaders used in the fine-tuning step, and recent versions of the 🤗 Trainer require accelerate). The reported numbers come from a run on the standard IMDb sentiment corpus; your exact figures will wobble by a point or two with different seeds and splits.

1 Data & task

The dataset

We use the IMDb Large Movie Review Dataset (Maas et al., 2011): 50,000 labeled reviews split evenly into 25k train / 25k test, with perfectly balanced positive/negative labels. Balanced classes mean accuracy is a meaningful headline metric, but we still report F1, because in deployment the class balance rarely stays 50/50.

Load the corpus with 🤗 datasets

from datasets import load_dataset

# 25k train / 25k test, label 0 = negative, 1 = positive
ds = load_dataset("imdb")
train, test = ds["train"], ds["test"]

print(len(train), len(test))          # 25000 25000
print(train[0]["label"])             # 0
print(train[0]["text"][:120])        # "I rented I AM CURIOUS-YELLOW ..."

# class balance sanity check
from collections import Counter
print(Counter(train["label"]))     # Counter({0: 12500, 1: 12500})

Why a held-out test set matters: every number we quote later is computed on reviews the model never saw during training. Tuning anything against the test set — even by hand — silently inflates your results. For the baseline we further carve a small validation slice out of train for hyper-parameter choices.
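The validation slice mentioned above can be carved out in a few lines. What follows is a minimal standard-library sketch rather than the page's own code: the function name stratified_split, the 10% fraction and the seed are illustrative assumptions, and the toy label list stands in for train["label"]. scikit-learn's train_test_split with its stratify argument does the same job in one call.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, val_fraction=0.1, seed=13):
    """Return (train_idx, val_idx), keeping the class ratio in both parts.

    `val_fraction` and `seed` are illustrative defaults, not course settings.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, val_idx = [], []
    for y, idxs in by_class.items():
        rng.shuffle(idxs)                        # shuffle within each class
        n_val = round(len(idxs) * val_fraction)  # same share taken from every class
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return sorted(train_idx), sorted(val_idx)

# toy stand-in for train["label"]: 50 negative, 50 positive
toy_labels = [0] * 50 + [1] * 50
tr, va = stratified_split(toy_labels)
print(len(tr), len(va))
print(Counter(toy_labels[i] for i in va))
```

Hyper-parameters such as C or min_df would then be chosen on the validation indices only, leaving the 25k test reviews untouched until the final report.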

2 Preprocessing

From raw text to tokens

The two paradigms want different preprocessing. The linear baseline benefits from normalization: lowercasing and stripping HTML merge Loved and loved into one feature, and adding a stemmer or lemmatizer (not part of the light cleaning used here) would fold in loves as well. The transformer wants almost none of it: its subword tokenizer and pretraining already handle case and morphology, and over-cleaning throws away signal it can use.

Light, classical-style cleaning for the baseline

import re

def clean(text):
    text = text.lower()
    text = re.sub(r"<br\s*/?>", " ", text)      # IMDb is full of <br/> tags
    text = re.sub(r"[^a-z\s']", " ", text)         # keep letters + apostrophes
    text = re.sub(r"\s+", " ", text).strip()        # collapse whitespace
    return text

print(clean("This movie was GREAT!!! <br/>Loved it."))
# -> "this movie was great loved it"

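We let scikit-learn's vectorizer handle tokenization and stopword removal in the next step, so the clean function only does what regex does best: normalization. This is the same regex toolkit from Session 6. Note that we keep apostrophes so don't and wasn't survive cleaning: negation is the single most important cue in sentiment, and crushing it is a classic beginner mistake. Surviving cleaning is only half the job, though. The vectorizer's default token pattern splits on apostrophes (don't becomes don), and its built-in English stop list includes not, no and never, so both settings need attention if negation is to reach the classifier.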
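To see why keeping apostrophes in clean is not sufficient on its own, here is a small sketch using only the re module. The first pattern is scikit-learn's documented default token_pattern; the second is one possible alternative, offered as an assumption rather than as the course's setting.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"         # scikit-learn's default token_pattern
APOSTROPHE_PATTERN = r"(?u)\b\w[\w']+\b"   # hypothetical alternative that keeps contractions whole

text = "i don't like it"
default_tokens = re.findall(DEFAULT_PATTERN, text)
kept_tokens = re.findall(APOSTROPHE_PATTERN, text)

print(default_tokens)   # ['don', 'like', 'it']
print(kept_tokens)      # ["don't", 'like', 'it']
```

With the default pattern the contraction loses its negating half, so the feature don carries no polarity; passing a pattern like the second one as token_pattern to the vectorizer is one way to keep it.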

Tokenization differs by model: the baseline splits the cleaned text into words (and word bigrams) using the vectorizer's regular-expression token pattern. DistilBERT tokenizes into subwords with WordPiece, so doomscrolling becomes doom ##scroll ##ing and ordinary words are almost never mapped to [UNK]. See the BPE merges demo for how a closely related kind of subword vocabulary is learned.

3 Baseline

TF-IDF + logistic regression

The workhorse baseline that every NLP practitioner reaches for first. It is fast, fully interpretable, runs on a CPU in seconds, and is shockingly hard to beat on long-document sentiment. If a deep model can't clear this bar, the deep model is the problem.

The math — TF-IDF

Term weighting

Each review becomes a sparse vector over the vocabulary. The weight of term $t$ in document $d$ is the term frequency times the inverse document frequency:

$\text{tfidf}(t,d) = \text{tf}(t,d)\cdot \text{idf}(t), \qquad \text{idf}(t) = \log\frac{1+n}{1+\text{df}(t)} + 1$

where $n$ is the number of documents and $\text{df}(t)$ is how many contain $t$ (scikit-learn's smoothed, always-positive form). Words that appear everywhere (the, movie) get $\text{idf}\approx 1$ and are down-weighted; rare, distinctive words (masterpiece, unwatchable) get large weights. Vectors are then L2-normalized so review length doesn't dominate.
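A worked instance of the formula may help. The sketch below applies the smoothed idf and the L2 normalization exactly as written above, using raw counts for tf (the pipeline later swaps in a sublinear tf). The document frequencies are hypothetical, chosen only to contrast a ubiquitous word with rare ones.

```python
import math

def smoothed_idf(n_docs, df):
    """idf(t) = ln((1 + n) / (1 + df)) + 1, the smoothed form given above."""
    return math.log((1 + n_docs) / (1 + df)) + 1

def l2_normalize(vec):
    """Scale a sparse {term: weight} vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else vec

n_docs = 25_000
# hypothetical document frequencies, for illustration only
df = {"movie": 20_000, "masterpiece": 400, "unwatchable": 40}

# raw term counts in one toy review
tf = {"movie": 3, "masterpiece": 1}

weights = {t: tf[t] * smoothed_idf(n_docs, df[t]) for t in tf}
unit = l2_normalize(weights)

for t in df:
    print(t, round(smoothed_idf(n_docs, df[t]), 2))
print({t: round(w, 3) for t, w in unit.items()})
```

Even though movie occurs three times in the toy review and masterpiece only once, the rarer word ends up with the larger share of the normalized vector.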

The math — logistic regression

Linear model + sigmoid

We score the TF-IDF vector $\mathbf{x}$ with a weight vector $\mathbf{w}$ and squash to a probability with the sigmoid:

$P(y=1\mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b), \qquad \sigma(z) = \dfrac{1}{1+e^{-z}}$

Training minimizes the mean binary cross-entropy (log loss) over the $N$ training examples, plus an L2 penalty that curbs overfitting on the huge sparse feature space. In scikit-learn's parameterization $C$ multiplies the summed loss, so against the mean loss the penalty has strength $1/(CN)$; a larger $C$ means weaker regularization:

$\mathcal{L}(\mathbf{w},b) = -\dfrac{1}{N}\sum_{i=1}^{N}\Big[y_i\log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\Big] + \dfrac{1}{2CN}\lVert \mathbf{w}\rVert_2^2$

Each fitted weight $w_t$ is directly readable: a large positive $w_t$ means term $t$ pushes toward "positive". That interpretability is the baseline's superpower.
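The scoring rule and the data term of the loss are short enough to write out by hand. This is a from-scratch sketch with made-up weights and feature values, not the fitted model; it only illustrates how a sparse vector turns into a probability and how the log loss is averaged.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """P(y=1 | x) for sparse dict vectors w (weights) and x (features)."""
    z = b + sum(w.get(t, 0.0) * v for t, v in x.items())
    return sigmoid(z)

def log_loss(w, b, X, y):
    """Mean binary cross-entropy over the examples (penalty term omitted)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = predict_proba(w, b, x_i)
        total += -(y_i * math.log(p) + (1 - y_i) * math.log(1 - p))
    return total / len(X)

# hypothetical weights and TF-IDF values, for illustration only
w = {"excellent": 2.0, "worst": -2.5, "not good": -1.5}
b = 0.0
X = [{"excellent": 0.8}, {"worst": 0.6, "not good": 0.5}]
y = [1, 0]

for x_i in X:
    print(round(predict_proba(w, b, x_i), 3))
print(round(log_loss(w, b, X, y), 3))
```

A review containing only features with zero weight scores exactly 0.5, which is why the sign and size of each weight can be read directly as evidence for or against "positive".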

Fit the baseline in a scikit-learn pipeline

from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, f1_score, classification_report

X_train = [clean(t) for t in train["text"]]
X_test  = [clean(t) for t in test["text"]]
y_train, y_test = list(train["label"]), list(test["label"])

# The built-in English stop list contains negations; keep them so that
# bigrams such as "not good" can actually be formed.
NEGATIONS = {"no", "not", "nor", "never", "cannot"}
stop_words = sorted(ENGLISH_STOP_WORDS - NEGATIONS)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 2),   # unigrams + bigrams catch "not good"
        min_df=5,             # drop terms in < 5 docs (noise)
        max_features=50_000,
        sublinear_tf=True,      # 1 + log(tf), dampens repeats
        token_pattern=r"(?u)\b\w[\w']+\b",  # keep "don't", "wasn't" whole
        stop_words=stop_words)),
    ("lr", LogisticRegression(C=10.0, max_iter=1000)),
])

clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print(f"accuracy: {accuracy_score(y_test, pred):.4f}")   # ~0.90 (0.8989 in the reported run)
print(f"macro-F1: {f1_score(y_test, pred, average='macro'):.4f}")  # ~0.90 (0.8988 in the reported run)

Why bigrams? With unigrams alone, not good and good share the feature good, so the model can't see the negation. Adding the bigram not good as its own feature lets the linear model assign it a negative weight, provided not survives tokenization and stop-word filtering. This single change typically buys ~1.5 accuracy points on sentiment.
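The effect is easy to reproduce without any library. The helper below is a hypothetical stand-in for the vectorizer's n-gram step; it mirrors scikit-learn's behavior of filtering stop words before n-grams are built, which is why a stop list containing not quietly removes the bigram.

```python
def word_ngrams(tokens, n_min=1, n_max=2):
    """All contiguous n-grams of length n_min..n_max, joined with spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

NEGATION_STOP = {"not"}   # stand-in for a stop list that contains "not"

tokens = "this was not good".split()
with_negation = word_ngrams(tokens)
without_negation = word_ngrams([t for t in tokens if t not in NEGATION_STOP])

print("not good" in with_negation)      # True
print("not good" in without_negation)   # False
print(without_negation)
```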

Read the model's mind — most predictive terms

import numpy as np

vocab = np.array(clf.named_steps["tfidf"].get_feature_names_out())
weights = clf.named_steps["lr"].coef_[0]
order = np.argsort(weights)

print("most POSITIVE:", vocab[order[-8:]][::-1])
# ['excellent' 'perfect' 'wonderful' 'great' 'best' 'amazing' 'favorite' 'superb']
print("most NEGATIVE:", vocab[order[:8]])
# ['worst' 'waste' 'awful' 'boring' 'poorly' 'terrible' 'bad' 'disappointment']

Weights like these are the entire model: one number per feature, no black box. This is the same logistic-regression machinery you can drag around interactively in the decision-boundary demo, and the TF-IDF weighting is exactly what powers the TF-IDF search demo.

4 Transformer

Fine-tuning DistilBERT

Now the modern approach the syllabus reaches in Session 19 ("fine-tune a transformer for a text-classification assignment"). Instead of hand-built features, we start from DistilBERT — a 66M-parameter encoder pretrained on a large corpus — and adapt it to our task. It reads the whole review with self-attention, so word order and long-range context come for free.

The math — scaled dot-product attention

The mechanism that replaced recurrence

Every token is projected into a query $\mathbf{q}$, a key $\mathbf{k}$ and a value $\mathbf{v}$. A token's new representation is a weighted average of all values, where the weights come from how well its query matches each key. Stacked into matrices $Q, K, V$:

$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\dfrac{QK^\top}{\sqrt{d_k}}\right)V$

The $\sqrt{d_k}$ scaling keeps the dot products from growing with dimension and pushing the softmax into saturated, low-gradient regions. The row-wise softmax turns scores into a distribution:

$\text{softmax}(\mathbf{z})_i = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$

Multi-head attention runs $h$ of these in parallel on different learned projections and concatenates them, so different heads can specialize (one tracks negation, another tracks the subject, and so on).
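The attention formula fits in a few lines of plain Python. This is a single-head sketch on three toy tokens with arbitrary numbers; it omits the learned projections that produce Q, K and V, as well as masking and multiple heads.

```python
import math

def softmax(z):
    m = max(z)                                   # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with matrices given as lists of row vectors."""
    d_k = len(K[0])
    out, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        a = softmax(scores)                      # one distribution over tokens per query
        weights.append(a)
        out.append([sum(a_j * v[c] for a_j, v in zip(a, V))
                    for c in range(len(V[0]))])
    return out, weights

# three toy tokens with d_k = 2; the values are arbitrary
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

out, weights = attention(Q, K, V)
for row in weights:
    print([round(a, 3) for a in row])
```

Each printed row sums to 1: it is the distribution one token uses to average over the values of all tokens, itself included.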

The math — the same loss, a deeper model

Classification head

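The tokenizer prepends a special [CLS] token; its final hidden vector $\mathbf{h}_{\text{CLS}}$ is fed to a small classification head that produces two logits (written here as a single linear map; the Hugging Face implementation puts one hidden layer with a ReLU and dropout in front of it). We again minimize cross-entropy, the very same objective as the baseline, just over a far richer representation: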

$\hat{\mathbf{p}} = \text{softmax}(W\,\mathbf{h}_{\text{CLS}} + \mathbf{b}), \qquad \mathcal{L} = -\dfrac{1}{N}\sum_{i=1}^{N}\log \hat{p}_{i,\,y_i}$

Fine-tuning means we backpropagate this loss through all the pretrained weights, nudging them toward the sentiment task rather than learning from scratch — the transfer-learning idea from Session 22.
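To make the "same objective" claim concrete, the sketch below computes the softmax cross-entropy from two-logit outputs and checks that a two-class softmax is a sigmoid of the logit difference. The logits are invented for illustration and do not come from the fine-tuned model.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits_batch, labels):
    """Mean of -log p[i, y_i], matching the loss written above."""
    losses = [-math.log(softmax(z)[y]) for z, y in zip(logits_batch, labels)]
    return sum(losses) / len(losses)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical head outputs [negative logit, positive logit] for three reviews
logits = [[-1.2, 2.3], [0.4, -0.1], [1.5, -2.0]]
labels = [1, 1, 0]

print(round(cross_entropy(logits, labels), 4))

# with two classes, softmax reduces to a sigmoid of the logit difference
z_neg, z_pos = logits[0]
print(abs(softmax(logits[0])[1] - sigmoid(z_pos - z_neg)) < 1e-12)   # True
```

So the transformer's head and the baseline's logistic regression optimize the same quantity; what differs is the vector the final linear map is applied to.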

Tokenize with the matching subword tokenizer

from transformers import AutoTokenizer

ckpt = "distilbert-base-uncased"
tok = AutoTokenizer.from_pretrained(ckpt)

def tokenize(batch):
    # truncate to 256 tokens: most sentiment cues are early, and it
    # keeps fine-tuning fast. raise to 512 for the last accuracy point.
    return tok(batch["text"], truncation=True, max_length=256)

ds_tok = ds.map(tokenize, batched=True)
print(tok.tokenize("an unwatchable mess"))
# ['an', 'un', '##watch', '##able', 'mess']  <- subwords, no UNKs

Fine-tune with the 🤗 Trainer

import numpy as np
import torch
import evaluate
from transformers import (AutoModelForSequenceClassification,
                          TrainingArguments, Trainer,
                          DataCollatorWithPadding)

model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)
collator = DataCollatorWithPadding(tok)        # dynamic padding per batch
acc, f1 = evaluate.load("accuracy"), evaluate.load("f1")

def metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": acc.compute(predictions=preds, references=labels)["accuracy"],
            "f1": f1.compute(predictions=preds, references=labels, average="macro")["f1"]}

args = TrainingArguments(
    output_dir="distilbert-imdb",
    learning_rate=2e-5,            # small LR: we're nudging, not retraining
    per_device_train_batch_size=16,
    num_train_epochs=2,              # 1-2 epochs is plenty for fine-tuning
    weight_decay=0.01,
    eval_strategy="epoch",         # called evaluation_strategy in older transformers releases
    fp16=torch.cuda.is_available())  # mixed precision only when a GPU is present

# Padding is handled by the collator, so the Trainer does not need the tokenizer
# (the keyword for passing it has changed across transformers releases).
# The per-epoch evaluation on the test split is for monitoring only;
# do not use it to pick a checkpoint or tune settings.
trainer = Trainer(model=model, args=args,
                  train_dataset=ds_tok["train"],
                  eval_dataset=ds_tok["test"],
                  data_collator=collator,
                  compute_metrics=metrics)

trainer.train()
print(trainer.evaluate())
# {'eval_accuracy': 0.9332, 'eval_f1': 0.9331, ...}

Why fine-tuning beats the baseline here: the transformer's attention can compose meaning. "I expected to hate this but it was brilliant" flips on a single contrast that bag-of-bigrams largely misses. Pretraining also means the model already "knows" that superb and magnificent are near-synonyms, whereas the baseline treats them as unrelated columns.

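Cost honesty: the baseline trains in ~10 seconds on a laptop CPU; DistilBERT wants a GPU and a few minutes per epoch. By the timings in the comparison table that is about 36× the wall-clock time (~6 min versus ~10 s), and because those minutes run on a GPU rather than a laptop CPU, the compute bill is plausibly closer to 100×. All of that buys a +3 to +4 point gain. Whether that trade is worth it is a real engineering decision, exactly the kind Session 25 asks you to weigh, including the energy footprint.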

5 Evaluation & comparison

How well — and how do they fail?

A single accuracy number hides everything interesting. We report accuracy and macro-F1 side by side, draw the confusion matrix, and then read actual mistakes to understand why each model errs.

The math — precision, recall, F1

Beyond accuracy

From the counts of true/false positives and negatives:

$\text{precision} = \dfrac{TP}{TP+FP}, \qquad \text{recall} = \dfrac{TP}{TP+FN}$

The F1 score is their harmonic mean — it punishes a model that wins on one at the expense of the other:

$F_1 = 2\cdot\dfrac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}$

Macro-F1 averages the per-class F1 equally, so the rare class counts as much as the common one — the metric you'd trust if the deployment data were imbalanced.
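These definitions can be checked by hand against the DistilBERT confusion-matrix counts reported further down this page. The sketch uses plain arithmetic only; the helper name prf is an assumption introduced for the example.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one class from its raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# counts from the DistilBERT confusion matrix reported on this page
tn, fp, fn, tp = 11583, 917, 753, 11747

pos = prf(tp=tp, fp=fp, fn=fn)    # "positive" as the target class
neg = prf(tp=tn, fp=fn, fn=fp)    # roles swapped to score the "negative" class

accuracy = (tp + tn) / (tp + tn + fp + fn)
macro_f1 = (pos[2] + neg[2]) / 2

print(f"accuracy: {accuracy:.4f}")
print(f"pos P/R/F1: {pos[0]:.3f} {pos[1]:.3f} {pos[2]:.3f}")
print(f"neg P/R/F1: {neg[0]:.3f} {neg[1]:.3f} {neg[2]:.3f}")
print(f"macro-F1: {macro_f1:.4f}")
```

The accuracy reproduces the reported 0.9332, and the macro-F1 lands within rounding distance of the reported 0.9331. Because the classes are balanced and the two error counts are similar, the two metrics nearly coincide; on skewed data they would drift apart.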

Headline comparison

Model	Accuracy	Macro-F1	Train time	Params
TF-IDF + Logistic Regression	0.8989	0.8988	~10 s (CPU)	~50k
DistilBERT (fine-tuned, 2 ep.)	0.9332	0.9331	~6 min (GPU)	66M
Δ improvement	+3.43 pts	+3.43 pts	≈100× cost	—

A strong, transparent baseline lands at ~90% accuracy; fine-tuning lifts it to ~93%. The relative error reduction is meaningful (roughly a third of the baseline's mistakes erased), but the absolute jump is modest — a reminder that on long, opinion-rich documents, classical methods remain genuinely competitive.

Confusion matrix — DistilBERT on the 25k test set

Actual / predicted	pred neg	pred pos
actual neg	11583 (true neg)	917 (false pos)
actual pos	753 (false neg)	11747 (true pos)

Errors are roughly symmetric (917 vs 753), so the model isn't systematically biased toward one class. Compare this with the interactive confusion-matrix demo, which recomputes precision/recall/F1 live as you relabel data.

Reproduce the matrix and the per-class report

from sklearn.metrics import confusion_matrix, classification_report

logits = trainer.predict(ds_tok["test"]).predictions
bert_pred = logits.argmax(axis=-1)

print(confusion_matrix(y_test, bert_pred))
# [[11583   917]
#  [  753 11747]]
print(classification_report(y_test, bert_pred,
                            target_names=["neg", "pos"], digits=3))

Error analysis — where both models stumble

Reading the disagreements is the most useful 20 minutes of any project. The patterns below are typical of IMDb sentiment.

FALSE POSITIVE · predicted positive · truly negative

"The trailer was amazing and the cast is fantastic on paper, so I had huge hopes — what a letdown."

Why: the review is dense with positive vocabulary (amazing, fantastic, huge hopes) describing expectations that get subverted. The baseline counts the positive words; DistilBERT usually catches the what a letdown turn but not always.

FALSE NEGATIVE · predicted negative · truly positive

"It shouldn't work. It's slow, the plot is thin, the budget was clearly nothing. And yet I loved every minute."

Why: a pile of negative cues followed by a late, understated reversal. Concessive structure ("and yet") is among the hardest cases for both models: the baseline has almost no chance, and the transformer wins this one only sometimes.

FALSE POSITIVE · predicted positive · truly negative

"Oh sure, a 'masterpiece'. Best comedy of the year — if you enjoy watching paint dry."

Why: pure sarcasm. Surface tokens are glowing; the meaning is the opposite. This is the canonical failure mode of bag-of-words sentiment and a known hard case even for large models.

Takeaway from the errors: nearly every shared mistake involves contrast, negation across a clause boundary, or sarcasm. These are phenomena that need composition, not keyword counting. That diagnosis is what should drive your next iteration (longer context, sentence-level features, or an aspect-based reframing), not blind hyper-parameter tweaking.

6 Mapping to learning outcomes

What this project demonstrates

Each stage exercises specific sessions from the course. This is the table to put in your group-project report to show coverage.

Session 3

Fundamental concepts & preprocessing. Cleaning, normalization, tokenization; the difference between word and subword tokenization.

Sessions 7–8

Text statistics & IR. TF-IDF weighting and sparse vector representations of documents — the same scoring used in retrieval.

Session 10

Logistic regression in text analysis. The exact practice exercise: a sentiment tool over movie reviews using a calibrated linear classifier.

Session 11

Sentiment analysis. Framing the task, choosing metrics for opinion text, and interpreting positive/negative signal.

Session 13

Word embeddings. Motivating the move from sparse one-hot features to dense contextual representations.

Session 19

Transformers. Self-attention math and fine-tuning a pretrained encoder for text classification.

Session 22

Advanced PLMs & transfer learning. Adapting a foundation model to a downstream task without training from scratch.

Session 25

Ethics & real-world trade-offs. Weighing the accuracy gain against compute cost, energy, and interpretability loss.

7 Extensions

Where to take it next

Strong directions for the group project, each building directly on the pipeline above.

Aspect-based sentiment

Move from "is this review positive?" to "how does it feel about the acting vs the plot vs the soundtrack?" Extract aspect terms, then classify sentiment per aspect — far more useful for product feedback.

Named-entity grounding

Run NER to attach opinions to the actors, directors and studios mentioned. "Negative toward the director, positive toward the lead" is richer than one global label.

LLM zero-shot baseline

Prompt a chat LLM ("Classify the sentiment: …") with no training at all. Compare its zero-shot accuracy and cost against your fine-tuned DistilBERT — often competitive, sometimes pricier per call.

Calibration & thresholds

Plot a reliability diagram and tune the decision threshold. For a moderation queue you may want high precision; for recall-sensitive triage, the opposite.
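As a starting point for the threshold part of this extension, the sketch below sweeps a decision threshold over predicted probabilities and reports precision and recall at each setting. The probabilities and labels are made up for illustration; in the project they would come from a validation slice, never from the test set.

```python
def precision_recall_at(probs, labels, threshold):
    """Precision and recall for the positive class at a given threshold."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0   # convention when nothing is flagged
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# made-up validation probabilities and gold labels, for illustration only
probs  = [0.95, 0.90, 0.80, 0.65, 0.55, 0.45, 0.30, 0.20]
labels = [1,    1,    1,    0,    1,    0,    1,    0]

for t in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(probs, labels, t)
    print(f"threshold {t:.1f}: precision {p:.2f}, recall {r:.2f}")
```

Raising the threshold trades recall for precision, which is the knob a moderation queue and a recall-sensitive triage would set in opposite directions.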

Robustness probing

Test against negation, typos, and adversarial edits. Where does each model break? This connects to the bias and robustness themes of Module 5.

Multilingual transfer

Swap DistilBERT for a multilingual checkpoint and test cross-lingual zero-shot: fine-tune on English, evaluate on Spanish reviews. How much sentiment signal transfers?

8 References

References & further reading

Course bibliography plus the primary sources behind the methods used here.

Compulsory

Jurafsky, D. & Martin, J. H. (2008/2024). Speech and Language Processing. Ch. 4 (Naive Bayes & sentiment), Ch. 5 (logistic regression), Ch. 6 (vector semantics & TF-IDF), Ch. 10 (transformers). The course's core text.

Recommended

Manning, C., Raghavan, P. & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. The definitive treatment of TF-IDF weighting and cosine ranking.

Recommended

Manning, C. & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press. Background for the statistical baseline.

Recommended

Bird, S., Klein, E. & Loper, E. (2009). Natural Language Processing with Python (NLTK). O'Reilly. Practical preprocessing recipes.

Primary

Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS. arxiv.org/abs/1706.03762 — origin of scaled dot-product attention.

Primary

Sanh, V. et al. (2019). DistilBERT, a distilled version of BERT. EMC²/NeurIPS workshop. arxiv.org/abs/1910.01108 — the model fine-tuned here.

Primary

Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers. NAACL. arxiv.org/abs/1810.04805

Data

Maas, A. et al. (2011). Learning Word Vectors for Sentiment Analysis (IMDb dataset). ACL. ai.stanford.edu/~amaas/data/sentiment

Tools

scikit-learn (scikit-learn.org) & Hugging Face Transformers / Datasets (huggingface.co/docs). The libraries used throughout — both featured in Session 5.