From-Scratch Build · Natural Language Processing

Document Classifier

Drop in a piece of text — a forum post, an article, a support ticket — and get it sorted into its category, from the words alone. It's the workhorse task of applied NLP, built end to end here on the real 20 Newsgroups corpus: TF-IDF features, a logistic-regression baseline, and a Multinomial Naive Bayes I wrote from scratch in numpy to compare against it.

Python · TF-IDF · Naive Bayes · LogisticRegression · scikit-learn · 20 Newsgroups

What it is

The foundational text task, done properly

Document classification sounds mundane until you build it: take free-form text of any length and assign it a label — spam or not, news topic, ticket department, sentiment. Almost every practical NLP system has this shape somewhere inside it, which is exactly why it's worth building carefully rather than importing.

This build takes a real corpus (a 5-category slice of 20 Newsgroups), turns each document into a TF-IDF vector, then trains two classifiers on identical features: a logistic-regression baseline, and a Multinomial Naive Bayes implemented from scratch in numpy. Both are scored on a held-out test split so the comparison is honest, and headers, footers and quotes are stripped so the model learns from prose rather than leaked metadata.

bag → vector
the core move: turning unbounded, unstructured text into a fixed-length vector a classifier can learn from.
comp.sys.mac.hardware · rec.autos · sci.med · sci.space · talk.politics.guns

The stack

From raw text to a label

Two models, identical features, one honest comparison.

data

20 Newsgroups

A real labelled corpus, 5 categories, split into 2,905 train / 1,935 test documents — the ground truth.
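A minimal sketch of how this slice could be loaded, assuming scikit-learn's built-in `fetch_20newsgroups` loader and its predefined train/test split. The helper name `load_split` is hypothetical and not taken from `train.py`.

```python
from sklearn.datasets import fetch_20newsgroups

# The five newsgroups used in this build.
CATEGORIES = [
    "comp.sys.mac.hardware",
    "rec.autos",
    "sci.med",
    "sci.space",
    "talk.politics.guns",
]


def load_split(subset):
    """Fetch one predefined split ("train" or "test") of the 5-category slice.

    The corpus is downloaded and cached on first use, so the first call
    needs network access. Headers, footers and quoted replies are removed
    by the loader itself.
    """
    return fetch_20newsgroups(
        subset=subset,
        categories=CATEGORIES,
        remove=("headers", "footers", "quotes"),
    )
```

Calling `load_split("train")` and `load_split("test")` returns objects whose `.data` holds the raw texts and whose `.target` holds integer labels indexing into `.target_names`.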

clean

Strip + tokenise

Drop headers, footers and quotes, lowercase, remove English stopwords — so the model reads prose, not metadata.
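As a rough illustration of the lowercase-and-filter step, here is a pure-Python sketch. The stopword set is a tiny stand-in and the token pattern is an assumption; a real run would use a full English stopword list. Header, footer and quote stripping is assumed to happen earlier, at load time.

```python
import re

# Tiny illustrative stopword set (an assumption, not the list used in the build).
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "it", "on", "for"}

# Runs of lowercase letters or digits count as tokens.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenise(text):
    """Lowercase the text, split it into alphanumeric tokens, drop stopwords."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOPWORDS]


print(tokenise("NASA launched a rocket to study the moon and planets"))
# → ['nasa', 'launched', 'rocket', 'study', 'moon', 'planets']
```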

vectorise

TF-IDF

Weight uni- and bi-grams by how distinctive they are to a document — interpretable and surprisingly strong.
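To make "distinctive" concrete, the sketch below runs scikit-learn's `TfidfVectorizer` over three invented sentences with uni- and bi-grams and English stopwords removed. The exact vectoriser settings used in the build are not stated, so everything beyond `ngram_range=(1, 2)` and `stop_words="english"` is left at library defaults as an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for illustration.
docs = [
    "the rocket reached orbit",
    "the engine needs oil",
    "the rocket engine ignited",
]

# Uni- and bi-grams, English stopwords removed; rows are L2-normalised by default.
vectoriser = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectoriser.fit_transform(docs)
vocab = vectoriser.vocabulary_  # maps each term to its column index

# In the first document, "orbit" occurs in one document overall while
# "rocket" occurs in two, so "orbit" receives the larger weight.
row0 = X[0].toarray().ravel()
print("orbit :", round(float(row0[vocab["orbit"]]), 3))
print("rocket:", round(float(row0[vocab["rocket"]]), 3))
```

Both terms appear once in the first document, so the difference in weight comes entirely from the inverse-document-frequency factor.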

model · 1

LogisticRegression

An L2-regularised linear baseline whose weights you can actually read.

model · 2

Naive Bayes, from scratch

The multinomial event model with Laplace smoothing, written in numpy in log space — no sklearn estimator under the hood.
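The description above maps onto a fairly small amount of numpy. The following is a sketch of a multinomial Naive Bayes with Laplace (add-alpha) smoothing computed in log space, not the code from this build: it assumes dense feature matrices for brevity, and the class name `ScratchMultinomialNB` is hypothetical.

```python
import numpy as np


class ScratchMultinomialNB:
    """Multinomial Naive Bayes with add-alpha smoothing, in log space."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # alpha=1.0 is Laplace smoothing

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)  # dense input assumed for simplicity
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        n_classes, n_features = len(self.classes_), X.shape[1]
        self.class_log_prior_ = np.empty(n_classes)
        self.feature_log_prob_ = np.empty((n_classes, n_features))
        for i, c in enumerate(self.classes_):
            rows = X[y == c]
            # log P(class): fraction of training documents in this class
            self.class_log_prior_[i] = np.log(rows.shape[0] / X.shape[0])
            # log P(term | class): smoothed share of the class's total weight
            counts = rows.sum(axis=0) + self.alpha
            self.feature_log_prob_[i] = np.log(counts) - np.log(counts.sum())
        return self

    def joint_log_likelihood(self, X):
        X = np.asarray(X, dtype=float)
        return X @ self.feature_log_prob_.T + self.class_log_prior_

    def predict_proba(self, X):
        jll = self.joint_log_likelihood(X)
        jll = jll - jll.max(axis=1, keepdims=True)  # stabilise before exp
        p = np.exp(jll)
        return p / p.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.joint_log_likelihood(X), axis=1)]


# Toy term-count matrix: 4 documents, 3 terms.
X_toy = np.array([[3, 0, 1], [2, 1, 0], [0, 3, 1], [0, 2, 2]])
y_toy = np.array(["space", "space", "autos", "autos"])

nb = ScratchMultinomialNB().fit(X_toy, y_toy)
print(nb.predict(np.array([[2, 0, 0]])))
```

Working with sums of logs instead of products of probabilities avoids underflow on long documents, and the smoothing term keeps a term unseen in one class from forcing that class's probability to zero.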

evaluate

Precision / recall / F1

Per-class metrics and a confusion matrix on the held-out test split, because accuracy alone can hide a class the model gets consistently wrong.
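A small numpy sketch of these metrics, using invented labels, shows how a respectable accuracy can coexist with a class that is never predicted correctly. The function names are illustrative and integer class ids are assumed.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def per_class_scores(cm):
    """Return precision, recall and F1 arrays, one entry per class."""
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual = cm.sum(axis=1)     # row sums: how often each class truly occurs
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1


# Invented labels: class 2 is never predicted correctly.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
precision, recall, f1 = per_class_scores(cm)
accuracy = np.trace(cm) / cm.sum()
macro_f1 = f1.mean()

print(cm)
print("accuracy:", round(float(accuracy), 3))  # → accuracy: 0.667
print("macro-F1:", round(float(macro_f1), 3))  # → macro-F1: 0.489
```

Macro-F1 averages the per-class F1 scores with equal weight, so the class scoring zero pulls it well below the accuracy figure.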

Architecture

How a document is classified

Every document follows the same represent-then-decide pipeline:

  1. Preprocess

    Tokenise and clean the raw text into a consistent form.

  2. Vectorise

    Turn the document into a fixed-length TF-IDF vector — uni- and bi-grams, stopwords removed.

  3. Train

    Fit a classifier on the labelled training vectors.

  4. Predict

    Assign a category to each unseen document.

  5. Evaluate

    Score per-class performance and read the confusion matrix for blind spots.

Results

Real numbers on the test split

Measured on the 1,935 held-out test documents, not the training data. The random baseline for five balanced classes is 20%. The two models train on identical TF-IDF features; on this slice the from-scratch Naive Bayes actually edges out logistic regression by about 2.5 points of accuracy (87.44% vs 84.91%), a reminder that the classic interpretable approach is hard to beat on bag-of-words text.

Model                                     Accuracy   Macro-F1
TF-IDF + LogisticRegression               84.91%     0.8506
Multinomial Naive Bayes (from scratch)    87.44%     0.8758
[Figure: confusion matrices for both models on the 20 Newsgroups test split]
Confusion matrices on the held-out test set. Strong diagonals; the main confusion is rec.autos ↔ the science groups, where vocabulary overlaps.

Reproduce: pip install -r requirements.txt then python train.py. Numbers above are written to results.json by that run; the test suite asserts they clear a sane floor.

Try it

Classify a piece of text

Train once, then point predict.py at any text — inline, a file, or piped over stdin. It prints the predicted category and the class probabilities.

# train both models, then classify new text
python train.py
echo "NASA launched a rocket to study the moon and planets" | python predict.py

# → Predicted category: sci.space
#    sci.space   99.67%  ##############################
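For readers curious how a probability line like the one above might be produced, here is a hypothetical formatter. The function names, column widths and the bar width of 30 are all assumptions for illustration; `predict.py` itself may do this differently.

```python
def format_row(label, prob, width=30):
    """Render one class as 'label  percent  bar', with the bar scaled to `width`."""
    bar = "#" * round(prob * width)
    return f"{label:<24}{prob * 100:6.2f}%  {bar}"


def render(probs):
    """Render all classes, most probable first. `probs` maps label to probability."""
    rows = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return "\n".join(format_row(label, p) for label, p in rows)


# Invented probabilities for illustration only.
print(render({"sci.space": 0.9967, "sci.med": 0.0021, "rec.autos": 0.0012}))
```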

Reflection

What building it taught me