Document Classifier — Built From Scratch

What it is

The foundational text task, done properly

Document classification sounds mundane until you build it: take free-form text of any length and assign it a label — spam or not, news topic, ticket department, sentiment. Almost every practical NLP system has this shape somewhere inside it, which is exactly why it's worth building carefully rather than importing.

This build takes a real corpus, a 5-category slice of 20 Newsgroups, turns each document into a TF-IDF vector, and trains two classifiers on identical features: a logistic-regression baseline and a Multinomial Naive Bayes implemented from scratch in numpy. Both are scored on a held-out test split so the comparison is honest, and headers, footers and quotes are stripped so the models learn from prose, not leaked metadata.

bag → vector

the core move: turning unbounded, unstructured text into a fixed-length vector a classifier can learn from.

comp.sys.mac.hardware · rec.autos · sci.med · sci.space · talk.politics.guns

The stack

From raw text to a label

Two models, identical features, one honest comparison.

data

20 Newsgroups

A real labelled corpus, 5 categories, split into 2,905 train / 1,935 test documents — the ground truth.

clean

Strip + tokenise

Drop headers, footers and quotes, lowercase, remove English stopwords — so the model reads prose, not metadata.

vectorise

TF-IDF

Weight uni- and bi-grams by how distinctive they are to a document — interpretable and surprisingly strong.

model · 1

LogisticRegression

An L2-regularised linear baseline whose weights you can actually read.

model · 2

Naive Bayes, from scratch

The multinomial event model with Laplace smoothing, written in numpy in log space — no sklearn estimator under the hood.

evaluate

Precision / recall / F1

Per-class metrics and a confusion matrix on the held-out test split — accuracy alone lies.

Architecture

How a document is classified

Every document follows the same represent-then-decide pipeline:

Preprocess
Tokenise and clean the raw text into a consistent form.
Vectorise
Turn the document into a fixed-length TF-IDF vector — uni- and bi-grams, stopwords removed.
Train
Fit a classifier on the labelled training vectors.
Predict
Assign a category to each unseen document.
Evaluate
Score per-class performance and read the confusion matrix for blind spots.
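
The vectorise step can be sketched in a few lines. This is a minimal illustration, not the project's own code: the tiny STOPWORDS set, the helper names (tokenize, fit_tfidf, transform) and the smoothed IDF formula are all assumptions chosen for the example, and a real run would more likely lean on a library vectoriser.

```python
import re
from collections import Counter

import numpy as np

# Tiny illustrative stopword list (assumption); a real run would use a fuller English list.
STOPWORDS = {"a", "an", "the", "to", "and", "of", "is", "in", "on"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, then append bigrams."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams


def fit_tfidf(docs):
    """Learn the vocabulary and IDF weights from the training documents only."""
    tokenized = [tokenize(d) for d in docs]
    vocab = {t: i for i, t in enumerate(sorted({t for toks in tokenized for t in toks}))}
    df = np.zeros(len(vocab))
    for toks in tokenized:
        for t in set(toks):
            df[vocab[t]] += 1
    # Smoothed IDF (assumed form): log((1 + n) / (1 + df)) + 1
    idf = np.log((1 + len(docs)) / (1 + df)) + 1
    return vocab, idf


def transform(docs, vocab, idf):
    """Map documents to L2-normalised TF-IDF rows of fixed length len(vocab)."""
    X = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for term, count in Counter(tokenize(doc)).items():
            if term in vocab:  # terms unseen in training are ignored
                X[row, vocab[term]] = count * idf[vocab[term]]
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)


train_docs = ["The rocket reached orbit", "The engine of the car stalled"]
vocab, idf = fit_tfidf(train_docs)
X_new = transform(["A rocket engine"], vocab, idf)
print(X_new.shape)  # one row, one column per vocabulary term
```

Whatever the length of the input text, each row has the same width, which is what lets the same classifier handle a one-line query and a long post.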

Results

Real numbers on the test split

Measured on the 1,935 held-out test documents — not the training data. The random baseline for five balanced classes is 20%. The two models train on identical TF-IDF features; on this slice the from-scratch Naive Bayes actually edges out logistic regression, a reminder that the classic interpretable approach is hard to beat on bag-of-words text.

Model	Accuracy	Macro-F1
TF-IDF + LogisticRegression	84.91%	0.8506
Multinomial Naive Bayes (from scratch)	87.44%	0.8758

Confusion matrices for both models on the held-out 20 Newsgroups test split. Strong diagonals; the main confusion is `rec.autos` ↔ the science groups, where vocabulary overlaps.
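
For readers who want to see how the per-class numbers fall out of a confusion matrix, here is a small numpy sketch. The function names and the toy labels are made up for illustration and are not taken from train.py; the toy case is deliberately lopsided to show how accuracy and macro-F1 can disagree.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts documents of true class i that were predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm


def per_class_scores(cm):
    """Return precision, recall and F1 per class, plus accuracy and macro-F1."""
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual = cm.sum(axis=1)     # row sums: how many documents each class really has
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1, tp.sum() / cm.sum(), f1.mean()


# Toy example: a model that always predicts class 0 on an 8-vs-2 split.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.zeros(10, dtype=int)

cm = confusion_matrix(y_true, y_pred, n_classes=2)
precision, recall, f1, accuracy, macro_f1 = per_class_scores(cm)
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f}")
# → accuracy=0.80 macro_f1=0.44
```

Macro-F1 averages the per-class F1 scores with equal weight, so a class the model never predicts drags the score down even when overall accuracy looks fine.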

Reproduce: pip install -r requirements.txt then python train.py. Numbers above are written to results.json by that run; the test suite asserts they clear a sane floor.

Try it

Classify a piece of text

Train once, then point predict.py at any text — inline, a file, or piped over stdin. It prints the predicted category and the class probabilities.

# train both models, then classify new text
python train.py
echo "NASA launched a rocket to study the moon and planets" | python predict.py

# → Predicted category: sci.space
#    sci.space   99.67%  ##############################
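
If the printed percentages come from the Naive Bayes model, its raw outputs are log scores that are far too negative to exponentiate directly. One common way to turn them into probabilities, sketched here with a hypothetical helper and made-up scores rather than anything taken from predict.py, is to subtract the maximum before exponentiating:

```python
import numpy as np


def log_scores_to_probs(log_scores):
    """Normalise per-class log scores into probabilities that sum to 1."""
    log_scores = np.asarray(log_scores, dtype=float)
    # Shift so the largest score is 0; exp() then cannot underflow for the top class.
    shifted = log_scores - log_scores.max()
    probs = np.exp(shifted)
    return probs / probs.sum()


# Hypothetical log joint scores for one document over three classes.
scores = [-1050.0, -1044.3, -1061.2]
probs = log_scores_to_probs(scores)
print(int(probs.argmax()), probs.round(4))
```

Exponentiating the raw scores would give 0.0 for every class here, so the shift is what keeps the division meaningful.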

Reflection

What building it taught me

Naive Bayes is no toy. Written carefully in log space with Laplace smoothing, it actually beat logistic regression here — the classic bag-of-words approach is genuinely competitive.
Implementing the maths clarifies it. Coding the multinomial likelihood by hand, then checking it matches sklearn to within 1e-8, taught me more than any tutorial would have.
Metadata leaks. Leaving the newsgroup headers in inflates accuracy for free; stripping headers, footers and quotes is what makes the numbers honest.
Accuracy is a trap under class imbalance. A classifier can score high overall and still be useless on one class; per-class F1 and the confusion matrix keep you honest.
Master this, generalise everywhere. Spam, topics, sentiment, routing — once the pipeline is yours, the task just changes labels.
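
For reference, the log-space, Laplace-smoothed multinomial model described above fits in about twenty lines of numpy. This is a sketch under stated assumptions, not the project's actual implementation: the class name ScratchMultinomialNB, its attribute names and the toy count matrix are all invented for the example.

```python
import numpy as np


class ScratchMultinomialNB:
    """Multinomial Naive Bayes with additive (Laplace) smoothing, computed in log space."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # alpha=1.0 is classic Laplace smoothing

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # One-hot class membership, shape (n_docs, n_classes)
        Y = (y[:, None] == self.classes_[None, :]).astype(float)
        # Log prior: fraction of training documents in each class
        self.class_log_prior_ = np.log(Y.sum(axis=0) / len(y))
        # Per-class feature totals, smoothed so no feature has zero probability
        counts = Y.T @ X + self.alpha
        self.feature_log_prob_ = np.log(counts) - np.log(counts.sum(axis=1, keepdims=True))
        return self

    def joint_log_likelihood(self, X):
        """log P(class) + sum over features of x_j * log P(feature_j | class)."""
        return np.asarray(X, dtype=float) @ self.feature_log_prob_.T + self.class_log_prior_

    def predict(self, X):
        return self.classes_[self.joint_log_likelihood(X).argmax(axis=1)]


# Toy feature matrix: columns might stand for "orbit", "engine", "new".
X_train = [[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 0]]
y_train = ["space", "space", "autos", "autos"]

model = ScratchMultinomialNB().fit(X_train, y_train)
print(model.predict([[1, 0, 0], [0, 2, 1]]))
```

The same arithmetic accepts fractional TF-IDF weights in place of raw counts, since the features only ever enter through sums and a matrix product.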