From-Scratch Build · Natural Language Processing
Your phone buzzes forty times a day and maybe two of them actually matter. This Python build reads the content of each notification, scores it on a continuous priority scale, and returns a ranked inbox — so the security alert rises and the marketing blast sinks. A small, honest NLP problem with an outsized effect on attention.
What it is
Notifications arrive in a flat, undifferentiated stream — a flight delay sits next to a game invite sits next to a two-factor code. The system here reads what each one actually says and assigns it a priority, so the stream can be reordered by importance instead of arrival time. It's triage: not deleting anything, just deciding what deserves your attention first.
The interesting part is that priority lives in the language. "Your account was accessed from a new device" and "New devices are on sale" share words but not urgency. Telling them apart needs a semantic representation, not keyword spotting — which is exactly what makes it a good study.
The stack
Encode the message, score it, order the feed — all in scikit-learn.
A hand-built seed set tagged high / medium / low (20 / 20 / 25), committed under data/.
Uni/bi-gram TF-IDF compressed by truncated SVD into a ~40-dim semantic vector — Latent Semantic Analysis.
Interpretable cues: urgency-lexicon hits, ALL-CAPS ratio, codes, money, deadline mentions, sender weight.
Embedding + signals concatenated, mapped by logistic regression to P(low/medium/high). Trains in under a second on CPU.
Probabilities collapse to one priority score in [0,1]; the inbox sorts by it, with ties broken by arrival index, so the ranking is a strict total order.
Cosine similarity in LSA space to a user's important topics nudges relevant messages up — semantic, not keyword.
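Of the components above, the interpretable cues are the easiest to show in isolation. The following is a minimal sketch, not the project's code: the lexicon, the sender weights, the regular expressions, and the function name lexical_signals are all assumptions made for illustration.

```python
import re

# Hypothetical urgency lexicon and sender weights; the project's real lists are not shown.
URGENCY_WORDS = {"urgent", "immediately", "alert", "failed", "down", "expires", "verify"}
SENDER_WEIGHT = {"bank": 1.0, "pagerduty": 1.0, "calendar": 0.6, "shop": 0.1}
DEFAULT_SENDER_WEIGHT = 0.3  # assumed fallback for unknown senders

CODE_RE = re.compile(r"\b\d{4,8}\b")                 # standalone 4-8 digit codes
MONEY_RE = re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d+)?")  # currency symbol followed by an amount
DEADLINE_RE = re.compile(
    r"\b(?:today|tonight|tomorrow|by \d{1,2}(?::\d{2})?\s?(?:am|pm)"
    r"|in \d+ (?:minutes?|hours?))\b",
    re.IGNORECASE,
)


def lexical_signals(text, sender):
    """Return six inspectable features for one notification."""
    tokens = re.findall(r"[A-Za-z']+", text)
    n_tokens = max(len(tokens), 1)  # avoid division by zero on empty text
    urgency_hits = sum(t.lower() in URGENCY_WORDS for t in tokens)
    caps_tokens = sum(1 for t in tokens if len(t) > 1 and t.isupper())
    return [
        float(urgency_hits),                   # urgency-lexicon hits
        caps_tokens / n_tokens,                # ALL-CAPS ratio
        float(bool(CODE_RE.search(text))),     # contains a code
        float(bool(MONEY_RE.search(text))),    # mentions money
        float(bool(DEADLINE_RE.search(text))),  # mentions a deadline
        SENDER_WEIGHT.get(sender.lower(), DEFAULT_SENDER_WEIGHT),
    ]


print(lexical_signals("URGENT: verify your login, code 482913 expires in 10 minutes", "Bank"))
# [3.0, 0.125, 1.0, 0.0, 1.0, 1.0]
```

Each position in the returned list has a fixed meaning, which is what keeps the layer inspectable once it is concatenated with the LSA vector.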
Architecture
Each incoming message runs the same read-score-place pipeline:
Capture the notification's text and sender as it arrives.
TF-IDF (1,2-grams) then truncated SVD into a ~40-dim LSA vector.
Add the lexical layer — urgency hits, caps, codes, money, deadlines, sender weight — and concatenate.
Logistic regression outputs class probabilities, collapsed to one priority score in [0,1].
Sort the feed by score into a strict total order, ties broken by arrival index.
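The last two steps can be sketched as follows. The write-up does not say how the class probabilities collapse to one number, so this sketch assumes an expected-value weighting (low = 0, medium = 0.5, high = 1). The additive personal nudge, its strength, and the names priority_scores, nudge, and rank are likewise assumptions rather than the project's actual code.

```python
import numpy as np

CLASS_VALUE = {"low": 0.0, "medium": 0.5, "high": 1.0}  # assumed collapse weights
NUDGE_STRENGTH = 0.1                                     # assumed size of the personal nudge


def priority_scores(proba, classes):
    """Collapse class probabilities to one score per message, in [0, 1]."""
    weights = np.array([CLASS_VALUE[c] for c in classes])
    return proba @ weights


def nudge(scores, vectors, topic, strength=NUDGE_STRENGTH):
    """Raise scores of messages whose vectors point toward a user topic vector."""
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(topic) + 1e-12
    cosine = (vectors @ topic) / norms
    return np.clip(scores + strength * np.maximum(cosine, 0.0), 0.0, 1.0)


def rank(scores):
    """Indices sorted by score descending; ties go to the earlier arrival."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))


# Column order follows scikit-learn's sorted class labels.
classes = ["high", "low", "medium"]
proba = np.array([
    [0.10, 0.70, 0.20],  # message 0
    [0.80, 0.05, 0.15],  # message 1
    [0.10, 0.70, 0.20],  # message 2, same probabilities as message 0
])
scores = priority_scores(proba, classes)
print(rank(scores))  # [1, 0, 2]: the tie between 0 and 2 goes to the earlier arrival

# Toy 2-dim stand-ins for LSA vectors; messages 1 and 2 align with the user's topic.
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
topic = np.array([0.0, 1.0])
print(rank(nudge(scores, vectors, topic)))  # [1, 2, 0]
```

Because the assumed weights lie in [0, 1] and each row of probabilities sums to 1, the collapsed score cannot leave [0, 1]. Sorting on the pair (negative score, arrival index) is what makes the order strict, since no two messages share an arrival index.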
Real output
python demo.py feeds eight notifications — a security alert, a password reset, a newsletter, a social like, a calendar reminder, a production-down page, a sale, a delivery — and prints the ranked inbox. This is the actual output:
Honest by design
There's no large language model here. "Semantic analysis" means TF-IDF + Latent Semantic Analysis (truncated SVD over the term-document matrix) for a dense embedding, plus a small lexical urgency layer — both fed to a logistic-regression classifier in scikit-learn.
That's enough to separate "your account was accessed from a new device" from "new devices are on sale". It trains in under a second, runs entirely on CPU, and every feature is inspectable. The whole thing is small on purpose.
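For a sense of scale, here is a minimal sketch of the LSA branch and classifier in scikit-learn. The nine messages and their labels are invented stand-ins for the seed set, n_components is cut to 4 because the toy corpus is tiny (the write-up uses about 40), and the lexical features are left out for brevity. On this little data the printed probabilities are illustrative only.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the seed set; texts and labels are made up for this example.
texts = [
    "Your account was accessed from a new device",
    "Production is down, database not responding",
    "Suspicious login attempt blocked on your account",
    "Meeting with the design team starts in 15 minutes",
    "Your package is out for delivery today",
    "Your password reset link is ready",
    "New devices are on sale this weekend",
    "Someone liked your photo",
    "Weekly newsletter: ten tips for better sleep",
]
labels = ["high"] * 3 + ["medium"] * 3 + ["low"] * 3

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),           # unigram and bigram TF-IDF
    TruncatedSVD(n_components=4, random_state=0),  # LSA: dense low-dimensional vectors
    LogisticRegression(max_iter=1000),             # maps vectors to class probabilities
)
model.fit(texts, labels)

queries = [
    "Your account was accessed from a new device",
    "New devices are on sale",
]
proba = model.predict_proba(queries)
class_names = list(model[-1].classes_)  # sorted labels: ['high', 'low', 'medium']

for text, row in zip(queries, proba):
    summary = ", ".join(f"{c}={p:.2f}" for c, p in zip(class_names, row))
    print(f"{text!r}: {summary}")
```

The columns of predict_proba follow classes_, which scikit-learn sorts alphabetically, so any later step that collapses probabilities to a score should look columns up by name rather than assume a low, medium, high order.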
Reflection