From-Scratch Build · Natural Language Processing
Why does one short video explode and a near-identical one sink? Part of the answer is hiding in the caption — not the hashtags, the meaning. This build predicts engagement from the semantics of the text, turning captions into embeddings and learning what kinds of meaning travel. I rebuilt it from scratch to see how far language alone can explain virality.
What it is
Most engagement models lean on surface signals: follower count, posting time, hashtag popularity. This one asks a harder question: can the semantic content of a caption, on its own, predict how a video performs? To answer it cleanly, every caption is mapped into a 64-dimensional latent semantic analysis (LSA) embedding where similar meanings sit close together, and a Ridge model learns the relationship between that meaning-space and an engagement index. Everything is measured on a controlled synthetic dataset, so the answer is verifiable.
It's a clean way to study a messy phenomenon. Because engagement in that dataset is planted to depend on topic and sentiment, not length or hashtag count, you can measure exactly how much the meaning explains and show that a metadata baseline can't keep up.
The stack
Embeddings turn language into something a model can actually learn from.
1,200 seeded captions whose engagement is planted to depend on topic and sentiment — known ground truth, clearly labelled synthetic.
TF-IDF over 1–2 grams then truncated SVD to 64 dimensions, L2-normalised. Fully local, fast, deterministic — no model download.
The standardised embedding feeds a Ridge model predicting a 0–100 engagement index. GradientBoosting is supported too.
The same regressor on character length, word count, hashtag count and emoji count: the surface signals a semantic model is meant to beat.
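Both models are scored on the same 25% test split, with the embedder fit on the training captions only, so no test text leaks into the features. Held-out R² is ≈0.69 for the semantic model vs ≈0.00 for the metadata baseline.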
11 tests: semantic beats baseline, embedding shapes/dims, and deterministic results under a fixed seed.
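To make the embedding stage concrete, here is a minimal sketch of TF-IDF over 1–2 grams, truncated SVD, then L2 normalisation. It assumes scikit-learn, which the write-up does not name explicitly; `make_embedder` and the toy captions are hypothetical, and the component count is shrunk to 3 only because the toy vocabulary is tiny.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer


def make_embedder(n_components=64, seed=0):
    """TF-IDF over 1-2 grams -> truncated SVD -> L2 normalisation."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        TruncatedSVD(n_components=n_components, random_state=seed),
        Normalizer(norm="l2"),
    )


# Toy captions, invented for illustration only.
train_captions = [
    "easy pasta recipe for dinner",
    "spicy pasta dinner tonight",
    "baking bread in my kitchen",
    "new laptop review and setup",
    "phone camera review today",
    "coding setup on my laptop",
]

# 3 components only because this corpus is tiny; the build uses 64.
embedder = make_embedder(n_components=3)
train_vecs = embedder.fit_transform(train_captions)      # fit on training text only
new_vecs = embedder.transform(["pasta dinner recipe"])   # unseen text: transform only

print(train_vecs.shape)                     # (6, 3)
print(np.linalg.norm(train_vecs, axis=1))   # every row has unit length
```

Fixing `random_state` is what makes the randomized SVD solver repeatable, which is the property a determinism test would check: two embedders built with the same seed and fit on the same captions should return matching vectors.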
Architecture
Every caption travels the same path from text to number:
Build seeded synthetic captions whose engagement is planted to depend on topic and sentiment.
Hold out 25% for testing; the embedder only ever sees the training captions.
Encode each caption into a 64-dim LSA vector (TF-IDF + truncated SVD).
Train a Ridge regressor to predict the engagement index from the embedding.
Score against a metadata-only baseline on the same split and report held-out R² honestly.
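The five steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the build's actual code: it assumes scikit-learn, and the vocabulary, effect sizes, noise level and helper names (`make_dataset`, `metadata_features`) are all invented here, so the scores it prints will not match the figures reported above.

```python
import random

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical vocabulary and planted effect sizes, invented for this sketch.
TOPICS = {
    "food": (10, "pasta recipe baking dinner kitchen spicy".split()),
    "travel": (25, "beach flight sunset hotel island hiking".split()),
    "fitness": (-5, "workout gym running yoga squat protein".split()),
    "tech": (-20, "laptop phone coding gadget camera review".split()),
}
SENTIMENT = {
    15: "love amazing best happy".split(),
    -15: "boring worst awful sad".split(),
}
FILLER = "the my this today just so really new".split()
HASHTAGS = ["#fyp", "#viral", "#foryou"]
EMOJI = ["🔥", "😂", "✨"]


def make_dataset(n=1200, seed=0):
    """Step 1: captions whose engagement depends only on topic and sentiment."""
    rng = random.Random(seed)
    captions, scores = [], []
    for _ in range(n):
        topic_effect, topic_words = TOPICS[rng.choice(list(TOPICS))]
        mood_effect = rng.choice(list(SENTIMENT))
        words = (
            rng.choices(FILLER, k=rng.randint(2, 5))       # length is pure noise
            + rng.sample(topic_words, 2)
            + [rng.choice(SENTIMENT[mood_effect])]
            + rng.choices(HASHTAGS, k=rng.randint(0, 3))   # hashtag count is noise
            + rng.choices(EMOJI, k=rng.randint(0, 2))      # emoji count is noise
        )
        captions.append(" ".join(words))
        score = 50 + topic_effect + mood_effect + rng.gauss(0, 8)
        scores.append(min(100.0, max(0.0, score)))         # keep the index in 0-100
    return captions, np.array(scores)


def metadata_features(captions):
    """Surface signals: characters, words, hashtag count, emoji count."""
    return np.array(
        [[len(c), len(c.split()), c.count("#"), sum(ch in EMOJI for ch in c)]
         for c in captions],
        dtype=float,
    )


captions, y = make_dataset()

# Step 2: hold out 25% for testing.
train_txt, test_txt, y_train, y_test = train_test_split(
    captions, y, test_size=0.25, random_state=0
)

# Steps 3-4: LSA embedding, standardisation, Ridge, all in one pipeline.
semantic = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    TruncatedSVD(n_components=64, random_state=0),
    Normalizer(norm="l2"),
    StandardScaler(),
    Ridge(alpha=1.0),
)
baseline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

semantic.fit(train_txt, y_train)   # every text stage sees training captions only
baseline.fit(metadata_features(train_txt), y_train)

# Step 5: score both models on the same held-out split.
r2_semantic = r2_score(y_test, semantic.predict(test_txt))
r2_baseline = r2_score(y_test, baseline.predict(metadata_features(test_txt)))
print(f"semantic R2={r2_semantic:.2f}  baseline R2={r2_baseline:.2f}")
```

Wrapping the vectoriser, SVD and scaler in one pipeline is a design choice that keeps the leakage guarantee automatic: calling `fit` on the training captions fits every stage on that text alone, and `predict` only transforms the test captions. Swapping `Ridge` for scikit-learn's `GradientBoostingRegressor` in the final step would be one way to try the boosted variant mentioned in the stack.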
Reflection