Semantic Engagement — Built From Scratch

What it is

Predicting reach from meaning

Most engagement models lean on surface signals — follower count, posting time, hashtag popularity. This one asks a harder question: can the semantic content of a caption, on its own, predict how a video performs? To answer it cleanly, every caption is mapped into a 64-dimensional LSA embedding where similar meanings sit close together, and a Ridge model learns the relationship between that meaning-space and an engagement index — measured on a controlled synthetic dataset so the answer is verifiable.

It's a clean way to study a messy phenomenon. Because engagement in the synthetic dataset is planted to depend on topic and sentiment (not length or hashtag count), you can measure exactly how much the meaning explains and show that a metadata baseline can't keep up.

0.69

held-out R² from the semantic model, versus ≈0.00 for a metadata-only baseline (length, word count, #hashtags, #emoji) on the same split. The signal really does live in the meaning.

The stack

From caption to engagement estimate

Embeddings turn language into something a model can actually learn from.

data

Synthetic captions

1,200 seeded captions whose engagement is planted to depend on topic and sentiment — known ground truth, clearly labelled synthetic.
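
As a rough illustration of what "planted" means here, the sketch below builds seeded captions whose score depends only on a topic effect and a sentiment effect plus noise. The vocabularies, effect sizes, noise level and the name make_captions are all assumptions made for this example; the project's actual generator is not shown in this write-up.

```python
import random

# Hypothetical vocabularies and planted effects, chosen for illustration only.
TOPICS = {
    "cooking": (["pasta", "bread", "sauce", "spice"], 20.0),
    "fitness": (["squat", "bench", "sweat", "plank"], 10.0),
    "travel": (["beach", "hotel", "coast", "trail"], 0.0),
}
SENTIMENTS = {
    "positive": (["great", "happy", "loved"], 15.0),
    "negative": (["awful", "worst", "bored"], -15.0),
}


def make_captions(n=1200, seed=0):
    """Return (caption, engagement) pairs with engagement planted on meaning."""
    rng = random.Random(seed)  # seeded, so the dataset is reproducible
    rows = []
    for _ in range(n):
        topic_words, topic_effect = TOPICS[rng.choice(sorted(TOPICS))]
        mood_words, mood_effect = SENTIMENTS[rng.choice(sorted(SENTIMENTS))]
        words = rng.sample(topic_words, 2) + [rng.choice(mood_words)]
        # Hashtags and emoji are drawn independently of topic and sentiment,
        # so surface counts carry no information about engagement.
        words += ["#fyp"] * rng.randint(0, 3) + ["\U0001F525"] * rng.randint(0, 2)
        score = 50.0 + topic_effect + mood_effect + rng.gauss(0.0, 5.0)
        rows.append((" ".join(words), min(100.0, max(0.0, score))))
    return rows
```

Every word in these toy vocabularies has the same length, so caption length varies only with the independently sampled hashtags and emoji. That is one simple way to keep length from leaking the planted signal.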

represent

LSA embeddings

TF-IDF over 1–2 grams then truncated SVD to 64 dimensions, L2-normalised. Fully local, fast, deterministic — no model download.
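
The recipe above maps fairly directly onto scikit-learn. The following is a minimal sketch, assuming scikit-learn is the library in use (the write-up does not name it); make_lsa_embedder and the sample captions are hypothetical.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer


def make_lsa_embedder(n_components=64, seed=0):
    """TF-IDF over 1-2 grams, truncated SVD, then L2 normalisation."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        TruncatedSVD(n_components=n_components, random_state=seed),
        Normalizer(norm="l2"),
    )


captions = [
    "pasta recipe tonight",
    "easy pasta recipe",
    "gym workout today",
    "morning gym workout",
    "beach sunset trip",
    "sunset beach walk",
]
# 64 dimensions in the write-up; 3 is used here only because the corpus is tiny.
embedder = make_lsa_embedder(n_components=3)
vectors = embedder.fit_transform(captions)
print(vectors.shape)                    # (6, 3)
print(np.linalg.norm(vectors, axis=1))  # each row has unit length
```

Fixing random_state is what keeps the randomized SVD solver deterministic from run to run.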

model

Ridge regressor

The standardised embedding feeds a Ridge model predicting a 0–100 engagement index. GradientBoosting is supported too.
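
One way to wire up the regressor with its GradientBoosting alternative is sketched below, again assuming scikit-learn. The factory name make_regressor, the kind switch and the alpha value are illustrative assumptions rather than the project's settings, and the random matrix merely stands in for real 64-dimensional embeddings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_regressor(kind="ridge", seed=0):
    """Standardise the embedding, then fit the chosen regressor."""
    if kind == "ridge":
        model = Ridge(alpha=1.0)  # assumed alpha, not a tuned value
    elif kind == "gbr":
        model = GradientBoostingRegressor(random_state=seed)
    else:
        raise ValueError(f"unknown regressor kind: {kind!r}")
    return make_pipeline(StandardScaler(), model)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))  # stand-in for 64-dim caption embeddings
y = 50 + 10 * X[:, 0] + rng.normal(scale=1.0, size=200)
model = make_regressor("ridge").fit(X, y)
print(round(model.score(X, y), 3))  # training R² on this toy data
```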

baseline

Metadata-only

The same regressor on length, word count, #hashtags and #emoji — the surface signals a semantic model is meant to beat.
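
For concreteness, the four surface features could be extracted along these lines. This is a sketch with a hypothetical function name, and the emoji check is a deliberately rough heuristic over two Unicode ranges, not a complete emoji detector.

```python
import re


def count_emoji(text):
    """Rough heuristic: count code points in two common emoji/symbol ranges."""
    return sum(
        1
        for ch in text
        if 0x1F300 <= ord(ch) <= 0x1FAFF or 0x2600 <= ord(ch) <= 0x27BF
    )


def metadata_features(caption):
    """Return [character length, word count, hashtag count, emoji count]."""
    return [
        len(caption),
        len(caption.split()),
        len(re.findall(r"#\w+", caption)),
        count_emoji(caption),
    ]


print(metadata_features("great pasta #food #yum \U0001F525"))  # [24, 5, 2, 1]
```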

evaluate

Held-out R² & r

Both models scored on the same 25% test split, embedder fit on train only — no leakage. Semantic R²≈0.69 vs ≈0.00.
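
To make the two reported metrics concrete, here is a dependency-free sketch of R² and Pearson r. The function names and the toy numbers are illustrative. It shows why a baseline at R²≈0.00 is doing no better than always predicting the mean, and why r alone can flatter a model whose predictions are correlated but badly scaled.

```python
import math


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot: the share of variance explained by the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson_r(y_true, y_pred):
    """Linear correlation; NaN if either series is constant."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        return float("nan")
    return cov / math.sqrt(var_t * var_p)


actual = [40.0, 55.0, 70.0, 85.0]
mean_guess = [62.5] * 4                # always predict the mean
compressed = [50.0, 55.0, 60.0, 65.0]  # right ordering, wrong scale

print(r_squared(actual, mean_guess))            # 0.0
print(round(r_squared(actual, compressed), 3))  # 0.467
print(round(pearson_r(actual, compressed), 3))  # 1.0
```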

test

pytest suite

11 tests: semantic beats baseline, embedding shapes/dims, and deterministic results under a fixed seed.

Architecture

How a prediction is made

Every caption travels the same path from text to number:

Generate
Build seeded synthetic captions whose engagement is planted to depend on topic and sentiment.
Split
Hold out 25% for testing; the embedder only ever sees the training captions.
Embed
Encode each caption into a 64-dim LSA vector (TF-IDF + truncated SVD).
Fit
Train a Ridge regressor to predict the engagement index from the embedding.
Compare
Score against a metadata-only baseline on the same split and report held-out R² honestly.
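
The five steps above can be sketched end to end as follows. This is a toy reconstruction under stated assumptions, not the project's code: scikit-learn is assumed, the vocabularies and planted effects are invented, and the function names are hypothetical. The numbers it produces will differ from the figures reported here.

```python
import re

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, normalize

# Hypothetical vocabularies and planted effects (illustrative values only).
TOPICS = [("pasta bread sauce spice", 20.0), ("squat bench sweat plank", 10.0),
          ("beach hotel coast trail", 0.0)]
MOODS = [("great happy loved", 15.0), ("awful worst bored", -15.0)]


def generate(n=1200, seed=0):
    """Generate: seeded captions with engagement planted on topic and mood."""
    rng = np.random.default_rng(seed)
    captions, scores = [], []
    for _ in range(n):
        t_words, t_eff = TOPICS[rng.integers(len(TOPICS))]
        m_words, m_eff = MOODS[rng.integers(len(MOODS))]
        words = list(rng.choice(t_words.split(), size=2, replace=False))
        words.append(rng.choice(m_words.split()))
        # Hashtags and emoji are sampled independently of the planted signal.
        words += ["#fyp"] * int(rng.integers(0, 4))
        words += ["\U0001F525"] * int(rng.integers(0, 3))
        captions.append(" ".join(words))
        scores.append(np.clip(50 + t_eff + m_eff + rng.normal(0, 5), 0, 100))
    return captions, np.array(scores)


def metadata(captions):
    """Length, word count, hashtag count and a rough emoji count."""
    return np.array([[len(c), len(c.split()), len(re.findall(r"#\w+", c)),
                      sum(ord(ch) >= 0x1F300 for ch in c)] for c in captions],
                    dtype=float)


def run(seed=0):
    captions, y = generate(seed=seed)
    # Split: hold out 25% before anything is fitted.
    train_c, test_c, y_train, y_test = train_test_split(
        captions, y, test_size=0.25, random_state=seed)

    # Embed: TF-IDF and SVD see the training captions only.
    tfidf = TfidfVectorizer(ngram_range=(1, 2))
    X_train = tfidf.fit_transform(train_c)
    k = min(64, X_train.shape[1] - 1)  # guard for the small toy vocabulary
    svd = TruncatedSVD(n_components=k, random_state=seed)
    E_train = normalize(svd.fit_transform(X_train))
    E_test = normalize(svd.transform(tfidf.transform(test_c)))

    # Fit: the same standardise-then-Ridge model for both feature sets.
    semantic = make_pipeline(StandardScaler(), Ridge()).fit(E_train, y_train)
    baseline = make_pipeline(StandardScaler(), Ridge()).fit(
        metadata(train_c), y_train)

    # Compare: held-out R² on the identical test split.
    return (r2_score(y_test, semantic.predict(E_test)),
            r2_score(y_test, baseline.predict(metadata(test_c))))


semantic_r2, baseline_r2 = run(seed=0)
print(f"semantic R2={semantic_r2:.2f}  metadata R2={baseline_r2:.2f}")
```

Fitting the vectoriser and SVD inside run, after the split, is what enforces the no-leakage rule: the test captions are only ever passed through transform.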

Reflection

What rebuilding it taught me

Embeddings unlock text. Raw captions are unusable to a regressor; an LSA embedding turns word usage into geometry a Ridge model can fit — here to R²≈0.69.
A fair baseline is the whole point. The result only means something because the metadata baseline (≈0.00) was given an honest shot on the same split.
Design the data so the truth is knowable. Planting engagement on topic and sentiment, while sampling hashtags and emoji independently, is what makes "meaning beats metadata" testable rather than assumed.
LSA is honest, not magic. It captures topical similarity, not sarcasm or word order — a transformer could slot in as the embedder, but the default is local and reproducible.
Correlation ≠ causation, loudly. A theme that predicts engagement isn't a recipe to cause it — easy to forget, important to say.