From-Scratch Build · Natural Language Processing

Semantic Engagement

Why does one short video explode and a near-identical one sink? Part of the answer is hiding in the caption — not the hashtags, the meaning. This build predicts engagement from the semantics of the text, turning captions into embeddings and learning what kinds of meaning travel. I rebuilt it from scratch to see how far language alone can explain virality.

Python · LSA embeddings · scikit-learn Ridge regression · Held-out eval · NLP

What it is

Predicting reach from meaning

Most engagement models lean on surface signals: follower count, posting time, hashtag popularity. This one asks a harder question: can the semantic content of a caption, on its own, predict how a video performs? Every caption is mapped into a 64-dimensional LSA embedding where similar meanings sit close together, and a Ridge model learns the relationship between that meaning-space and a 0–100 engagement index.

It's a clean way to study a messy phenomenon. The dataset is synthetic and controlled: engagement is planted to depend on topic and sentiment, not length or hashtag count. Because the ground truth is known, you can measure exactly how much the meaning explains and show that a metadata baseline can't keep up.

0.69
held-out R² from the semantic model, versus ≈0.00 for a metadata-only baseline (length, word count, #hashtags, #emoji) on the same split. The signal really does live in the meaning.

The stack

From caption to engagement estimate

Embeddings turn language into something a model can actually learn from.

data

Synthetic captions

1,200 seeded captions whose engagement is planted to depend on topic and sentiment — known ground truth, clearly labelled synthetic.
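A minimal sketch of how such a dataset could be generated, using only the standard library. The topic names, word lists, base scores and noise level below are hypothetical stand-ins (the build's actual vocabularies are not shown here); the point is only that engagement is driven by topic and sentiment while caption length and hashtag count are drawn independently.

```python
import random

# Hypothetical vocabularies and planted effects; not the build's real lists.
TOPICS = {
    "cooking": (["recipe", "pasta", "kitchen", "bake"], 70.0),
    "fitness": (["workout", "gym", "run", "stretch"], 55.0),
    "travel": (["beach", "flight", "hostel", "sunset"], 40.0),
    "finance": (["budget", "invoice", "tax", "savings"], 25.0),
}
SENTIMENTS = {
    "positive": (["love", "amazing", "best"], 10.0),
    "negative": (["hate", "awful", "worst"], -10.0),
}
FILLER = ["today", "just", "really", "this", "my", "so", "again"]


def make_captions(n=1200, seed=0):
    """Return n synthetic rows; the same seed always yields the same rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        topic = rng.choice(sorted(TOPICS))
        sentiment = rng.choice(sorted(SENTIMENTS))
        topic_words, base = TOPICS[topic]
        sentiment_words, shift = SENTIMENTS[sentiment]

        # Meaning-bearing words: two from the topic, one from the sentiment.
        words = rng.choices(topic_words, k=2) + rng.choices(sentiment_words, k=1)
        # Length and hashtag count are random and unrelated to engagement.
        words += rng.choices(FILLER, k=rng.randint(1, 8))
        rng.shuffle(words)
        hashtags = ["#fyp"] * rng.randint(0, 3)

        # Planted ground truth: topic base + sentiment shift + Gaussian noise,
        # clipped to the 0-100 index range.
        engagement = base + shift + rng.gauss(0.0, 5.0)
        engagement = min(100.0, max(0.0, engagement))
        rows.append({
            "caption": " ".join(words + hashtags),
            "topic": topic,
            "sentiment": sentiment,
            "engagement": engagement,
        })
    return rows


rows = make_captions(n=5, seed=42)
for row in rows:
    print(round(row["engagement"], 1), row["caption"])
```

Seeding a private `random.Random` instance, rather than the global generator, keeps the dataset reproducible even if other code draws random numbers in between.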

represent

LSA embeddings

TF-IDF over 1–2 grams then truncated SVD to 64 dimensions, L2-normalised. Fully local, fast, deterministic — no model download.
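The recipe above maps directly onto three scikit-learn transformers. The sketch below is one plausible wiring, not necessarily the build's exact code: the function name `build_lsa_embedder` and the toy captions are illustrative, and the demo uses 3 components because a six-caption corpus is far too small for 64.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer


def build_lsa_embedder(n_components=64, seed=0):
    """TF-IDF over 1-2 grams -> truncated SVD -> L2-normalised rows."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        # A fixed random_state keeps the randomized SVD deterministic.
        TruncatedSVD(n_components=n_components, random_state=seed),
        Normalizer(norm="l2"),
    )


# Toy corpus (illustrative only).
toy_train = [
    "easy pasta recipe for dinner",
    "quick pasta dinner recipe",
    "morning gym workout routine",
    "quick morning workout at the gym",
    "easy dinner after the gym",
    "my morning pasta routine",
]

embedder = build_lsa_embedder(n_components=3)
train_vecs = embedder.fit_transform(toy_train)         # fit on training text only
new_vec = embedder.transform(["quick gym dinner"])     # reuse the fitted vocabulary

print(train_vecs.shape, new_vec.shape)
print(np.linalg.norm(train_vecs, axis=1))
```

Because the vocabulary and SVD basis are learned in `fit_transform`, calling only `transform` on unseen captions is what keeps held-out text from influencing the embedding space.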

model

Ridge regressor

The standardised embedding feeds a Ridge model predicting a 0–100 engagement index. GradientBoosting is supported too.
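As a rough sketch of this stage, the regressor can be a small pipeline that standardises each embedding dimension and then fits Ridge. The `alpha` value, the `kind` switch, and the clipping of predictions to 0–100 are assumptions made for illustration; the random matrix stands in for real embeddings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_regressor(kind="ridge", seed=0):
    """Return an unfitted regressor; 'kind' is a hypothetical switch."""
    if kind == "ridge":
        # alpha=1.0 is an assumed regularisation strength.
        return make_pipeline(StandardScaler(), Ridge(alpha=1.0))
    if kind == "gbr":
        return GradientBoostingRegressor(random_state=seed)
    raise ValueError(f"unknown regressor kind: {kind!r}")


def predict_index(model, X):
    """Predict and clip to the 0-100 index range (clipping is an assumption)."""
    return np.clip(model.predict(X), 0.0, 100.0)


# Stand-in data: 64-dim vectors with a planted linear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 64))
y = 50.0 + 10.0 * X[:, 0] - 5.0 * X[:, 1] + rng.normal(scale=1.0, size=300)

model = build_regressor("ridge").fit(X[:200], y[:200])
preds = predict_index(model, X[200:])
print(preds[:5])
```

Putting the scaler inside the pipeline means its means and variances are learned from the training rows only, the same discipline the embedder follows.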

baseline

Metadata-only

The same regressor on length, word count, #hashtags and #emoji — the surface signals a semantic model is meant to beat.
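The four surface features are cheap to compute. This is a sketch under assumptions: the helper name `metadata_features` is illustrative, and the emoji pattern covers only a few common Unicode blocks, so treat the emoji count as approximate.

```python
import re

HASHTAG_RE = re.compile(r"#\w+")
# Approximate emoji coverage: main pictograph blocks plus misc symbols/dingbats.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def metadata_features(caption):
    """Return [character length, word count, hashtag count, emoji count]."""
    return [
        float(len(caption)),
        float(len(caption.split())),
        float(len(HASHTAG_RE.findall(caption))),
        float(len(EMOJI_RE.findall(caption))),
    ]


print(metadata_features("best pasta ever #food #fyp \U0001F525"))
```

None of these features look at which words appear, which is exactly why they make a fair "no semantics" control.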

evaluate

Held-out R² & r

Both models are scored on the same 25% test split, with the embedder fit on training captions only, so nothing leaks from the test set. Semantic R² ≈ 0.69 vs ≈ 0.00 for the baseline.
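For reference, both metrics are a few lines of NumPy. The sketch below uses the standard definitions (scikit-learn's `r2_score` computes the same R²); the function names and sample numbers are illustrative. It also shows why R² ≈ 0 is a meaningful floor: a model that always predicts the mean of the targets it is scored on gets exactly zero.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def pearson_r(y_true, y_pred):
    """Linear correlation between targets and predictions."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


y_true = np.array([20.0, 35.0, 50.0, 65.0, 80.0])
mean_only = np.full_like(y_true, y_true.mean())   # ignores the input entirely
close_fit = y_true + np.array([2.0, -3.0, 1.0, -1.0, 2.0])

print(r_squared(y_true, mean_only))
print(r_squared(y_true, close_fit), pearson_r(y_true, close_fit))
```

R² can also go negative on held-out data, which means the model is doing worse than the mean predictor.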

test

pytest suite

11 tests: semantic beats baseline, embedding shapes/dims, and deterministic results under a fixed seed.

Architecture

How a prediction is made

Every caption travels the same path from text to number:

  1. Generate

    Build seeded synthetic captions whose engagement is planted to depend on topic and sentiment.

  2. Split

    Hold out 25% for testing; the embedder only ever sees the training captions.

  3. Embed

    Encode each caption into a 64-dim LSA vector (TF-IDF + truncated SVD).

  4. Fit

    Train a Ridge regressor to predict the engagement index from the embedding.

  5. Compare

    Score against a metadata-only baseline on the same split and report held-out R² honestly.
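The five steps can be strung together in one short script. This is a sketch under stated assumptions, not the build's code: the vocabularies, planted effects and `alpha` are hypothetical, the emoji feature is left out because these toy captions contain none, and the scores it prints will not match the 0.69 reported above.

```python
import random

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical vocabularies: (words, planted effect on engagement).
TOPICS = {
    "cooking": ("recipe pasta kitchen bake".split(), 70.0),
    "fitness": ("workout gym run stretch".split(), 55.0),
    "travel": ("beach flight hostel sunset".split(), 40.0),
    "finance": ("budget invoice tax savings".split(), 25.0),
}
SENTIMENTS = {
    "positive": ("love amazing best".split(), 10.0),
    "negative": ("hate awful worst".split(), -10.0),
}
FILLER = "today just really this my so again".split()


def generate(n=1200, seed=0):
    """Step 1: seeded captions whose engagement depends on topic + sentiment."""
    rng = random.Random(seed)
    captions, targets = [], []
    for _ in range(n):
        topic_words, base = TOPICS[rng.choice(sorted(TOPICS))]
        sent_words, shift = SENTIMENTS[rng.choice(sorted(SENTIMENTS))]
        words = rng.choices(topic_words, k=2) + rng.choices(sent_words, k=1)
        words += rng.choices(FILLER, k=rng.randint(1, 8))  # length is noise
        rng.shuffle(words)
        words += ["#fyp"] * rng.randint(0, 3)              # hashtags are noise
        captions.append(" ".join(words))
        targets.append(min(100.0, max(0.0, base + shift + rng.gauss(0.0, 5.0))))
    return captions, np.array(targets)


def metadata(captions):
    """Surface features only: character length, word count, hashtag count."""
    return np.array([[len(c), len(c.split()), c.count("#")] for c in captions],
                    dtype=float)


def run(seed=0):
    captions, y = generate(seed=seed)

    # Step 2: hold out 25% before anything is fitted.
    train_c, test_c, y_train, y_test = train_test_split(
        captions, y, test_size=0.25, random_state=seed)

    # Steps 3-4: embedder and regressor in one pipeline, fit on train only.
    semantic = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        TruncatedSVD(n_components=64, random_state=seed),
        Normalizer(norm="l2"),
        StandardScaler(),
        Ridge(alpha=1.0),
    ).fit(train_c, y_train)

    # Step 5: same regressor, surface features, same split.
    baseline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
    baseline.fit(metadata(train_c), y_train)

    return {
        "semantic_r2": r2_score(y_test, semantic.predict(test_c)),
        "baseline_r2": r2_score(y_test, baseline.predict(metadata(test_c))),
    }


scores = run(seed=0)
print(scores)
```

Wrapping the vectoriser, SVD and regressor in a single pipeline is one simple way to guarantee the no-leakage rule: `fit` never sees a test caption, and `predict` only applies what was learned from the training set.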

Reflection

What rebuilding it taught me