Machine learning · interactive demo

When every $1 matters.

A thousand synthetic credit-card transactions, ~1.5% of them fraudulent, scored by a calibrated model. Drag the threshold and watch the confusion matrix, precision, recall, and live alert stream re-classify in real time.


Live

Threshold dashboard

Synthetic dataset, deterministic across reloads. The "model" is a calibrated scorer with realistic overlap — frauds tend to score high, legits low, but the two distributions overlap enough that no threshold is perfect.
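As a rough illustration of how such a dataset could be built, here is a minimal sketch using only the Python standard library. The seed, the Beta distribution parameters, and the function name `make_transactions` are assumptions made for this example, not the demo's actual generator, and the sketch makes no attempt to calibrate the scores.

```python
import random

def make_transactions(n=1000, fraud_rate=0.015, seed=42):
    """Return a list of (score, is_fraud) pairs; same seed, same data."""
    rng = random.Random(seed)  # private generator, so every run reproduces the data
    n_fraud = round(n * fraud_rate)
    data = []
    for i in range(n):
        is_fraud = i < n_fraud
        # Assumed shapes: fraud scores skew high, legit scores skew low.
        # Both Beta distributions cover the whole 0..1 range, so they overlap.
        score = rng.betavariate(5, 2) if is_fraud else rng.betavariate(1.2, 8)
        data.append((score, is_fraud))
    rng.shuffle(data)  # mix frauds in among the legits
    return data

transactions = make_transactions()
n_fraud = sum(1 for _, is_fraud in transactions if is_fraud)
print(len(transactions), n_fraud)
```

Because the generator is seeded, calling `make_transactions()` twice returns identical lists, which is one simple way to get data that is deterministic across reloads.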

Decision threshold: 0.50

Slider: flag fewer (precision↑) at one end, flag more (recall↑) at the other.

Precision — of flagged, share that's truly fraud

Recall — of all fraud, share you caught

F1 — harmonic mean of P & R

Accuracy — misleading on imbalanced data

Confusion matrix

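Rows are the truth, columns are the prediction.

truth legit, predicted legit: true negative
truth legit, predicted fraud: false alarm
truth fraud, predicted legit: missed fraud
truth fraud, predicted fraud: caught fraud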
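The four cells can be counted directly from scores and labels. The sketch below assumes a transaction is flagged when its score is at or above the threshold. The function name and the six sample transactions are made up for illustration.

```python
def confusion_counts(scored, threshold):
    """Count the four outcomes for (score, is_fraud) pairs at a threshold."""
    tn = fp = fn = tp = 0
    for score, is_fraud in scored:
        flagged = score >= threshold  # assumed convention: at or above means flagged
        if flagged and is_fraud:
            tp += 1   # caught fraud
        elif flagged:
            fp += 1   # false alarm
        elif is_fraud:
            fn += 1   # missed fraud
        else:
            tn += 1   # true negative
    return {"true_negative": tn, "false_alarm": fp,
            "missed_fraud": fn, "caught_fraud": tp}

# Hypothetical sample: three legit and three fraudulent transactions.
sample = [(0.05, False), (0.20, False), (0.62, False),
          (0.35, True), (0.71, True), (0.93, True)]
print(confusion_counts(sample, 0.50))
```

At 0.50 this sample yields two true negatives, one false alarm, one missed fraud, and two caught frauds. Lowering the threshold to 0.30 catches the third fraud without adding a false alarm in this particular sample.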

ROC curve

Each point is a different threshold. The marker is your current pick.
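One way to trace such a curve is to sort transactions by score and lower the threshold one distinct score at a time, recording the false positive rate and true positive rate after each step. The sketch below does that and estimates the area with the trapezoid rule. The function names and the sample data are assumptions for illustration, and the code expects at least one fraud and one legit transaction.

```python
def roc_points(scored):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold."""
    positives = sum(1 for _, is_fraud in scored if is_fraud)
    negatives = len(scored) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    tp = fp = 0
    ordered = sorted(scored, key=lambda pair: pair[0], reverse=True)
    i = 0
    while i < len(ordered):
        score = ordered[i][0]
        # Consume every transaction tied at this score before emitting a point.
        while i < len(ordered) and ordered[i][0] == score:
            if ordered[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical sample: three legit and three fraudulent transactions.
sample = [(0.05, False), (0.20, False), (0.62, False),
          (0.35, True), (0.71, True), (0.93, True)]
points = roc_points(sample)
print(points[0], points[-1], round(auc(points), 3))
```

For this sample the area is 8/9: of the nine possible fraud and legit pairs, the fraud scores higher in eight.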


All 1,000 transactions, projected

Two abstract feature axes. Frauds tend to cluster top-right; legits to the lower-left. Marker outline reflects what the current threshold predicts.

Live transaction stream

Reading the dashboard

Why threshold is the whole game.

A binary classifier doesn't give you an answer — it gives you a score between 0 and 1. Someone has to choose where to draw the line. That's the threshold.

On a balanced dataset, 0.5 is a fine default. On imbalanced data, like credit-card fraud, where positives are often well under 1% of all transactions (about 1.5% in this demo), the same threshold can catch very little, because a calibrated model's scores are concentrated near zero. Lower the threshold and you catch more fraud (recall climbs), but you also raise more false alarms (precision falls). Raise it instead and you flag almost nothing: precision looks great, but you miss real fraud.
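The trade-off can be seen by sweeping a few thresholds over a small set of scores. The twelve scores below are invented for illustration (three frauds, nine legits, most legit scores near zero), and returning `None` for an undefined ratio is a convention chosen for this sketch.

```python
def precision_recall(scored, threshold):
    """Precision and recall for (score, is_fraud) pairs at a threshold."""
    tp = sum(1 for s, y in scored if s >= threshold and y)
    fp = sum(1 for s, y in scored if s >= threshold and not y)
    fn = sum(1 for s, y in scored if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else None  # undefined when nothing is flagged
    recall = tp / (tp + fn) if tp + fn else None     # undefined when there is no fraud
    return precision, recall

# Hypothetical scores: nine legit transactions, then three frauds.
scored = [(0.02, False), (0.03, False), (0.05, False), (0.06, False),
          (0.08, False), (0.11, False), (0.15, False), (0.24, False),
          (0.41, False), (0.18, True), (0.45, True), (0.92, True)]

for threshold in (0.9, 0.5, 0.2, 0.1):
    print(threshold, precision_recall(scored, threshold))
```

With these made-up scores, a threshold of 0.5 catches one of the three frauds with no false alarms, while 0.1 catches all three at the price of four false alarms.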

The right threshold is a business decision, not a statistical one: how much does a missed fraud cost compared with a false alarm that annoys a customer? The dashboard above lets you feel that trade-off rather than read about it. Watch the confusion matrix as you drag, and notice that "accuracy" barely moves. That is why accuracy is a poor metric for this problem.
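If the two kinds of error can be given a price, one simple approach is to pick the candidate threshold with the lowest total cost. The sketch below assumes a fixed cost per missed fraud and per false alarm. Those costs, the candidate thresholds, the scores, and the function name `best_threshold` are all hypothetical.

```python
def best_threshold(scored, cost_missed_fraud, cost_false_alarm, candidates):
    """Return the candidate threshold with the lowest total error cost."""
    def total_cost(threshold):
        missed = sum(1 for s, y in scored if s < threshold and y)
        alarms = sum(1 for s, y in scored if s >= threshold and not y)
        return missed * cost_missed_fraud + alarms * cost_false_alarm
    # On a tie, min() keeps the earliest candidate in the sequence.
    return min(candidates, key=total_cost)

# Hypothetical scores: nine legit transactions, then three frauds.
scored = [(0.02, False), (0.03, False), (0.05, False), (0.06, False),
          (0.08, False), (0.11, False), (0.15, False), (0.24, False),
          (0.41, False), (0.18, True), (0.45, True), (0.92, True)]
candidates = (0.1, 0.2, 0.5, 0.9)

print(best_threshold(scored, 100, 5, candidates))   # missed fraud is expensive
print(best_threshold(scored, 10, 20, candidates))   # false alarms are expensive
```

On this data, pricing a missed fraud at 100 and a false alarm at 5 selects 0.1, while pricing them at 10 and 20 selects 0.5. The scores are the same in both cases, and only the costs move the answer.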

What the metrics mean here

Precision — of the transactions you flagged, what fraction were really fraud.
Recall — of the real fraud cases, what fraction you successfully flagged.
F1 — the harmonic mean of precision and recall; a single number that punishes lopsided trade-offs.
Accuracy — fraction of all transactions you classified correctly. On a 98.5% negative class, a model that flags nothing already scores 98.5%.
AUC — the area under the ROC curve. Threshold-independent quality of the underlying scorer.
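The first four definitions can be written out from the confusion-matrix counts. This is a minimal sketch. Returning `None` when a ratio is undefined is a convention chosen here (some libraries report 0 instead), and the counts in the second example are made up.

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else None  # undefined if nothing is flagged
    recall = tp / (tp + fn) if tp + fn else None     # undefined if there is no fraud
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": (tp + tn) / total}

# A model that flags nothing on 1,000 transactions with 15 frauds.
flag_nothing = metrics(tp=0, fp=0, fn=15, tn=985)

# A hypothetical model that catches 10 of the 15 frauds with 20 false alarms.
catches_most = metrics(tp=10, fp=20, fn=5, tn=965)

print(flag_nothing["accuracy"], flag_nothing["recall"])
print(catches_most["accuracy"], catches_most["recall"])
```

The model that flags nothing has the higher accuracy of the two even though its recall is zero, which is the reason accuracy is labeled misleading on imbalanced data.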