A thousand synthetic credit-card transactions, ~1.5% of them fraudulent, scored by a calibrated model. Drag the threshold and watch the confusion matrix, precision, recall, and live alert stream update in real time.
Synthetic dataset, deterministic across reloads. The "model" is a calibrated scorer with realistic overlap — frauds tend to score high, legits low, but the two distributions overlap enough that no threshold is perfect.
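A dataset like this could be generated roughly as follows. This is only a sketch, not the dashboard's actual generator: the function name, the seed, and the Beta shape parameters are assumptions chosen for illustration, and the scores it produces are not guaranteed to be calibrated. It shows the two properties described above: a fixed seed makes the data identical on every run, and the two score distributions overlap.

```python
import random

def make_transactions(n=1000, fraud_rate=0.015, seed=42):
    """Return a list of (score, is_fraud) pairs; the same seed gives the same data."""
    rng = random.Random(seed)        # local RNG, so the output is reproducible
    n_fraud = round(n * fraud_rate)  # 15 frauds out of 1000
    rows = []
    for i in range(n):
        is_fraud = i < n_fraud
        # Illustrative shapes (assumed): frauds skew high, legits skew low,
        # and the tails of the two distributions overlap.
        if is_fraud:
            score = rng.betavariate(5, 2)
        else:
            score = rng.betavariate(1.2, 8)
        rows.append((score, is_fraud))
    rng.shuffle(rows)                # mix frauds in among the legits
    return rows

transactions = make_transactions()
n_frauds = sum(1 for _, is_fraud in transactions if is_fraud)
print(len(transactions), "transactions,", n_frauds, "fraudulent")
```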
Each point is a different threshold. The marker is your current pick.
Two abstract feature axes. Frauds tend to cluster top-right; legits to the lower-left. Marker outline reflects what the current threshold predicts.
A binary classifier doesn't give you an answer — it gives you a score between 0 and 1. Someone has to choose where to draw the line. That's the threshold.
On a balanced dataset, 0.5 is a fine default. On imbalanced data, like credit-card fraud where positives are often well under 1% of all transactions, the same threshold can miss most of the fraud, because a calibrated model's scores are concentrated near zero and few transactions ever reach 0.5. Lower the threshold and you start catching more fraud (recall climbs), but you also start raising more false alarms (precision falls). Raise it and you flag almost nothing: your precision looks great, but you're missing real fraud.
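To make the trade-off concrete, here is a minimal sketch that counts the confusion matrix at three thresholds. The twelve scores and labels are made-up toy data, not the dashboard's dataset, and the helper names are hypothetical.

```python
def confusion(scores, labels, threshold):
    """Count (TP, FP, FN, TN), where score >= threshold means 'flag as fraud'."""
    tp = fp = fn = tn = 0
    for score, is_fraud in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_fraud:
            tp += 1
        elif flagged:
            fp += 1
        elif is_fraud:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall(tp, fp, fn):
    # Convention (an assumption here): precision is 1.0 when nothing is flagged.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data: 9 legitimate transactions followed by 3 frauds.
scores = [0.02, 0.03, 0.05, 0.08, 0.10, 0.15, 0.22, 0.35, 0.41, 0.30, 0.45, 0.80]
labels = [False] * 9 + [True] * 3

for threshold in (0.9, 0.5, 0.25):
    tp, fp, fn, tn = confusion(scores, labels, threshold)
    precision, recall = precision_recall(tp, fp, fn)
    print(f"t={threshold:.2f}  TP={tp} FP={fp} FN={fn} TN={tn}  "
          f"precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, a threshold of 0.5 catches one fraud out of three with no false alarms, while 0.25 catches all three at the price of two false alarms.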
The right threshold is a business decision, not a statistical one: how much does a missed fraud cost compared with a false alarm that annoys a customer? The dashboard above lets you feel that trade-off rather than read about it. Watch the confusion matrix as you drag, and notice that "accuracy" barely moves. With about 1.5% fraud, a threshold that flags nothing at all is still about 98.5% accurate, which is why we don't use accuracy on this problem.
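One way to turn that business question into a number is to assign a cost to each kind of error and pick the threshold with the lowest total. The sketch below does this on made-up toy data; the two cost constants are hypothetical and would have to come from the business, and the helper name is an assumption.

```python
COST_MISSED_FRAUD = 100.0  # hypothetical cost of one false negative
COST_FALSE_ALARM = 5.0     # hypothetical cost of one false positive

def evaluate(scores, labels, threshold):
    """Return (total_cost, accuracy) when score >= threshold flags fraud."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    cost = fn * COST_MISSED_FRAUD + fp * COST_FALSE_ALARM
    accuracy = 1 - (fp + fn) / len(labels)
    return cost, accuracy

# Toy data: 9 legitimate transactions followed by 3 frauds.
scores = [0.02, 0.03, 0.05, 0.08, 0.10, 0.15, 0.22, 0.35, 0.41, 0.30, 0.45, 0.80]
labels = [False] * 9 + [True] * 3

results = {t: evaluate(scores, labels, t) for t in (0.1, 0.25, 0.5, 0.9)}
best_threshold = min(results, key=lambda t: results[t][0])

for t, (cost, accuracy) in results.items():
    print(f"t={t:.2f}  cost={cost:6.1f}  accuracy={accuracy:.3f}")
print("lowest-cost threshold:", best_threshold)
```

With these assumed costs, thresholds 0.25 and 0.5 have exactly the same accuracy on the toy data (two errors out of twelve each), yet their costs differ by a factor of twenty, because accuracy treats a missed fraud and a false alarm as equally bad.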