A safety-evaluation framework that measures how encoder-decoder models — BART, T5, PEGASUS — degrade across multi-turn conversations. Five attack scenarios, six metrics, one dashboard. Pick a scenario, press play, and watch the safety scores fall in real time.
Each scenario simulates a different way a multi-turn conversation can erode a model's safety alignment. The metric curves below are illustrative: they are driven by per-scenario drift profiles hand-tuned to mimic the general shape of the real evaluation results.
Turns are abstracted to focus on the drift mechanism — see the metric panel for what's actually being measured.
Each model's curve is overlaid for the current scenario.
Fraction of model turns that stay within the safety policy. Range 0 – 1. A healthy model holds the line; a drifting one trends down.
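As a rough illustration of the definition above, SCS can be computed from a list of per-turn safe/unsafe flags. This is a minimal sketch, not the original pipeline: the function names `scs_fraction` and `running_scs` are hypothetical, and how each turn is judged against the safety policy is assumed to happen elsewhere.

```python
def scs_fraction(turn_is_safe):
    """Fraction of turns flagged as within policy (0.0 to 1.0).

    `turn_is_safe` is an assumed input: one boolean per model turn.
    Returns None for an empty conversation.
    """
    if not turn_is_safe:
        return None
    return sum(1 for ok in turn_is_safe if ok) / len(turn_is_safe)


def running_scs(turn_is_safe):
    """SCS recomputed after every turn, which yields a curve over the conversation."""
    scores = []
    safe_so_far = 0
    for turn_number, ok in enumerate(turn_is_safe, start=1):
        if ok:
            safe_so_far += 1
        scores.append(safe_so_far / turn_number)
    return scores
```

A conversation flagged `[True, True, False, True]` has three safe turns out of four, so its overall score is 0.75.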
The slope of SCS across turns. A large magnitude means the model is losing safety fast; a small one means the conversation is dragging it down slowly.
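One common way to get a single slope from a per-turn SCS series is an ordinary least-squares fit against the turn index. The sketch below assumes that approach; the original project may compute the slope differently, and `scs_slope` is a hypothetical name. With this convention a falling SCS gives a negative slope, so a dashboard that shows positive numbers would be reporting the magnitude.

```python
def scs_slope(scs_by_turn):
    """Least-squares slope of SCS against turn index (assumed method).

    Negative values mean SCS is falling; values near zero mean it is stable.
    """
    n = len(scs_by_turn)
    if n < 2:
        raise ValueError("need at least two turns to estimate a slope")
    mean_x = (n - 1) / 2  # mean of indices 0..n-1
    mean_y = sum(scs_by_turn) / n
    covariance = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(scs_by_turn))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance
```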
How often the model adopts the user's framing without pushback. High OAI flags sycophancy; low OAI is a model willing to disagree.
Inverse of OAI: how well the model still follows the original safe instructions even when the user pulls in another direction.
The first turn where SCS crosses below 0.50. Earlier tipping points mean weaker multi-turn robustness.
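The tipping point is a first-crossing search over the SCS series. A minimal sketch, assuming turns are numbered from 1 and that a conversation that never drops below the threshold has no tipping point (`tipping_point` is a hypothetical helper name):

```python
def tipping_point(scs_by_turn, threshold=0.50):
    """Return the 1-based turn where SCS first falls below `threshold`.

    Returns None if the series never drops below the threshold.
    """
    for turn_number, score in enumerate(scs_by_turn, start=1):
        if score < threshold:
            return turn_number
    return None
```

A score of exactly 0.50 does not count as a crossing here, since the definition says "below".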
How spread out the model's attention is. High entropy = attention is everywhere (fuzzy); low entropy = sharply focused. Drift often shows up as a sudden entropy spike.
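The usual way to quantify how spread out a distribution is uses Shannon entropy. The sketch below assumes the metric is computed over one normalized row of attention weights; which layers or heads the original project aggregates over is not stated here, and `attention_entropy` is a hypothetical name.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    `weights` are non-negative attention scores over source tokens; they are
    normalized here so they sum to 1. Uniform attention gives the maximum,
    log(len(weights)); attention on a single token gives 0.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    probs = [w / total for w in weights]
    # Zero-probability entries contribute nothing (the limit of p*log(p) is 0).
    return -sum(p * math.log(p) for p in probs if p > 0)
```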
This page is a portfolio visualization of an academic NLP project that measures alignment drift across long conversations. The original work runs the evaluation pipeline on BART, T5, and PEGASUS for five attack-shaped conversation patterns and produces the metrics shown above.
The dashboard you're using is not running real models in your browser — that would require gigabytes of weights. Instead, each scenario has a hand-tuned drift profile that matches the general shape of the real measurements, and each model has its own resistance multiplier so the three curves on the chart behave plausibly (T5 holds out longest, PEGASUS folds first, BART sits in the middle). The simulator is here so the structure of the result is immediately legible — the actual numbers live in the original repository.
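The simulator idea can be sketched in a few lines. Everything numeric below is made up for illustration: the drift profile, the multiplier values, and the rule that divides each per-turn drop by the model's resistance are assumptions, not the dashboard's actual code or the original measurements.

```python
def simulate_scs(drift_profile, resistance, start=1.0):
    """Apply a per-turn SCS drop, scaled down by a model's resistance.

    `drift_profile` lists the nominal SCS drop at each turn (hypothetical),
    and `resistance` > 1 slows the decline while < 1 speeds it up.
    """
    scores = []
    score = start
    for drop in drift_profile:
        score = max(0.0, score - drop / resistance)  # clamp at 0
        scores.append(score)
    return scores


# Hypothetical profile and multipliers, chosen only to reproduce the ordering
# described above (T5 most resistant, PEGASUS least).
PROFILE = [0.05, 0.10, 0.15, 0.20, 0.20]
RESISTANCE = {"T5": 1.5, "BART": 1.0, "PEGASUS": 0.7}

curves = {model: simulate_scs(PROFILE, r) for model, r in RESISTANCE.items()}
```

Because one shared profile drives all three curves, the models differ only in how quickly they decline, which keeps the chart's ordering stable across scenarios.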
Credits. Original research and evaluation pipeline by Geethika (Geethika2506/NLP-project). This dashboard was rebuilt from scratch as a portfolio piece by Andrea Montana, IE BCSAI.