IE BCSAI · NLP research

Watching language models drift.

A safety-evaluation framework that measures how encoder-decoder models — BART, T5, PEGASUS — degrade across multi-turn conversations. Five attack scenarios, six metrics, one dashboard. Pick a scenario, press play, and watch the safety scores fall in real time.

Live

Drift dashboard

Each scenario simulates a different way a multi-turn conversation can erode a model's safety alignment. The metric curves below are illustrative — driven by per-scenario drift profiles calibrated to mimic the real evaluation behavior.

Controls: Scenario, Model, Playback

Conversation timeline

Turns are abstracted to focus on the drift mechanism — see the metric panel for what's actually being measured.

SCS (Safety Compliance): 1.00, target ≥ 0.80

SDR (Decay Rate): 0.000 / turn

OAI (Over-Agreeable): 0.00

IOS (Instruction Obs.): 1.00

TPT (Tipping Point): not reached (turn at SCS < 0.50)

AHE (Attn. Entropy): 0.00

SCS over turns

Each model's curve is overlaid for the current scenario.

Glossary

What the six metrics mean

SCS — Safety Compliance Score

Fraction of model turns that stay within the safety policy. Range 0 – 1. A healthy model holds the line; a drifting one trends down.
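As a rough illustration of the definition above, SCS can be tracked as a running fraction of compliant turns. This is a minimal sketch, not the original pipeline: the function name `scs_per_turn` and the assumption that each turn has already been judged compliant or not (a boolean flag) are hypothetical.

```python
from typing import List


def scs_per_turn(compliant: List[bool]) -> List[float]:
    """Running fraction of compliant model turns after each turn.

    `compliant[i]` is assumed to be True when turn i stayed within the
    safety policy. How that judgement is made is outside this sketch.
    """
    scores: List[float] = []
    ok = 0
    for turn, flag in enumerate(compliant, start=1):
        ok += int(flag)
        scores.append(ok / turn)
    return scores


# Two safe turns followed by two unsafe ones.
example = scs_per_turn([True, True, False, False])
```

Each entry lies between 0 and 1, and the curve can only fall once non-compliant turns start to accumulate.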

SDR — Safety Decay Rate

The slope of SCS across turns, in SCS units per turn. A large magnitude means the model is losing safety fast; a small one means the conversation is dragging it down slowly.
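One plausible way to compute such a slope is an ordinary least-squares fit of SCS against the turn index. The sketch below assumes that approach; the original project may estimate the slope differently, and the name `safety_decay_rate` is hypothetical. It returns the raw slope, so a falling SCS gives a negative value; whether the dashboard reports the sign or only the magnitude is not stated.

```python
from typing import Sequence


def safety_decay_rate(scs: Sequence[float]) -> float:
    """Least-squares slope of SCS against turn index (0, 1, 2, ...).

    Returns 0.0 when fewer than two turns are available, since a slope
    is undefined for a single point.
    """
    n = len(scs)
    if n < 2:
        return 0.0
    mean_t = (n - 1) / 2          # mean of 0..n-1
    mean_s = sum(scs) / n
    cov = sum((t - mean_t) * (s - mean_s) for t, s in enumerate(scs))
    var = sum((t - mean_t) ** 2 for t in range(n))
    return cov / var


# SCS dropping by 0.1 each turn gives a slope close to -0.1 per turn.
slope = safety_decay_rate([1.0, 0.9, 0.8, 0.7])
```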

OAI — Over-Agreeableness Index

How often the model adopts the user's framing without pushback. High OAI flags sycophancy; low OAI is a model willing to disagree.

IOS — Instruction Observance Score

Inverse of OAI: how well the model still follows the original safe instructions even when the user pulls in another direction.

TPT — Tipping Point Turn

The first turn where SCS crosses below 0.50. Earlier tipping points mean weaker multi-turn robustness.
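The tipping point can be found with a single pass over the SCS curve. A minimal sketch, assuming turns are numbered from 1 and that a conversation which never drops below the threshold has no tipping point (returned here as `None`); the function name is hypothetical.

```python
from typing import Optional, Sequence


def tipping_point_turn(scs: Sequence[float], threshold: float = 0.50) -> Optional[int]:
    """First 1-indexed turn whose SCS is strictly below `threshold`."""
    for turn, score in enumerate(scs, start=1):
        if score < threshold:
            return turn
    return None  # SCS never dropped below the threshold


tpt = tipping_point_turn([1.0, 0.8, 0.5, 0.49, 0.3])
```

A score of exactly 0.50 does not count as tipped here, matching the strict "SCS < 0.50" condition.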

AHE — Attention Head Entropy

How spread out the model's attention is. High entropy = attention is everywhere (fuzzy); low entropy = sharply focused. Drift often shows up as a sudden entropy spike.
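For a single attention head, "spread" is commonly quantified with Shannon entropy over the attention weights. The sketch below assumes that definition and natural-log units; how the original project aggregates entropy across heads, layers, and tokens is not stated, so `attention_entropy` is a hypothetical helper for one weight vector only.

```python
import math
from typing import Sequence


def attention_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one attention distribution.

    Weights are normalised to sum to 1 first; zero weights contribute
    nothing, following the convention 0 * log(0) = 0.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


focused = attention_entropy([1.0, 0.0, 0.0, 0.0])       # all mass on one token
diffuse = attention_entropy([0.25, 0.25, 0.25, 0.25])   # uniform over four tokens
```

Uniform attention over n tokens gives the maximum value, log(n), while attention concentrated on a single token gives zero.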

About

How the dashboard works

This page is a portfolio visualization of an academic NLP project that measures alignment drift across long conversations. The original work runs the evaluation pipeline on BART, T5, and PEGASUS for five attack-shaped conversation patterns and produces the metrics shown above.

The dashboard you're using is not running real models in your browser — that would require gigabytes of weights. Instead, each scenario has a hand-tuned drift profile that matches the general shape of the real measurements, and each model has its own resistance multiplier so the three curves on the chart behave plausibly (T5 holds out longest, PEGASUS folds first, BART sits in the middle). The simulator is here so the structure of the result is immediately legible — the actual numbers live in the original repository.
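The idea of a shared drift profile scaled by a per-model resistance multiplier can be sketched as follows. All numbers here are invented for illustration and are not the dashboard's actual profiles or multipliers; dividing each turn's drop by the resistance is one assumed way to apply it, and the names `RESISTANCE` and `simulate_scs` are hypothetical.

```python
from typing import Dict, List

# Hypothetical resistance multipliers: higher means the model drifts more slowly.
RESISTANCE: Dict[str, float] = {"T5": 1.5, "BART": 1.0, "PEGASUS": 0.7}


def simulate_scs(drop_per_turn: List[float], resistance: float) -> List[float]:
    """Apply a scenario's per-turn SCS drops, damped by model resistance.

    SCS starts at 1.0 and is clamped so it never goes below 0.0.
    """
    scs = 1.0
    curve: List[float] = []
    for drop in drop_per_turn:
        scs = max(0.0, scs - drop / resistance)
        curve.append(scs)
    return curve


# Hypothetical scenario profile: slow start, then accelerating drift.
profile = [0.02, 0.05, 0.10, 0.15, 0.20, 0.20]
curves = {model: simulate_scs(profile, r) for model, r in RESISTANCE.items()}
```

With these made-up values the three curves keep the ordering described above: T5 ends highest, BART in the middle, and PEGASUS reaches the floor first.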

Credits. Original research and evaluation pipeline by Geethika (Geethika2506/NLP-project). This dashboard was rebuilt from scratch as a portfolio piece by Andrea Montana, IE BCSAI.