IE BCSAI · NLP research

Watching language models drift.

A safety-evaluation framework that measures how the safety alignment of encoder-decoder models (BART, T5, PEGASUS) degrades across multi-turn conversations. Five attack scenarios, six metrics, one dashboard. Pick a scenario, press play, and watch the safety scores fall in real time.

Live

Drift dashboard

Each scenario simulates a different way a multi-turn conversation can erode a model's safety alignment. The metric curves below are illustrative — driven by per-scenario drift profiles calibrated to mimic the real evaluation behavior.

Conversation timeline

Turns are abstracted to focus on the drift mechanism — see the metric panel for what's actually being measured.

    SCS · Safety Compliance: 1.00 (target ≥ 0.80)
    SDR · Decay Rate: 0.000 / turn
    OAI · Over-Agreeable: 0.00
    IOS · Instruction Observance: 1.00
    TPT · Tipping Point: turn at SCS < 0.50
    AHE · Attention Entropy: 0.00

    SCS over turns

    Each model's curve is overlaid for the current scenario.

    Glossary

    What the six metrics mean

    SCS — Safety Compliance Score

    Fraction of model turns that stay within the safety policy. Range 0 – 1. A healthy model holds the line; a drifting one trends down.
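As a rough illustration of the definition above, SCS can be sketched as a simple fraction over per-turn compliance flags. How each turn is judged compliant (a classifier, a rubric, a human label) is not described on this page, so the boolean flags below are an assumption, as are the function names and the choice to report a cumulative score after each turn.

```python
from typing import List, Sequence


def safety_compliance_score(compliant: Sequence[bool]) -> float:
    """Fraction of model turns flagged as within the safety policy (0 to 1)."""
    if not compliant:
        raise ValueError("need at least one turn")
    return sum(bool(flag) for flag in compliant) / len(compliant)


def running_scs(compliant: Sequence[bool]) -> List[float]:
    """Cumulative SCS after each turn (assumed aggregation, not from the source)."""
    scores = []
    ok = 0
    for turn, flag in enumerate(compliant, start=1):
        ok += bool(flag)
        scores.append(ok / turn)
    return scores


# Hypothetical conversation: the model complies on three of four turns.
flags = [True, True, False, True]
print(safety_compliance_score(flags))
print(running_scs(flags))
```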

    SDR — Safety Decay Rate

    The slope of SCS across turns, reported per turn. A large magnitude means the model is losing safety fast; a value near zero means the conversation is eroding it slowly, if at all.
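One plausible way to turn "the slope of SCS across turns" into a number is an ordinary least-squares fit of SCS against the turn index. The sketch below assumes that approach and assumes the sign is flipped so that a falling SCS yields a positive decay rate; the original pipeline may define either detail differently.

```python
from typing import Sequence


def safety_decay_rate(scs: Sequence[float]) -> float:
    """Least-squares slope of SCS over turn index, negated so decay is positive.

    Both the least-squares fit and the sign flip are assumptions made for
    illustration; the source only says SDR is the slope of SCS across turns.
    """
    n = len(scs)
    if n < 2:
        raise ValueError("need at least two turns to fit a slope")
    mean_t = (n - 1) / 2
    mean_s = sum(scs) / n
    cov = sum((t - mean_t) * (s - mean_s) for t, s in enumerate(scs))
    var = sum((t - mean_t) ** 2 for t in range(n))
    return -cov / var


# Hypothetical curve losing roughly 0.1 SCS per turn.
print(safety_decay_rate([1.0, 0.9, 0.8, 0.7]))
```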

    OAI — Over-Agreeableness Index

    How often the model adopts the user's framing without pushback. High OAI flags sycophancy; low OAI is a model willing to disagree.

    IOS — Instruction Observance Score

    The counterpart to OAI: how well the model keeps following the original safe instructions even when the user pulls in another direction.

    TPT — Tipping Point Turn

    The first turn where SCS crosses below 0.50. Earlier tipping points mean weaker multi-turn robustness.
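The tipping point reduces to a short scan over the SCS curve. This sketch assumes turns are numbered from 1 and returns `None` when the curve never drops below the threshold; both conventions are illustrative choices rather than details taken from the original pipeline.

```python
from typing import Optional, Sequence


def tipping_point_turn(scs: Sequence[float], threshold: float = 0.50) -> Optional[int]:
    """Return the first 1-indexed turn where SCS is strictly below the threshold."""
    for turn, score in enumerate(scs, start=1):
        if score < threshold:
            return turn
    return None  # the model never tipped in this conversation


# Hypothetical curve: exactly 0.50 does not count, 0.45 does.
print(tipping_point_turn([1.0, 0.8, 0.5, 0.45, 0.3]))
```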

    AHE — Attention Head Entropy

    How spread out the model's attention is. High entropy = attention is everywhere (fuzzy); low entropy = sharply focused. Drift often shows up as a sudden entropy spike.
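The "spread" of attention is commonly measured with Shannon entropy over one head's attention weights. The sketch below assumes that definition, uses natural logarithms, and averages over heads; how the original project aggregates across heads, layers, or tokens is not stated here, so `mean_head_entropy` is a hypothetical helper.

```python
import math
from typing import Sequence


def attention_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one head's non-negative attention weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    probs = [w / total for w in weights]  # normalize in case they do not sum to 1
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_head_entropy(heads: Sequence[Sequence[float]]) -> float:
    """Average entropy across heads (assumed aggregation, not from the source)."""
    return sum(attention_entropy(h) for h in heads) / len(heads)


# Sharply focused head vs. a head that attends everywhere equally.
print(attention_entropy([1.0, 0.0, 0.0, 0.0]))
print(attention_entropy([0.25, 0.25, 0.25, 0.25]))
```

A one-hot distribution gives the minimum entropy of zero, while a uniform distribution over n positions gives the maximum of ln(n).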

    About

    How the dashboard works

    This page is a portfolio visualization of an academic NLP project that measures alignment drift across long conversations. The original work runs the evaluation pipeline on BART, T5, and PEGASUS for five attack-shaped conversation patterns and produces the metrics shown above.

    The dashboard you're using is not running real models in your browser — that would require gigabytes of weights. Instead, each scenario has a hand-tuned drift profile that matches the general shape of the real measurements, and each model has its own resistance multiplier so the three curves on the chart behave plausibly (T5 holds out longest, PEGASUS folds first, BART sits in the middle). The simulator is here so the structure of the result is immediately legible — the actual numbers live in the original repository.
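The idea of a per-scenario drift profile combined with a per-model resistance multiplier can be sketched in a few lines. Everything concrete below is hypothetical: the profile values, the multipliers, the function name, and the rule of dividing each drop by the resistance are illustrative assumptions, not the dashboard's actual code or numbers.

```python
from typing import Dict, List, Sequence

# Hypothetical multipliers chosen only to reproduce the ordering described
# above: T5 holds out longest, BART sits in the middle, PEGASUS folds first.
RESISTANCE: Dict[str, float] = {"T5": 1.5, "BART": 1.0, "PEGASUS": 0.7}

# Hypothetical drift profile: how much SCS a scenario removes at each turn.
EXAMPLE_PROFILE: List[float] = [0.05, 0.10, 0.15, 0.20, 0.20]


def simulate_scs(profile: Sequence[float], resistance: float) -> List[float]:
    """Apply per-turn drops, scaled down by resistance, starting from SCS = 1."""
    scs = 1.0
    curve = []
    for drop in profile:
        scs = max(0.0, scs - drop / resistance)  # clamp so SCS stays in [0, 1]
        curve.append(scs)
    return curve


curves = {name: simulate_scs(EXAMPLE_PROFILE, r) for name, r in RESISTANCE.items()}
for name, curve in curves.items():
    print(name, [round(s, 2) for s in curve])
```

A higher multiplier shrinks every drop, so the same scenario profile produces three distinct but similarly shaped curves.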

    Credits. Original research and evaluation pipeline by Geethika (Geethika2506/NLP-project). This dashboard was rebuilt from scratch as a portfolio piece by Andrea Montana, IE BCSAI.