compl.ai — AI compliance rule engine

Try it

Audit an example system, live

The real engine is Python (complai/ — a YAML requirements framework, a rule engine, text/JSON reports, pytest-covered). The widget below is a faithful re-implementation of the same rules and scoring in JavaScript so you can see an audit happen in the browser. Pick a risk tier, toggle the safeguards a system has in place, and watch the report update.

Score = 100 × (earned severity weight) ÷ (total applicable weight). Pass earns full weight, partial half, fail zero. Severity weights: low 1, medium 3, high 6, critical 10.
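The scoring rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the engine's actual code: the names SEVERITY_WEIGHT, STATUS_CREDIT and score are hypothetical, and returning 100 when nothing is applicable is an assumption.

```python
# Severity weights and status credits exactly as stated in the scoring rule.
SEVERITY_WEIGHT = {"low": 1, "medium": 3, "high": 6, "critical": 10}
STATUS_CREDIT = {"pass": 1.0, "partial": 0.5, "fail": 0.0}


def score(results):
    """Return the 0-100 score for (severity, status) pairs.

    Only requirements applicable to the system's tier should be passed in.
    """
    total = sum(SEVERITY_WEIGHT[sev] for sev, _ in results)
    if total == 0:
        # Assumption: a system with no applicable requirements scores 100.
        return 100.0
    earned = sum(SEVERITY_WEIGHT[sev] * STATUS_CREDIT[status]
                 for sev, status in results)
    return 100.0 * earned / total


example = [
    ("critical", "pass"),   # earns 10 of 10
    ("high", "partial"),    # earns 3 of 6
    ("medium", "fail"),     # earns 0 of 3
    ("low", "pass"),        # earns 1 of 1
]
print(round(score(example), 1))  # 14 earned of 20 total -> 70.0
```

Note how the single failed medium requirement costs three points of weight, while a failed critical one would have cost ten.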

Honest scope. compl.ai is a structured self-assessment checklist, not legal advice and not certification. Requirement wording is a plain-language paraphrase of public frameworks for engineering self-review — verify against the current regulation.

What it is

Regulation, turned into a checklist that runs

AI regulation arrives as dense prose — obligations about transparency, data quality, documentation, human oversight and risk management. compl.ai translates that prose into a catalogue of discrete, checkable requirements, then evaluates a given system against each one. The output isn't a vibe; it's a report: this requirement met, that one failed, this one needs evidence — with a risk-weighted score on top.

The grounding is real. Each requirement cites a principle from the EU AI Act (Reg. (EU) 2024/1689 — risk management, data governance, documentation, transparency, human oversight, accuracy & robustness) or the NIST AI Risk Management Framework 1.0. Systems are classified by risk tier, because obligations scale with how much harm a system could do.

The stack

From a regulation to an audit report

Model the rules, classify the system, evaluate, score, report.

model

Requirements catalogue

15 requirements in YAML, each with id, severity, an applicable-tier list, a declarative check and a source citation.
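As a rough picture of what one catalogue entry might look like once the YAML is loaded, here is a minimal sketch. The five fields mirror the ones listed above, but the class name, the field names, the validation and the example entry (including its id "HO-01") are all hypothetical.

```python
from dataclasses import dataclass

SEVERITIES = ("low", "medium", "high", "critical")
TIERS = ("minimal", "limited", "high")


@dataclass
class Requirement:
    id: str
    severity: str   # one of SEVERITIES
    tiers: list     # risk tiers the requirement applies to
    check: dict     # declarative check, interpreted by the rule engine
    source: str     # citation for the underlying principle

    def __post_init__(self):
        # Reject typos in the catalogue early instead of at audit time.
        if self.severity not in SEVERITIES:
            raise ValueError(f"{self.id}: unknown severity {self.severity!r}")
        unknown = [t for t in self.tiers if t not in TIERS]
        if unknown:
            raise ValueError(f"{self.id}: unknown tiers {unknown}")


# Hypothetical entry, shaped like one parsed YAML mapping.
raw = {
    "id": "HO-01",
    "severity": "critical",
    "tiers": ["high"],
    "check": {"type": "field_true", "field": "oversight.human_in_loop"},
    "source": "EU AI Act, Art. 14 (human oversight)",
}
req = Requirement(**raw)
print(req.id, req.severity, req.tiers)
```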

classify

Risk tiering

minimal / limited / high. High-risk systems trigger stricter requirements that lower tiers skip.
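Because each requirement carries its own tier list, deciding what applies can be a simple filter over data. The sketch below assumes requirements are plain dicts with a "tiers" key; the three catalogue entries and their ids are made up for illustration.

```python
TIERS = ("minimal", "limited", "high")


def applicable(requirements, tier):
    """Return the requirements whose tier list includes the system's tier."""
    if tier not in TIERS:
        raise ValueError(f"unknown risk tier: {tier!r}")
    return [r for r in requirements if tier in r["tiers"]]


# Hypothetical three-entry catalogue.
catalogue = [
    {"id": "TR-01", "tiers": ["minimal", "limited", "high"]},
    {"id": "DOC-01", "tiers": ["limited", "high"]},
    {"id": "HO-01", "tiers": ["high"]},
]

for tier in TIERS:
    print(tier, [r["id"] for r in applicable(catalogue, tier)])
# minimal ['TR-01']
# limited ['TR-01', 'DOC-01']
# high ['TR-01', 'DOC-01', 'HO-01']
```

Adding or retiring an obligation for a tier then means editing a list in the catalogue rather than touching engine code.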

intake

System spec

A model-card-like YAML/JSON document: intended use, datasets, oversight, accuracy, logging.

evaluate

Rule engine

Runs each check (field_true / present / gte / in) and resolves pass / fail / partial with evidence.
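A minimal sketch of how the four check types could be interpreted against a system spec follows. It rests on several assumptions: specs are nested dicts, fields are addressed by dotted paths, and the key names ("type", "field", "value", "values") are invented here. How the real engine arrives at a partial result is not described on this page, so the sketch resolves only pass and fail.

```python
_MISSING = object()


def lookup(spec, path):
    """Resolve a dotted path such as 'oversight.human_in_loop'."""
    node = spec
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    return node


def evaluate(check, spec):
    """Return (status, evidence) for one declarative check."""
    field = check["field"]
    value = lookup(spec, field)
    if value is _MISSING:
        return "fail", f"{field} is not declared"
    kind = check["type"]
    if kind == "field_true":
        ok = value is True
    elif kind == "present":
        ok = value not in (None, "", [], {})
    elif kind == "gte":
        # bool is a subclass of int, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        ok = is_number and value >= check["value"]
    elif kind == "in":
        ok = value in check["values"]
    else:
        raise ValueError(f"unknown check type: {kind}")
    return ("pass" if ok else "fail"), f"{field} = {value!r}"


# Hypothetical spec fragment.
spec = {
    "intended_use": "CV screening",
    "oversight": {"human_in_loop": True},
    "accuracy": {"reported": 0.87},
}

print(evaluate({"type": "field_true", "field": "oversight.human_in_loop"}, spec))
print(evaluate({"type": "gte", "field": "accuracy.reported", "value": 0.9}, spec))
print(evaluate({"type": "present", "field": "logging.retention_days"}, spec))
```

Returning the observed value as evidence alongside the status is what lets a report say why a requirement failed, not just that it did.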

score

Risk-weighted score

Severity-weighted 0–100 score plus a risk-exposure total so the worst gaps surface first.
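One plausible reading of the risk-exposure total is the severity weight that was not earned, summed over all gaps and listed heaviest first. The sketch below implements that reading. It is an assumption rather than the engine's documented definition, and the function name and requirement ids are hypothetical.

```python
SEVERITY_WEIGHT = {"low": 1, "medium": 3, "high": 6, "critical": 10}
# Assumption: a partial result leaves half of the weight exposed.
LOST_SHARE = {"pass": 0.0, "partial": 0.5, "fail": 1.0}


def risk_exposure(results):
    """Return (total exposure, gaps sorted by lost weight, heaviest first)."""
    gaps = [
        (rid, SEVERITY_WEIGHT[sev] * LOST_SHARE[status])
        for rid, sev, status in results
        if status != "pass"
    ]
    gaps.sort(key=lambda gap: gap[1], reverse=True)
    return sum(lost for _, lost in gaps), gaps


results = [
    ("R1", "medium", "fail"),
    ("R2", "critical", "fail"),
    ("R3", "high", "partial"),
    ("R4", "low", "pass"),
]
total, gaps = risk_exposure(results)
print(total)  # 16.0
print(gaps)   # [('R2', 10.0), ('R1', 3.0), ('R3', 3.0)]
```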

report

Report (text + JSON)

An itemised result with evidence, source and remediation for every gap.
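To make the two output formats concrete, here is a small sketch that builds one report structure and renders it both as JSON and as text. The item keys follow the fields named above (evidence, source, remediation); everything else, including the function names, the layout and the sample items, is hypothetical.

```python
import json


def build_report(system, items):
    """Bundle itemised results with a gap count."""
    gaps = [item for item in items if item["status"] != "pass"]
    return {"system": system, "gap_count": len(gaps), "items": items}


def render_text(report):
    """Render a plain-text view; remediation is shown only for gaps."""
    lines = [f"{report['system']}: {report['gap_count']} gap(s)"]
    for item in report["items"]:
        lines.append(f"[{item['status'].upper()}] {item['id']}: {item['evidence']}")
        if item["status"] != "pass":
            lines.append(f"    source: {item['source']}")
            lines.append(f"    fix: {item['remediation']}")
    return "\n".join(lines)


# Hypothetical items for illustration only.
items = [
    {"id": "HO-01", "status": "fail",
     "evidence": "oversight.human_in_loop is not declared",
     "source": "EU AI Act, Art. 14 (human oversight)",
     "remediation": "Assign a human reviewer who can override outputs."},
    {"id": "TR-01", "status": "pass",
     "evidence": "intended_use = 'clinical triage'",
     "source": "EU AI Act, Art. 13 (transparency)",
     "remediation": ""},
]

report = build_report("Example system", items)
print(render_text(report))
print(json.dumps(report, indent=2))
```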

Real example audits

Two systems, run through the Python engine

Output of python demo.py run over the two bundled specs. These are real outputs, not mock-ups:

HireScreen v3 (CV screening), high-risk: 100.0/100, compliant. Every one of the 15 applicable requirements passes.
TriageBot (clinical triage assistant), high-risk: 28.4/100, 9 gaps. It fails exactly the safeguards it omits: continuous risk monitoring, dataset bias review, documentation, human oversight, a stop mechanism, the accuracy floor, robustness testing, bias evaluation and logging. Three of the nine gaps are critical, and the total risk exposure is 63.
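The TriageBot figures can be cross-checked against the scoring formula. The sketch below assumes that all nine gaps are outright fails (no partials) and that risk exposure is the total weight lost; neither is stated explicitly here. Under those assumptions it searches for the integer total weight that reproduces the reported score.

```python
EXPOSURE = 63          # reported risk exposure for TriageBot
REPORTED_SCORE = 28.4  # reported score out of 100

# Assumption: every gap is a full fail, so earned weight = total - EXPOSURE.
candidates = [
    total
    for total in range(EXPOSURE + 1, 500)
    if round(100 * (total - EXPOSURE) / total, 1) == REPORTED_SCORE
]
print(candidates)  # [88]
```

Under these assumptions the 15 high-risk requirements carry 88 points of weight in total, of which TriageBot earns 25. The three critical gaps alone account for 30 of the 63 lost points.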

Reflection

What building it taught me

Modelling the rules is the project. Turning prose obligations into discrete, checkable requirements with a severity and a source is most of the work — and most of the value.
Risk tiers change everything. The same system faces wildly different obligations depending on how much harm it could cause, so applicability has to be data, not code.
Severity weighting makes a score honest. Missing a critical safeguard should hurt the score far more than missing a contact field; the weighting is what makes the number mean something.
Evidence and remediation beat verdicts. A "fail" is only useful if it says what proof was missing and the concrete next step to close it.
Compliance is a moving target. Regulations evolve, so the framework is a YAML file you update — not logic baked into the engine.