Before we trust an AI, we attack it thousands of times.
Then we publish what happened. Every number on this page comes from a real, public test.
An AI agent that can act on its own is only as good as its restraint. So we spend most of our effort trying to make ours misbehave — then measure two things: did it stay safe, and did it stay smart? This is the scoreboard.
0
times
Out of 13,752 attempts to push the agent into harmful or manipulative behaviour, it gave in zero times.
99.4%
of attacks stopped
Across 13,752 attacks spanning 11 different AI models, our method blocked 99.4% of them.
96.76%
less harmful
On a standard safety exam, harmful answers dropped by 96.76% — from 54% of the time down to 1.75%.
96.45%
still smart
Safety didn’t make it dumber: it kept 96.45% of its problem-solving ability across 15 models.
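For readers who want to check the "less harmful" figure, here is a minimal sketch of the arithmetic, assuming it is a relative reduction computed from the two rates quoted on this page (54% before, 1.75% after). The function name `relative_reduction` is ours for illustration and is not taken from the published scripts.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`, relative to `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Rates quoted on this page, in percent of harmful answers.
harmful_before = 54.0
harmful_after = 1.75

reduction = relative_reduction(harmful_before, harmful_after)
print(f"{reduction:.2f}% fewer harmful answers")  # prints "96.76% fewer harmful answers"
```

The drop is measured against the starting rate, which is why it differs from the plain difference of 52.25 percentage points.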
We bombard the AI with harmful requests and disguised “jailbreak” prompts, then count how often it gives in. Lower is better — and we push it toward zero.
Safety is easy if you make an AI refuse everything. So we also test it on real science, math, and reasoning. The score has to stay high — and it did.
HarmBench
A large library of harmful requests. We measure how often the AI refuses them.
Harmful answers fell from 54% to 1.75%.
Jailbreaks (PAIR & GCG)
Automated attacks that disguise a harmful request to sneak it past the AI’s guard.
Attack success rate driven from ~80–99% down to roughly 0%.
Agentic misalignment
Give the AI freedom to act on its own, then watch for scheming — lying, self-preservation, manipulation.
0 harmful behaviours across 8,632 scenarios and 6 AI architectures.
The Liar’s Benchmark
After the AI acts, does it tell the truth about what it actually did?
72% of jailbroken models lied. Ours: none.
Capability exams (GPQA · GSM8K · MMLU)
Graduate-level science, grade-school math, and general knowledge — is it still sharp?
Held steady or improved across the board (15 models, 29,561 questions).
Council-SIFT (digital forensics)
A forensics agent that reviews its own conclusions and throws out any it can’t back with evidence.
Caught 85 out of 85 unsupported findings.
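The Council-SIFT idea of throwing out any conclusion that lacks evidence can be illustrated with a small sketch. This is a hypothetical toy, not the actual Council-SIFT code: the `Finding` class, the `review` function, and the sample data are all assumptions made for illustration, and the real agent presumably judges evidence far more carefully than checking for an empty list.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """A conclusion plus the evidence items cited for it (hypothetical shape)."""
    claim: str
    evidence: list[str] = field(default_factory=list)


def review(findings: list[Finding]) -> tuple[list[Finding], list[Finding]]:
    """Split findings into (kept, rejected); a finding with no evidence is rejected."""
    kept = [f for f in findings if f.evidence]
    rejected = [f for f in findings if not f.evidence]
    return kept, rejected


# Made-up sample data, for illustration only.
sample = [
    Finding("Log file was modified after the incident", ["mtime record", "hash mismatch"]),
    Finding("Attacker used a stolen credential"),  # no evidence attached
]

kept, rejected = review(sample)
print(len(kept), "kept;", len(rejected), "rejected")  # prints "1 kept; 1 rejected"
```

The point of the design is that a claim has to earn its place: anything the agent cannot tie to a concrete piece of evidence is dropped rather than reported.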
We don’t ask you to take our word for it — the raw data, scripts, and reports are all public on github.com/davfd.
Behind these public tests sits an internal review Council that scores every new experiment. The rule we hold to: a strong result is logged with its limits attached. A great score becomes “a promising candidate — not yet cleared for real-world use,” never a finished claim.
That discipline, honest boundaries on every number, is the point. A benchmark whose framing you can’t trust isn’t a benchmark; it’s marketing.
Where these numbers come from
Everything here was gathered from public GitHub repositories on 10 June 2026 — 11 repositories checked, 10 carrying test results. For safety, we deliberately do not republish the actual harmful prompts; this page records the tests, the counts, and the outcomes, with links to the full reports.