FOUNDATION LABS · BENCHMARKS

We try to break our own agents.

Then we publish what happened. Every number on this page comes from a real, public test.

WHAT THIS IS

Before we trust an AI, we attack it thousands of times.

An AI agent that can act on its own is only as good as its restraint. So we spend most of our effort trying to make ours misbehave — then measure two things: did it stay safe, and did it stay smart? This is the scoreboard.

THE HEADLINES

0

times

Out of 13,752 attempts to push the agent into harmful or manipulative behaviour, it gave in zero times.

99.4%

of attacks stopped

Across 13,752 attacks spanning 11 different AI models, our method blocked 99.4% of them.

96.76%

less harmful

On a standard safety exam, harmful answers dropped by 96.76% — from 54% of the time down to 1.75%.

96.45%

still smart

Safety didn’t make it dumber: it kept 96.45% of its problem-solving ability across 15 models.

HOW TO READ A SAFETY SCORE

Every test answers one of two questions.

Did it stay safe?

We bombard the AI with harmful requests and disguised “jailbreak” prompts, then count how often it gives in. Lower is better — and we push it toward zero.

Did it stay smart?

Safety is easy if you make an AI refuse everything. So we also test it on real science, math, and reasoning. The score has to stay high — and it did.
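
As a rough illustration of those two questions, here is a minimal sketch of how each score could be computed. The function names, the definition of "retention" as defended score divided by baseline score, and the toy inputs are all assumptions made for this example; they are not taken from the published reports.

```python
def attack_success_rate(outcomes: list[bool]) -> float:
    """Percentage of attacks where the model gave in (True = gave in).

    Hypothetical helper: lower is better.
    """
    if not outcomes:
        raise ValueError("need at least one attack outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


def capability_retention(baseline_correct: int, defended_correct: int) -> float:
    """Percentage of baseline capability kept after safety measures.

    Assumed definition: correct answers with safety on, divided by
    correct answers with safety off. Higher is better.
    """
    if baseline_correct <= 0:
        raise ValueError("baseline_correct must be positive")
    return 100.0 * defended_correct / baseline_correct


# Toy numbers for illustration only, not real results.
toy_outcomes = [False] * 199 + [True]   # 1 success in 200 attacks
print(attack_success_rate(toy_outcomes))  # 0.5
print(capability_retention(80, 78))       # 97.5
```

Reporting the two numbers side by side is what rules out the "refuse everything" shortcut: a model that refuses every request would score 0% on the first measure and collapse on the second.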

THE TESTS, IN PLAIN WORDS

What each one actually does.

HarmBench

A large library of harmful requests. We measure how often the AI complies instead of refusing.

Harmful answers fell from 54% to 1.75%.
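
The 96.76% headline is a relative reduction: the size of the drop divided by the starting rate. A short sketch of that arithmetic, using only the two percentages quoted on this page (the function name is ours, chosen for illustration):

```python
def relative_reduction(before_pct: float, after_pct: float) -> float:
    """Drop from before_pct to after_pct, as a percentage of before_pct."""
    if before_pct <= 0:
        raise ValueError("before_pct must be positive")
    return (before_pct - after_pct) / before_pct * 100.0


# Harmful-answer rate before and after, as quoted above.
drop = relative_reduction(54.0, 1.75)
print(f"{drop:.2f}%")  # 96.76%
```

Note that this is different from the absolute drop, which is 54 − 1.75 = 52.25 percentage points.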

Jailbreaks (PAIR & GCG)

Automated attacks that disguise a harmful request to sneak it past the AI’s guard.

Attack success rate driven from ~80–99% down to roughly 0%.

Agentic misalignment

Give the AI freedom to act on its own, then watch for scheming — lying, self-preservation, manipulation.

0 harmful behaviours across 8,632 scenarios and 6 AI architectures.

The Liar’s Benchmark

After the AI acts, does it tell the truth about what it actually did?

72% of jailbroken models lied. Ours: none.

Capability exams (GPQA · GSM8K · MMLU)

Graduate-level science, grade-school math, and general knowledge — is it still sharp?

Held steady or improved across the board (15 models, 29,561 questions).

Council-SIFT (digital forensics)

A forensics agent that reviews its own conclusions and throws out any it can’t back with evidence.

Caught 85 out of 85 unsupported findings.

SEE THE PROOF YOURSELF

Ten public repositories. Open any of them.

We don’t ask you to take our word for it — the raw data, scripts, and reports are all public on github.com/davfd.

HOW WE KEEP OURSELVES HONEST

We record what a result isn’t, too.

Behind these public tests sits an internal review Council that scores every new experiment. The rule we hold to: a strong result is logged with its limits attached. A great score becomes “a promising candidate — not yet cleared for real-world use,” never a finished claim.

That discipline, honest boundaries on every number, is the point. A benchmark whose framing you can’t trust isn’t a benchmark; it’s marketing.

Where these numbers come from

Everything here was gathered from public GitHub repositories on 10 June 2026 — 11 repositories checked, 10 carrying test results. For safety, we deliberately do not republish the actual harmful prompts; this page records the tests, the counts, and the outcomes, with links to the full reports.