Where AI fails. In detail.

The leaderboard tells you which model wins. This page tells you why — failure mode catalogs, evidence quality scores, citation accuracy, hallucination distributions, and the cross-domain weaknesses every vendor pretends don't exist.

Where models invent. Where they don't.

Hallucination = the model produced a claim, the panel rejected it, and no rebuttal from any model succeeded. Figures below are hallucination rates by model. Lower is better.

Legal Citations

142K verifications
  • Claude 4.5: 8.2%
  • GPT-4o: 11.4%
  • Gemini 2.5: 13.7%
  • Grok 4: 21.8%
  • DeepSeek V3: 15.1%
Pattern: Fabricated case names cluster around state-court matters; federal cites are far more reliable across all models.

Statistical Claims

87K verifications
  • Claude 4.5: 5.9%
  • DeepSeek V3: 7.4%
  • GPT-4o: 8.8%
  • Gemini 2.5: 10.6%
  • Grok 4: 16.8%
Pattern: Specific percentage figures (e.g. "27.1%") are 3× more likely to be invented than rounded figures.

API / Library Names

214K verifications
  • DeepSeek V3: 4.8%
  • Claude 4.5: 7.1%
  • GPT-4o: 9.2%
  • Gemini 2.5: 12.7%
  • Grok 4: 17.4%
Pattern: Invented function signatures for real libraries are the most common code hallucination, far more common than invented libraries.

Medical Dosages

68K verifications
  • Claude 4.5: 3.1%
  • Gemini 2.5: 4.8%
  • GPT-4o: 6.2%
  • DeepSeek V3: 13.8%
  • Grok 4: 18.4%
Pattern: Dosage errors vary by route of administration: IV/IM dosing is 2.3× more likely to be wrong than oral.

Regulatory References

180K verifications
  • Gemini 2.5: 3.6%
  • Claude 4.5: 5.2%
  • GPT-4o: 9.7%
  • DeepSeek V3: 13.3%
  • Grok 4: 17.6%
Pattern: EU framework citations are more accurate than US ones across all models, likely a function of training data density.

Financial Figures

156K verifications
  • GPT-4o: 7.8%
  • Claude 4.5: 9.4%
  • Gemini 2.5: 11.7%
  • DeepSeek V3: 14.2%
  • Grok 4: 19.6%

Why panels disagree.

RANK | FAILURE MODE                    | DOMAIN     | FREQUENCY          | WORST MODEL    | BEST MODEL
01   | Fabricated case-law citation    | Legal      | 14.2% of legal Q   | Grok 4 (22%)   | Claude 4.5 (8%)
02   | Over-confident dosage assertion | Healthcare | 9.7% of clinical Q | Grok 4 (18%)   | Claude 4.5 (3%)
03   | Invented API method signature   | Code       | 9.4% of API Q      | Grok 4 (17%)   | DeepSeek (5%)
04   | Misapplied GDPR Article         | Compliance | 7.1% of EU Q       | Grok 4 (18%)   | Gemini 2.5 (3%)
05   | Conflated similar drug names    | Healthcare | 6.8% of pharm Q    | Grok 4 (12%)   | Claude 4.5 (2%)
06   | Specific-percentage fabrication | Statistics | 6.4% of stat Q     | Grok 4 (17%)   | Claude 4.5 (6%)
07   | Wrong statute jurisdiction      | Legal      | 5.9% of legal Q    | Grok 4 (16%)   | Claude 4.5 (4%)
08   | Outdated framework version      | Compliance | 5.2% of audit Q    | Grok 4 (12%)   | Gemini 2.5 (3%)
09   | Incorrect order of magnitude    | Finance    | 4.8% of fin Q      | Grok 4 (14%)   | GPT-4o (4%)
10   | Mismatched protocol guideline   | Healthcare | 4.1% of clinical Q | DeepSeek (14%) | Claude 4.5 (2%)
Frequency = % of questions in domain that exhibited this mode at least once across the panel. Each row links to the failure-mode methodology page with 20+ example cases.
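
The frequency definition above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: the record layout (question ID, model, set of flagged modes), the function name mode_frequency, and the toy data are all assumptions made for the example.

```python
def mode_frequency(observations, mode):
    """Share of questions where at least one panel member exhibited `mode`.

    observations: iterable of (question_id, model, modes) tuples, where
    `modes` is the set of failure modes flagged on that model's answer.
    (Hypothetical record layout, assumed for this sketch.)
    """
    questions = set()  # every question seen in the domain
    hit = set()        # questions where the mode appeared at least once
    for question_id, _model, modes in observations:
        questions.add(question_id)
        if mode in modes:
            hit.add(question_id)
    if not questions:
        return 0.0
    return len(hit) / len(questions)


# Toy data, illustrative only: 4 questions, 2 panel members each.
observations = [
    ("q1", "model_a", {"fabricated_citation"}),
    ("q1", "model_b", set()),
    ("q2", "model_a", set()),
    ("q2", "model_b", set()),
    ("q3", "model_a", {"fabricated_citation"}),
    ("q3", "model_b", {"fabricated_citation"}),
    ("q4", "model_a", {"wrong_jurisdiction"}),
    ("q4", "model_b", set()),
]

freq = mode_frequency(observations, "fabricated_citation")
print(f"{freq:.1%}")  # 2 of 4 questions -> 50.0%
```

A question counts once no matter how many models exhibit the mode (q3 above), which is why a mode's frequency can differ from any single model's rate.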

Where each model is moving.

REGRESSION · May 24, 2026

Gemini compliance accuracy dropped 4.4 points after vendor refresh.

Following a Google-side refresh on May 21, Gemini's score on GDPR Article 35 citations fell from 96.8 to 92.4. We've flagged this with the vendor; users on Tier 2 routing have been auto-rebalanced toward Claude/GPT-4o for the compliance domain.

n = 12,847 verifications · confidence 0.97

IMPROVEMENT · May 22, 2026

DeepSeek V3 closed the security-code gap to within 2 points.

DeepSeek's code-security hallucination rate has fallen from 9.1% to 4.8% over 12 weeks. The model now leads on API method-signature accuracy by a measurable margin.

n = 104,233 verifications · trend +2.1/wk

STABLE · May 19, 2026

Claude's lead on legal citation has held flat for 8 weeks.

Claude 4.5 has maintained a 3+ point lead on legal-domain accuracy since its release. The gap is widest on California and New York case-law citations — narrower on federal.

n = 184,032 verifications · σ = 0.4 pp

CHALLENGER · May 17, 2026

Grok's dissent rate has risen 6 points — and so has its concession rate.

Grok 4 challenges consensus more aggressively after its recent training refresh (now 44% of rounds, up from 38%). But it also concedes more readily — concession rate up from 64% to 71% within two rounds.

n = 98,447 verifications · average debate length 3 rounds

EMERGENCE · May 14, 2026

Qwen 3 Max is the fastest-improving panel member.

Up 4.7 points in two weeks since being added to the panel. Currently weighted at 5% of Tier 1 routing while it gathers panel-history data.

n = 122 verifications (early)

CROSS-DOMAIN · May 11, 2026

All models fail more often on novel jurisdiction combinations.

When questions span multiple unfamiliar regulatory regimes (crypto + EU + APAC), agreement increases while accuracy falls. This is the most dangerous panel state — high consensus, low survival under expert review.
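
One way to watch for this state is to compare panel agreement with expert-review survival per question bucket. The sketch below is an assumption-heavy illustration: the thresholds, bucket names, and numbers are hypothetical and are not published cutoffs or measured results.

```python
# Hypothetical thresholds; no cutoffs are stated on this page.
MIN_AGREEMENT = 0.90  # share of panel members agreeing on the answer
MAX_SURVIVAL = 0.60   # share of consensus answers surviving expert review


def dangerous_states(buckets):
    """Return bucket names with high panel agreement but low expert survival."""
    return sorted(
        name
        for name, (agreement, survival) in buckets.items()
        if agreement >= MIN_AGREEMENT and survival <= MAX_SURVIVAL
    )


# Illustrative numbers only, not measured results.
buckets = {
    "single-regime": (0.82, 0.91),
    "crypto + EU + APAC": (0.94, 0.55),
}

flagged = dangerous_states(buckets)
print(flagged)  # ['crypto + EU + APAC']
```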

How we measure failure.

Hallucination rate

A claim counts as a hallucination if (a) the model asserted it, (b) at least 3 of the other 4 panel members rejected it, (c) the assertion was not rescued by panel-supplied evidence in subsequent rounds, and (d) human reviewers confirmed the rejection on a randomized 1% audit sample.

This is a stricter standard than self-reported softmax confidence — and far stricter than retrieval-only benchmark sets where the test data is publicly known.
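
The four criteria can be read as a single predicate. The sketch below is one possible reading, not the audited implementation. The ClaimRecord fields and function names are hypothetical. Two points are assumptions the definition leaves open: claims outside the 1% audit sample are treated as passing criterion (d), and the rate is taken over asserted claims.

```python
from dataclasses import dataclass
from typing import Optional

REJECTION_QUORUM = 3  # "at least 3 of the other 4 panel members"


@dataclass
class ClaimRecord:
    asserted: bool                          # (a) the model asserted the claim
    peer_rejections: int                    # (b) count of other panel members rejecting it
    rescued: bool                           # (c) rescued by panel-supplied evidence later
    audit_confirmed: Optional[bool] = None  # (d) None = not in the audit sample


def is_hallucination(claim: ClaimRecord) -> bool:
    if not claim.asserted:
        return False
    if claim.peer_rejections < REJECTION_QUORUM:
        return False
    if claim.rescued:
        return False
    # Assumption: unaudited claims (None) pass; an audited claim counts
    # only if human reviewers confirmed the rejection.
    return claim.audit_confirmed is not False


def hallucination_rate(claims) -> float:
    # Assumption: the denominator is the number of asserted claims.
    asserted = [c for c in claims if c.asserted]
    if not asserted:
        return 0.0
    return sum(is_hallucination(c) for c in asserted) / len(asserted)


# Toy records, illustrative only.
claims = [
    ClaimRecord(True, 3, False),         # rejected by quorum, not rescued
    ClaimRecord(True, 2, False),         # below quorum
    ClaimRecord(True, 4, True),          # rescued in a later round
    ClaimRecord(True, 4, False, False),  # audit overturned the rejection
    ClaimRecord(True, 4, False, True),   # audit confirmed the rejection
]
print(hallucination_rate(claims))  # 0.4
```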

Failure mode catalog

Failure modes are computed by clustering rejected claims by content, then human-naming the cluster. The catalog is open and auditable: see github.com/bridge-protocol/failure-modes.
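
The clustering method is not specified here. As a rough illustration of "cluster by content, then name by hand", the sketch below uses a swapped-in technique: greedy single-pass grouping on token-set Jaccard similarity. The threshold, the example claims (including the made-up libfoo library and case name), and the cluster names are all hypothetical.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens of a claim, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_claims(claims, threshold=0.5):
    """Greedy single pass: join the first cluster whose seed claim is similar enough."""
    clusters = []  # list of (seed_tokens, member_claims)
    for claim in claims:
        claim_tokens = tokens(claim)
        for seed_tokens, members in clusters:
            if jaccard(claim_tokens, seed_tokens) >= threshold:
                members.append(claim)
                break
        else:
            # No existing cluster matched, so this claim seeds a new one.
            clusters.append((claim_tokens, [claim]))
    return [members for _seed, members in clusters]


# Made-up rejected claims, illustrative only.
rejected = [
    "Smith v. Jones established the rule in 1987",
    "Smith v. Jones established the rule in 1992",
    "libfoo.fetch accepts a retries argument",
    "libfoo.fetch accepts a retry_count argument",
]

groups = cluster_claims(rejected)

# The naming step stays manual: a reviewer labels each cluster.
names = ["Fabricated case-law citation", "Invented API method signature"]
for name, members in zip(names, groups):
    print(name, len(members))
```

A production system would more likely cluster on embeddings than on raw tokens; the token version is used here only because it runs without dependencies.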

Failure modes are versioned and timestamped. Models that improve on a specific failure mode after a vendor update are surfaced in the drift analysis under "Where each model is moving."
