The leaderboard tells you which model wins. This page tells you why — failure mode catalogs, evidence quality scores, citation accuracy, hallucination distributions, and the cross-domain weaknesses every vendor pretends don't exist.
Hallucination = the model produced a claim, the panel rejected it, and no model rebuttal succeeded. Lower is better.
| RANK | FAILURE MODE | DOMAIN | FREQUENCY | WORST MODEL | BEST MODEL |
|---|---|---|---|---|---|
| 01 | Fabricated case-law citation | Legal | 14.2% of legal Q | Grok 4 (22%) | Claude 4.5 (8%) |
| 02 | Over-confident dosage assertion | Healthcare | 9.7% of clinical Q | Grok 4 (18%) | Claude 4.5 (3%) |
| 03 | Invented API method signature | Code | 9.4% of API Q | Grok 4 (17%) | DeepSeek (5%) |
| 04 | Misapplied GDPR Article | Compliance | 7.1% of EU Q | Grok 4 (18%) | Gemini 2.5 (3%) |
| 05 | Conflated similar drug names | Healthcare | 6.8% of pharm Q | Grok 4 (12%) | Claude 4.5 (2%) |
| 06 | Specific-percentage fabrication | Statistics | 6.4% of stat Q | Grok 4 (17%) | Claude 4.5 (6%) |
| 07 | Wrong statute jurisdiction | Legal | 5.9% of legal Q | Grok 4 (16%) | Claude 4.5 (4%) |
| 08 | Outdated framework version | Compliance | 5.2% of audit Q | Grok 4 (12%) | Gemini 2.5 (3%) |
| 09 | Incorrect order of magnitude | Finance | 4.8% of fin Q | Grok 4 (14%) | GPT-4o (4%) |
| 10 | Mismatched protocol guideline | Healthcare | 4.1% of clinical Q | DeepSeek (14%) | Claude 4.5 (2%) |
Following a Google-side refresh on May 21, Gemini's score on GDPR Article 35 citations fell from 96.8 to 92.4. We've flagged this with the vendor; users on Tier 2 routing have been auto-rebalanced toward Claude/GPT-4o for the compliance domain.
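The rebalancing described above can be pictured as moving a share of one model's routing weight onto the models that take over the domain. The following is only a sketch under stated assumptions: the function `rebalance`, the weight values, and the 50% shift are hypothetical and are not taken from the production router.

```python
def rebalance(weights: dict[str, float], source: str,
              targets: list[str], fraction: float) -> dict[str, float]:
    """Move `fraction` of `source`'s routing weight onto `targets`.

    The moved weight is split across targets in proportion to their
    current weights, so the total routing weight is unchanged.
    All names and numbers here are illustrative assumptions.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if source in targets:
        raise ValueError("source cannot also be a target")
    target_total = sum(weights[t] for t in targets)
    if target_total <= 0:
        raise ValueError("targets must carry some existing weight")

    moved = weights[source] * fraction
    updated = dict(weights)  # leave the caller's mapping untouched
    updated[source] = weights[source] - moved
    for t in targets:
        updated[t] = weights[t] + moved * weights[t] / target_total
    return updated


# Hypothetical compliance-domain weights before the drift was detected.
before = {"gemini": 0.4, "claude": 0.3, "gpt-4o": 0.3}
after = rebalance(before, "gemini", ["claude", "gpt-4o"], fraction=0.5)
```

Splitting the moved weight proportionally keeps the relative balance between the receiving models intact; an equal split would be an equally reasonable choice.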
DeepSeek's code-security hallucination rate has fallen from 9.1% to 4.8% over 12 weeks. The model now leads on API method-signature accuracy, with the lowest invented-signature rate in the failure-mode catalog (5%).
Claude 4.5 has maintained a 3+ point lead on legal-domain accuracy since its release. The gap is widest on California and New York case-law citations — narrower on federal.
Grok 4 challenges consensus more aggressively after its recent training refresh (now 44% of rounds, up from 38%). But it also concedes more readily — concession rate up from 64% to 71% within two rounds.
A recently added panel member is up 4.7 points in the two weeks since it joined the panel. It is currently weighted at 5% of Tier 1 routing while it gathers panel-history data.
When questions span multiple unfamiliar regulatory regimes (crypto + EU + APAC), agreement increases while accuracy falls. This is the most dangerous panel state — high consensus, low survival under expert review.
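One way to watch for the state described above is to track two rates per round: how often the panel agrees, and how often its agreed answer survives expert review. The sketch below is illustrative only; the function `panel_state`, its labels, and both thresholds are assumptions, not published parameters.

```python
def panel_state(agreement: float, survival: float,
                high_agreement: float = 0.8,
                low_survival: float = 0.6) -> str:
    """Classify a panel round from two rates in [0, 1].

    agreement: share of panel members that converged on the same answer.
    survival:  share of agreed claims that held up under expert review.
    Both thresholds are hypothetical defaults chosen for illustration.
    """
    if not (0.0 <= agreement <= 1.0 and 0.0 <= survival <= 1.0):
        raise ValueError("rates must be between 0 and 1")
    if agreement >= high_agreement and survival < low_survival:
        # High consensus, low survival: the dangerous state.
        return "false-consensus"
    if agreement >= high_agreement:
        return "reliable-consensus"
    return "contested"
```

For example, `panel_state(0.9, 0.4)` returns `"false-consensus"`, while `panel_state(0.9, 0.9)` returns `"reliable-consensus"`.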
A claim counts as a hallucination if (a) the model asserted it, (b) at least 3 of the other 4 panel members rejected it, (c) the assertion was not rescued by panel-supplied evidence in subsequent rounds, and (d) human reviewers confirmed the rejection on a randomized 1% audit sample.
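The four criteria above can be read as a single predicate. The sketch below is a minimal illustration, not the production scoring code: the `ClaimRecord` fields and the function name are hypothetical. It also assumes that a claim outside the 1% audit sample is not blocked by criterion (d), since the text does not say how unsampled claims are treated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClaimRecord:
    """Hypothetical record of one claim's path through the panel."""
    asserted: bool            # (a) the model asserted the claim
    rejections: int           # how many of the other 4 panel members rejected it
    rescued: bool             # (c) panel-supplied evidence rescued it in a later round
    # (d) True/False if the claim was in the audit sample, None if not sampled
    audit_confirmed: Optional[bool] = None


def is_hallucination(claim: ClaimRecord, min_rejections: int = 3) -> bool:
    """Apply criteria (a) through (d) in order."""
    if not claim.asserted:
        return False
    if claim.rejections < min_rejections:
        return False
    if claim.rescued:
        return False
    # Assumption: only an explicit human overturn clears the claim;
    # unsampled claims (None) are left as panel-confirmed rejections.
    if claim.audit_confirmed is False:
        return False
    return True
```

A claim rejected by only two panel members, or one that is rescued in a later round, does not count, however confident the original assertion was.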
This is a stricter standard than self-reported softmax confidence — and far stricter than retrieval-only benchmark sets where the test data is publicly known.
Failure modes are computed by clustering rejected claims by content, then having human reviewers name each cluster. The catalog is open and auditable: see github.com/bridge-protocol/failure-modes.
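The clustering method itself is not specified here, so the sketch below substitutes a deliberately simple stand-in: greedy grouping by Jaccard token overlap. The function names, the sample claims, and the 0.5 threshold are all assumptions for illustration; a real pipeline would more plausibly use embeddings. Naming the resulting clusters remains a human step.

```python
import re


def tokens(text: str) -> frozenset[str]:
    """Lowercase alphanumeric tokens of a claim."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Overlap of two token sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_claims(claims: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Greedy clustering: each claim joins the first cluster whose seed
    claim is similar enough, otherwise it starts a new cluster."""
    clusters: list[tuple[frozenset[str], list[str]]] = []
    for claim in claims:
        toks = tokens(claim)
        for seed, members in clusters:
            if jaccard(seed, toks) >= threshold:
                members.append(claim)
                break
        else:
            clusters.append((toks, [claim]))
    return [members for _, members in clusters]


# Made-up rejected claims, for illustration only.
rejected = [
    "cites nonexistent case Alpha v. Beta 2019",
    "cites nonexistent case Alpha v. Beta 2021",
    "states dosage 500 mg twice daily",
]
groups = cluster_claims(rejected)
```

Here the two citation claims share six of eight distinct tokens and land in one cluster, while the dosage claim starts its own. A reviewer would then label the first group with a name such as "Fabricated case-law citation".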
Failure modes are versioned and timestamped. Models that improve on a specific failure mode after a vendor update are surfaced in the drift analysis section.