Ratings · BRIDGE

Where models invent. Where they don't.

A hallucination is a claim the model produced, the panel rejected, and no rebuttal rescued. Percentages below are hallucination rates; lower is better.

Legal Citations

142K verifications

Claude 4.5	8.2%
GPT-4o	11.4%
Gemini 2.5	13.7%
DeepSeek V3	15.1%
Grok 4	21.8%

Statistical Claims

87K verifications

Claude 4.5	5.9%
DeepSeek V3	7.4%
GPT-4o	8.8%
Gemini 2.5	10.6%
Grok 4	16.8%

API / Library Names

214K verifications

DeepSeek V3	4.8%
Claude 4.5	7.1%
GPT-4o	9.2%
Gemini 2.5	12.7%
Grok 4	17.4%

Medical Dosages

68K verifications

Claude 4.5	3.1%
Gemini 2.5	4.8%
GPT-4o	6.2%
DeepSeek V3	13.8%
Grok 4	18.4%

Regulatory References

180K verifications

Gemini 2.5	3.6%
Claude 4.5	5.2%
GPT-4o	9.7%
DeepSeek V3	13.3%
Grok 4	17.6%

Financial Figures

156K verifications

GPT-4o	7.8%
Claude 4.5	9.4%
Gemini 2.5	11.7%
DeepSeek V3	14.2%
Grok 4	19.6%

Why panels disagree.

RANK	FAILURE MODE	DOMAIN	FREQUENCY	WORST MODEL	BEST MODEL
01	Fabricated case-law citation	Legal	14.2% of legal Q	Grok 4 (22%)	Claude 4.5 (8%)
02	Over-confident dosage assertion	Healthcare	9.7% of clinical Q	Grok 4 (18%)	Claude 4.5 (3%)
03	Invented API method signature	Code	9.4% of API Q	Grok 4 (17%)	DeepSeek V3 (5%)
04	Misapplied GDPR Article	Compliance	7.1% of EU Q	Grok 4 (18%)	Gemini 2.5 (3%)
05	Conflated similar drug names	Healthcare	6.8% of pharm Q	Grok 4 (12%)	Claude 4.5 (2%)
06	Specific-percentage fabrication	Statistics	6.4% of stat Q	Grok 4 (17%)	Claude 4.5 (6%)
07	Wrong statute jurisdiction	Legal	5.9% of legal Q	Grok 4 (16%)	Claude 4.5 (4%)
08	Outdated framework version	Compliance	5.2% of audit Q	Grok 4 (12%)	Gemini 2.5 (3%)
09	Incorrect order of magnitude	Finance	4.8% of fin Q	Grok 4 (14%)	GPT-4o (4%)
10	Mismatched protocol guideline	Healthcare	4.1% of clinical Q	DeepSeek V3 (14%)	Claude 4.5 (2%)

Frequency = % of questions in domain that exhibited this mode at least once across the panel. Each row links to the failure-mode methodology page with 20+ example cases.
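The frequency definition above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the `Observation` record, its field names, and the `questions_per_domain` mapping are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One rejected claim, already labeled with a failure mode (hypothetical schema)."""
    question_id: str
    domain: str
    model: str
    failure_mode: str


def mode_frequency(observations, questions_per_domain, domain, failure_mode):
    """Share of questions in `domain` where at least one panel model showed `failure_mode`."""
    total = questions_per_domain.get(domain, 0)
    if total == 0:
        return 0.0
    # A question counts once, no matter how many models exhibited the mode on it.
    affected = {
        o.question_id
        for o in observations
        if o.domain == domain and o.failure_mode == failure_mode
    }
    return len(affected) / total
```

Collecting question IDs in a set is what implements "at least once across the panel": two models failing on the same question still contribute a single affected question.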

Where each model is moving.

REGRESSION · May 24, 2026

Gemini compliance accuracy dropped 4.4 points after vendor refresh.

Following a Google-side refresh on May 21, Gemini's score on GDPR Article 35 citations fell from 96.8 to 92.4. We've flagged this with the vendor; users on Tier 2 routing have been auto-rebalanced toward Claude and GPT-4o for the compliance domain.

IMPROVEMENT · May 22, 2026

DeepSeek V3 closed the security-code gap to within 2 points.

DeepSeek's code-security hallucination rate has fallen from 9.1% to 4.8% over 12 weeks. The model now leads on API method-signature accuracy: in API / Library Names its 4.8% rate is 2.3 points below the next-best model, Claude 4.5 (7.1%).

STABLE · May 19, 2026

Claude's lead on legal citation has held flat for 8 weeks.

Claude 4.5 has maintained a 3+ point lead on legal-domain accuracy since its release. The gap is widest on California and New York case-law citations and narrower on federal ones.

CHALLENGER · May 17, 2026

Grok's dissent rate has risen 6 points, and its concession rate has risen 7.

Grok 4 challenges consensus more aggressively after its recent training refresh (now 44% of rounds, up from 38%). But it also concedes more readily: the share of challenges it concedes within two rounds is up from 64% to 71%.
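As a rough illustration of how the two rates relate, the sketch below computes both from a list of per-round records. The `RoundRecord` schema is hypothetical, and the choice of denominator for the concession rate (rounds in which the model dissented, not all rounds) is an assumption rather than something the page states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoundRecord:
    """One panel round for one model (hypothetical schema)."""
    dissented: bool                 # the model challenged the panel consensus
    conceded_after: Optional[int]   # rounds until it conceded; None if it never did


def dissent_rate(rounds):
    """Fraction of all rounds in which the model challenged consensus."""
    if not rounds:
        return 0.0
    return sum(1 for r in rounds if r.dissented) / len(rounds)


def concession_rate(rounds, within=2):
    """Fraction of dissents that ended in a concession within `within` rounds.

    Assumption: the denominator is the number of dissents, not the number of rounds.
    """
    dissents = [r for r in rounds if r.dissented]
    if not dissents:
        return 0.0
    conceded = sum(
        1 for r in dissents
        if r.conceded_after is not None and r.conceded_after <= within
    )
    return conceded / len(dissents)
```

Under this reading the two numbers can rise together without contradiction: a model can challenge more often and still back down on a larger share of those challenges.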

EMERGENCE · May 14, 2026

Qwen 3 Max is the fastest-improving panel member.

Qwen 3 Max is up 4.7 points in the two weeks since it was added to the panel. It is currently weighted at 5% of Tier 1 routing while it gathers panel-history data.

CROSS-DOMAIN · May 11, 2026

All models fail more often on novel jurisdiction combinations.

When questions span multiple unfamiliar regulatory regimes (crypto + EU + APAC), panel agreement rises while accuracy falls. This is the most dangerous panel state: high consensus, low survival under expert review.
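One way to make the "high consensus, low survival" state concrete is a simple flag, sketched below. The agreement measure (share of panel members giving the most common answer) and both thresholds are illustrative assumptions, not values taken from the methodology.

```python
from collections import Counter


def agreement(answers):
    """Fraction of panel members that gave the most common answer."""
    if not answers:
        return 0.0
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def is_risky_consensus(answers, expert_survival, min_agreement=0.8, max_survival=0.5):
    """Flag rounds where the panel agrees strongly but experts overturn the answer often.

    `expert_survival` is the fraction of comparable answers that survived expert review.
    Both thresholds are hypothetical defaults chosen for illustration.
    """
    return agreement(answers) >= min_agreement and expert_survival < max_survival
```

The point of the flag is that agreement alone is not evidence of correctness: it only becomes useful when paired with an external check such as expert review.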

How we measure failure.

Hallucination rate

A claim counts as a hallucination if (a) the model asserted it, (b) at least 3 of the other 4 panel members rejected it, (c) the assertion was not rescued by panel-supplied evidence in subsequent rounds, and (d) human reviewers confirmed the rejection on a randomized 1% audit sample.

This is a stricter standard than self-reported softmax confidence, and far stricter than retrieval-only benchmark sets whose test data is publicly known.
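The four conditions can be read as a single predicate. The sketch below is a hedged illustration: the `ClaimRecord` fields are hypothetical, and because only a 1% sample is audited, it assumes that an unaudited claim counts as a hallucination when conditions (a) through (c) hold, while an audited claim counts only if reviewers confirmed the rejection.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClaimRecord:
    """Outcome of one claim after panel review (hypothetical schema)."""
    asserted: bool                   # (a) the model asserted the claim
    rejections: int                  # (b) how many other panel members rejected it
    rescued_by_evidence: bool        # (c) later rounds supplied evidence that rescued it
    audit_confirmed: Optional[bool]  # (d) None if the claim was not in the audit sample


def is_hallucination(claim, min_rejections=3):
    """Apply conditions (a)-(d); unaudited claims pass (d) by assumption."""
    if not claim.asserted:
        return False
    if claim.rejections < min_rejections:
        return False
    if claim.rescued_by_evidence:
        return False
    if claim.audit_confirmed is False:
        return False  # reviewers overturned the panel's rejection
    return True


def hallucination_rate(claims):
    """Hallucinations as a fraction of asserted claims (denominator is an assumption)."""
    asserted = [c for c in claims if c.asserted]
    if not asserted:
        return 0.0
    return sum(1 for c in asserted if is_hallucination(c)) / len(asserted)
```

With four other panel members, the default `min_rejections=3` matches the "at least 3 of the other 4" threshold stated above.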

Failure mode catalog

Failure modes are derived by clustering rejected claims by content; a human reviewer then names each cluster. The catalog is open and auditable: see github.com/bridge-protocol/failure-modes.

Failure modes are versioned and timestamped. Models that improve on a specific failure mode after a vendor update are surfaced in the drift analysis section.
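A minimal sketch of how per-mode improvements might be surfaced from two versioned snapshots follows. The `Snapshot` schema, the `min_drop` threshold, and the function name are assumptions for illustration; the actual drift analysis may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Rate for one model on one failure mode at one catalog version (hypothetical schema)."""
    model: str
    failure_mode: str
    version: str
    rate: float  # hallucination rate as a fraction, e.g. 0.091 for 9.1%


def improvements(before, after, min_drop=0.01):
    """Return (model, failure_mode, drop) where the rate fell by at least `min_drop`.

    Results are sorted by largest drop first. Pairs missing from `before` are skipped,
    since there is no baseline to compare against.
    """
    baseline = {(s.model, s.failure_mode): s.rate for s in before}
    surfaced = []
    for s in after:
        key = (s.model, s.failure_mode)
        if key not in baseline:
            continue
        drop = baseline[key] - s.rate
        if drop >= min_drop:
            surfaced.append((s.model, s.failure_mode, round(drop, 4)))
    return sorted(surfaced, key=lambda item: item[2], reverse=True)
```

Keying on the (model, failure mode) pair keeps the comparison per mode, so a model that improves on one mode while regressing on another is reported only for the mode that improved.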
