The BRIDGE Index · Q2 2026 · Frontier AI Performance

Five findings from the quarter.

FINDING 01

Claude Sonnet 4.5 leads legal analysis for the third consecutive quarter.

97.2 survival score · 3.1-point lead over GPT-4o · n = 142K

FINDING 02

GPT-4o's code review accuracy improved 4.2% from Q1.

Largest single-quarter gain across the panel · n = 192K

FINDING 03

Cross-model disagreement on medical queries remains highest at 34%.

Twice the rate of legal · 3× the rate of code · n = 68K

FINDING 04

DeepSeek V3 entered the top 3 for financial analysis.

Up from #6 in Q1 · +8.4 points over 12 weeks · n = 104K

FINDING 05

Average consensus confidence across all domains: 91.7%.

Up 1.2 pp from Q1 · stable disagreement rate · n = 847K

Get the full Q2 2026 report.

32 pages. Per-model per-domain breakdowns. Methodology appendix. Drift commentary. Cite-ready.


The panel.

RANK	MODEL	PROVIDER	SURVIVAL	WIN RATE	DEBATES WON	DISSENT	LATENCY · p50	TREND Q1→Q2
01	Claude Sonnet 4.5	ANTHROPIC	96.1	71.4%	184K	12.4%	2.3s	▲ +1.2
02	GPT-4o	OPENAI	93.4	65.2%	192K	14.7%	1.9s	▲ +0.8
03	Gemini 2.5 Pro	GOOGLE	91.0	61.8%	156K	17.2%	2.7s	▼ −0.4
04	DeepSeek V3	DEEPSEEK	88.7	57.9%	104K	22.4%	1.6s	▲ +2.1
05	Grok 4	xAI	85.3	54.1%	98K	38.1%	2.4s	● 0.0
06	Llama 4 Scout	META	82.6	49.8%	67K	24.7%	1.2s	▲ +0.5
07	Mistral Large 3	MISTRAL	79.4	46.2%	34K	21.6%	1.8s	▼ −1.1
08	Cohere Command R+	COHERE	76.8	43.1%	9K	27.3%	2.1s	▲ +0.3
09	Qwen 3 Max	ALIBABA	73.2	38.7%	0.1K	31.4%	1.4s	▲ +4.7

847K AI decisions verified · 9 models tracked · 9 providers

Quick-glance version → /leaderboard

Six domains. Four different winners.

A "best AI" doesn't exist. The best model varies by what you're asking it to do.

Legal Analysis

142K verifications

Claude 4.5	97.2
GPT-4o	94.1
Gemini 2.5	88.4
DeepSeek V3	87.1
Grok 4	79.6

Code Review

214K verifications

DeepSeek V3	94.8
Claude 4.5	93.2
GPT-4o	90.7
Gemini 2.5	84.3
Grok 4	81.5

Claims · Fact-Check

87K verifications

Claude 4.5	93.7
Gemini 2.5	91.6
GPT-4o	88.9
DeepSeek V3	81.2
Grok 4	77.4

Business · Strategy

156K verifications

GPT-4o	92.0
Claude 4.5	90.4
Gemini 2.5	88.1
Grok 4	85.7
DeepSeek V3	82.8

Healthcare

68K verifications

Claude 4.5	95.1
Gemini 2.5	90.8
GPT-4o	88.2
DeepSeek V3	78.1
Grok 4	73.9

Compliance

180K verifications

Gemini 2.5	96.4
Claude 4.5	94.8
GPT-4o	90.3
DeepSeek V3	86.7
Grok 4	82.4
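
The per-domain tables above can also be read programmatically. The following is a minimal Python sketch, not part of the Index tooling: the scores are transcribed from this page, and the names DOMAIN_SCORES, leader_and_margin, and LEADERS are hypothetical, introduced only for illustration. It picks the top-scoring model in each domain and its margin over the runner-up.

```python
# Top-five survival scores per domain, transcribed from the tables on this page.
DOMAIN_SCORES = {
    "Legal Analysis": {"Claude 4.5": 97.2, "GPT-4o": 94.1, "Gemini 2.5": 88.4,
                       "DeepSeek V3": 87.1, "Grok 4": 79.6},
    "Code Review": {"DeepSeek V3": 94.8, "Claude 4.5": 93.2, "GPT-4o": 90.7,
                    "Gemini 2.5": 84.3, "Grok 4": 81.5},
    "Claims · Fact-Check": {"Claude 4.5": 93.7, "Gemini 2.5": 91.6, "GPT-4o": 88.9,
                            "DeepSeek V3": 81.2, "Grok 4": 77.4},
    "Business · Strategy": {"GPT-4o": 92.0, "Claude 4.5": 90.4, "Gemini 2.5": 88.1,
                            "Grok 4": 85.7, "DeepSeek V3": 82.8},
    "Healthcare": {"Claude 4.5": 95.1, "Gemini 2.5": 90.8, "GPT-4o": 88.2,
                   "DeepSeek V3": 78.1, "Grok 4": 73.9},
    "Compliance": {"Gemini 2.5": 96.4, "Claude 4.5": 94.8, "GPT-4o": 90.3,
                   "DeepSeek V3": 86.7, "Grok 4": 82.4},
}


def leader_and_margin(scores):
    """Return (top model, lead over the runner-up in points) for one domain."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (top_model, top_score), (_, runner_up_score) = ranked[0], ranked[1]
    return top_model, round(top_score - runner_up_score, 1)


# Hypothetical helper table: domain -> (leader, margin in points).
LEADERS = {domain: leader_and_margin(scores)
           for domain, scores in DOMAIN_SCORES.items()}

for domain, (model, margin) in LEADERS.items():
    print(f"{domain}: {model} (+{margin})")
```

Margins are rounded to one decimal place to match the precision of the published scores.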

Where AI is least reliable.

Cross-model disagreement rate by model pair. Red cells (high disagreement) are where you need verification most. This is data no single model lab can produce, because each lab sees only its own output.

	CLAUDE	GPT-4o	GEMINI	GROK	DEEPSEEK
CLAUDE	—	7%	9%	31%	14%
GPT-4o	7%	—	10%	28%	16%
GEMINI	9%	10%	—	19%	17%
GROK	31%	28%	19%	—	24%
DEEPSEEK	14%	16%	17%	24%	—

Low < 12% · Moderate 12–24% · High > 24%
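
The legend above maps each pairwise rate to a band. As a rough illustration, the Python sketch below applies those thresholds to the matrix values on this page. The names DISAGREEMENT, band, and HIGH_PAIRS are hypothetical, and the handling of the exact boundaries (12% and 24% both counted as Moderate) is an assumption read off the legend rather than a documented rule.

```python
# Pairwise disagreement rates in percent, upper triangle of the matrix above.
# The matrix is symmetric, so each pair is listed once.
DISAGREEMENT = {
    ("CLAUDE", "GPT-4o"): 7,
    ("CLAUDE", "GEMINI"): 9,
    ("CLAUDE", "GROK"): 31,
    ("CLAUDE", "DEEPSEEK"): 14,
    ("GPT-4o", "GEMINI"): 10,
    ("GPT-4o", "GROK"): 28,
    ("GPT-4o", "DEEPSEEK"): 16,
    ("GEMINI", "GROK"): 19,
    ("GEMINI", "DEEPSEEK"): 17,
    ("GROK", "DEEPSEEK"): 24,
}


def band(rate_pct):
    """Classify a disagreement rate using the legend's thresholds.

    Assumption: 12 and 24 themselves fall in the Moderate band.
    """
    if rate_pct < 12:
        return "Low"
    if rate_pct <= 24:
        return "Moderate"
    return "High"


# Pairs in the High band, i.e. the cells the page renders in red.
HIGH_PAIRS = sorted(pair for pair, rate in DISAGREEMENT.items()
                    if band(rate) == "High")

for pair in HIGH_PAIRS:
    print(pair, DISAGREEMENT[pair])
```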

Who's getting better. Who isn't.

// SURVIVAL · QUARTERLY · Q3 2025 → Q2 2026

[Chart: quarterly survival score for Claude, GPT-4o, Gemini, Grok, and DeepSeek · x-axis: Q3 '25, Q4 '25, Q1 '26, Q2 '26 · y-axis: survival score, 77 to 97]

How the Index is computed. In plain text.

Adversarial survival scoring

A claim "survives" when it withstands adversarial challenge from the four other panel members across one or more debate rounds. The survival score is a weighted combination of initial agreement, position stability through debate, and strength of cited evidence at convergence, minus penalties for rounds required and concessions made.

Normalized 0–100. The score is computed only from observed panel runs. No fixed test set. No model can submit its own score. Full methodology →

What the Index is not

Not a softmax probability. Self-reported model confidence is excluded entirely.
Not a static benchmark. No fixed test set. The score reflects production verification workloads.
Not vendor-supplied. No model provider can submit data. Every datapoint comes from observed BRIDGE panel runs.
Not gameable. Models can't see the panel context they're competing against; debate evidence is structurally anonymized.
Not retrieval-only. Tasks involve generation under adversarial challenge.

survival_score.formula

# Adversarial survival score · weighted combination
survival = w1 · initial_agreement
         + w2 · position_stability_through_debate
         + w3 · evidence_strength_at_convergence
         − w4 · rounds_required
         − w5 · concessions_made

# Minimum sample size for publication: n >= 100 per (model, domain, quarter)
# Weights published in full methodology document
# Independence enforced by structural anonymization at debate time
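
To make the formula above concrete, here is a minimal Python sketch. It is not the Index implementation. The weight values are hypothetical placeholders (the real weights are published only in the methodology document), the three positive components are assumed to be fractions in [0, 1], and the clamp-and-scale step is only one possible way to reach the stated 0–100 normalization. The names Weights, survival_score, and publishable are likewise introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    # Hypothetical placeholder values, NOT the published weights.
    w1: float = 0.4   # initial agreement
    w2: float = 0.3   # position stability through debate
    w3: float = 0.3   # evidence strength at convergence
    w4: float = 0.02  # penalty per debate round required
    w5: float = 0.03  # penalty per concession made


# From the formula comments: n >= 100 per (model, domain, quarter).
MIN_SAMPLE_SIZE = 100


def survival_score(initial_agreement, position_stability, evidence_strength,
                   rounds_required, concessions_made, weights=Weights()):
    """Weighted combination minus penalties, scaled to 0-100.

    Assumes the first three inputs are fractions in [0, 1]; the clamp is an
    assumed normalization, not the documented one.
    """
    raw = (weights.w1 * initial_agreement
           + weights.w2 * position_stability
           + weights.w3 * evidence_strength
           - weights.w4 * rounds_required
           - weights.w5 * concessions_made)
    return 100.0 * min(1.0, max(0.0, raw))


def publishable(sample_size):
    """Apply the minimum-sample-size rule for publication."""
    return sample_size >= MIN_SAMPLE_SIZE
```

With these placeholder weights, each extra debate round or concession lowers the score, while stronger initial agreement, stability, or evidence raises it. The relative sizes of those effects depend entirely on the real weights.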

DOWNLOAD METHODOLOGY (PDF · 48p) →

Failure-mode appendix →

The Index is free. The data underneath is licensable.

The quarterly summary on this page is published openly. Disaggregated performance data, historical trends, and custom analytical queries are available for research institutions, AI labs, and enterprise teams.

RESEARCH INSTITUTIONS

Research Access

Full disaggregated data, historical trends, anonymized query patterns, co-publication opportunities. For universities, AI safety labs, policy think tanks, government offices.

Apply for Research Access →

ENTERPRISE

Enterprise Intelligence

Custom analytics dashboards. Real-time monitoring. Alerting when model accuracy drops below threshold. Industry-specific intelligence.

See Enterprise Intelligence →

AI LABS · FRONTIER

Lab Partnership

Real-time data feeds. Custom benchmark design. Pre-publication review. Joint research programs with frontier labs. High-touch engagement.

Open conversation →

How to cite the Index.

BRIDGE Protocol. (2026). The BRIDGE Index Q2 2026: Frontier AI Performance Measured by Consensus. Retrieved from https://getbridge.dev/index

Licensed under CC BY-NC 4.0 for academic and non-commercial use. Commercial reuse requires written permission. DOI assignment pending — provisional citation above.