Q2 2026 · PUBLISHED MAY 27, 2026

Frontier AI Performance. Measured by Consensus.

The only benchmark computed from real production workloads across real model panels with real adversarial verification. Not synthetic. Not vendor-supplied. Not gameable.

Five findings from the quarter.

FINDING 01

Claude Sonnet 4.5 leads legal analysis for the third consecutive quarter.

97.2 survival score · 3.1-point lead over GPT-4o · n = 142K

FINDING 02

GPT-4o's code review accuracy improved 4.2 pp from Q1.

Largest single-quarter gain across the panel · n = 192K

FINDING 03

Cross-model disagreement on medical queries remains highest at 34%.

Twice the rate of legal · 3× the rate of code · n = 68K

FINDING 04

DeepSeek V3 entered the top 3 for financial analysis.

Up from #6 in Q1 · +8.4 points over 12 weeks · n = 104K

FINDING 05

Average consensus confidence across all domains: 91.7%.

Up 1.2 pp from Q1 · stable disagreement rate · n = 847K

Get the full Q2 2026 report.

32 pages. Per-model per-domain breakdowns. Methodology appendix. Drift commentary. Cite-ready.

Quarterly cadence. Unsubscribe any time. Never shared.

The panel.

RANK  MODEL              PROVIDER   SURVIVAL  WIN RATE  DEBATES WON  DISSENT  LATENCY (p50)  TREND Q1→Q2
01    Claude Sonnet 4.5  ANTHROPIC  96.1      71.4%     184K         12.4%    2.3s           ▲ +1.2
02    GPT-4o             OPENAI     93.4      65.2%     192K         14.7%    1.9s           ▲ +0.8
03    Gemini 2.5 Pro     GOOGLE     91.0      61.8%     156K         17.2%    2.7s           ▼ −0.4
04    DeepSeek V3        DEEPSEEK   88.7      57.9%     104K         22.4%    1.6s           ▲ +2.1
05    Grok 4             xAI        85.3      54.1%     98K          38.1%    2.4s           ● 0.0
06    Llama 4 Scout      META       82.6      49.8%     67K          24.7%    1.2s           ▲ +0.5
07    Mistral Large 3    MISTRAL    79.4      46.2%     34K          21.6%    1.8s           ▼ −1.1
08    Cohere Command R+  COHERE     76.8      43.1%     9K           27.3%    2.1s           ▲ +0.3
09    Qwen 3 Max         ALIBABA    73.2      38.7%     0.1K         31.4%    1.4s           ▲ +4.7

847K AI decisions verified · 9 models tracked · 9 providers

Quick-glance version → /leaderboard

Six domains. Four different winners.

A "best AI" doesn't exist. The best model varies by what you're asking it to do.

Legal Analysis

142K verifications
  • Claude 4.5 · 97.2
  • GPT-4o · 94.1
  • Gemini 2.5 · 88.4
  • DeepSeek V3 · 87.1
  • Grok 4 · 79.6
Leader: Claude Sonnet 4.5 · Biggest move: DeepSeek V3 (+3.4 pp QoQ)

Code Review

214K verifications
  • DeepSeek V3 · 94.8
  • Claude 4.5 · 93.2
  • GPT-4o · 90.7
  • Gemini 2.5 · 84.3
  • Grok 4 · 81.5
Leader: DeepSeek V3 · Biggest move: GPT-4o (+4.2 pp QoQ)

Claims · Fact-Check

87K verifications
  • Claude 4.5 · 93.7
  • Gemini 2.5 · 91.6
  • GPT-4o · 88.9
  • DeepSeek V3 · 81.2
  • Grok 4 · 77.4
Leader: Claude · Biggest move: Gemini (−1.3 pp QoQ)

Business · Strategy

156K verifications
  • GPT-4o · 92.0
  • Claude 4.5 · 90.4
  • Gemini 2.5 · 88.1
  • Grok 4 · 85.7
  • DeepSeek V3 · 82.8
Leader: GPT-4o · Most consistent: 9.2-point spread (smallest of any domain)

Healthcare

68K verifications
  • Claude 4.5 · 95.1
  • Gemini 2.5 · 90.8
  • GPT-4o · 88.2
  • DeepSeek V3 · 78.1
  • Grok 4 · 73.9
Leader: Claude · Largest gap of any domain (21 points top-to-bottom)

Compliance

180K verifications
  • Gemini 2.5 · 96.4
  • Claude 4.5 · 94.8
  • GPT-4o · 90.3
  • DeepSeek V3 · 86.7
  • Grok 4 · 82.4
Leader: Gemini 2.5 Pro
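
The leader and top-to-bottom spread quoted on each domain card can be recomputed from the card scores alone. The following is a minimal sketch, not part of the published methodology; the names `DOMAIN_SCORES` and `summarize` are illustrative, and the scores are copied from the six cards above (top five models per domain only).

```python
# Scores copied from the six domain cards (top five models per domain).
DOMAIN_SCORES = {
    "Legal Analysis": {"Claude 4.5": 97.2, "GPT-4o": 94.1, "Gemini 2.5": 88.4,
                       "DeepSeek V3": 87.1, "Grok 4": 79.6},
    "Code Review": {"DeepSeek V3": 94.8, "Claude 4.5": 93.2, "GPT-4o": 90.7,
                    "Gemini 2.5": 84.3, "Grok 4": 81.5},
    "Claims · Fact-Check": {"Claude 4.5": 93.7, "Gemini 2.5": 91.6, "GPT-4o": 88.9,
                            "DeepSeek V3": 81.2, "Grok 4": 77.4},
    "Business · Strategy": {"GPT-4o": 92.0, "Claude 4.5": 90.4, "Gemini 2.5": 88.1,
                            "Grok 4": 85.7, "DeepSeek V3": 82.8},
    "Healthcare": {"Claude 4.5": 95.1, "Gemini 2.5": 90.8, "GPT-4o": 88.2,
                   "DeepSeek V3": 78.1, "Grok 4": 73.9},
    "Compliance": {"Gemini 2.5": 96.4, "Claude 4.5": 94.8, "GPT-4o": 90.3,
                   "DeepSeek V3": 86.7, "Grok 4": 82.4},
}


def summarize(scores):
    """Return (leader, spread) for one domain: top model and max-minus-min score."""
    leader = max(scores, key=scores.get)
    spread = round(max(scores.values()) - min(scores.values()), 1)
    return leader, spread


SUMMARY = {domain: summarize(scores) for domain, scores in DOMAIN_SCORES.items()}

for domain, (leader, spread) in SUMMARY.items():
    print(f"{domain}: leader {leader}, spread {spread} points")
```

On these numbers, Business · Strategy has the smallest spread and Healthcare the largest, which matches the notes on those two cards.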

Where AI is least reliable.

Cross-model disagreement rate by pair. High-disagreement cells (above 24%, shown in red) are where you need verification most. No single model lab can produce this data, because each lab sees only its own output.

          CLAUDE  GPT-4o  GEMINI  GROK  DEEPSEEK
CLAUDE    —       7%      9%      31%   14%
GPT-4o    7%      —       10%     28%   16%
GEMINI    9%      10%     —       19%   17%
GROK      31%     28%     19%     —     24%
DEEPSEEK  14%     16%     17%     24%   —

Low < 12% · Moderate 12–24% · High > 24%
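
The legend thresholds can be applied to the pairwise rates mechanically. The sketch below is illustrative only: `band` and `mean_disagreement` are hypothetical helper names, the rates are the unique pairs from the matrix above, and averaging a model's rates across its four pairings is an assumption made for this example rather than a statistic the Index publishes.

```python
# Pairwise disagreement rates (percent), one entry per unordered pair in the matrix.
DISAGREEMENT = {
    ("Claude", "GPT-4o"): 7, ("Claude", "Gemini"): 9,
    ("Claude", "Grok"): 31, ("Claude", "DeepSeek"): 14,
    ("GPT-4o", "Gemini"): 10, ("GPT-4o", "Grok"): 28,
    ("GPT-4o", "DeepSeek"): 16, ("Gemini", "Grok"): 19,
    ("Gemini", "DeepSeek"): 17, ("Grok", "DeepSeek"): 24,
}


def band(rate):
    """Map a rate to the legend bands: Low < 12, Moderate 12-24, High > 24."""
    if rate < 12:
        return "Low"
    if rate <= 24:
        return "Moderate"
    return "High"


def mean_disagreement(model):
    """Average a model's disagreement rate over every pair it appears in."""
    rates = [rate for pair, rate in DISAGREEMENT.items() if model in pair]
    return sum(rates) / len(rates)


# Pairs that fall in the High band, in matrix order.
HIGH_PAIRS = [pair for pair, rate in DISAGREEMENT.items() if band(rate) == "High"]

print("High-disagreement pairs:", HIGH_PAIRS)
for model in ["Claude", "GPT-4o", "Gemini", "Grok", "DeepSeek"]:
    print(f"{model}: mean pairwise disagreement {mean_disagreement(model):.2f}%")
```

Both High-band pairs involve Grok, and the Grok · DeepSeek pair at 24% sits exactly on the upper edge of Moderate.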

Who's getting better. Who isn't.

[Chart: Survival score by quarter, Q3 2025 → Q2 2026. Series: Claude, GPT-4o, Gemini, Grok, DeepSeek. Y-axis: survival score, ticks from 77 to 97. X-axis: Q3 '25, Q4 '25, Q1 '26, Q2 '26.]

How the Index is computed. In plain text.

Adversarial survival scoring

A claim "survives" when it withstands adversarial challenge from the four other panel members across one or more debate rounds. The survival score is a weighted combination of initial agreement, position stability through debate, and strength of cited evidence at convergence, minus penalties for rounds required and concessions made.

Normalized 0–100. The score is computed only from observed panel runs. No fixed test set. No model can submit its own score. Full methodology →

What the Index is not

  • Not a softmax probability. Self-reported model confidence is excluded entirely.
  • Not a static benchmark. No fixed test set. The score reflects production verification workloads.
  • Not vendor-supplied. No model provider can submit data. Every datapoint comes from observed BRIDGE panel runs.
  • Not gameable. Models can't see the panel context they're competing against, because debate evidence is structurally anonymized.
  • Not retrieval-only. Tasks involve generation under adversarial challenge.
survival_score.formula
# Adversarial survival score · weighted combination
survival = w1 · initial_agreement
         + w2 · position_stability_through_debate
         + w3 · evidence_strength_at_convergence
         − w4 · rounds_required
         − w5 · concessions_made

# Minimum sample size for publication: n >= 100 per (model, domain, quarter)
# Weights published in full methodology document
# Independence enforced by structural anonymization at debate time
Download methodology (PDF · 48p) · Failure-mode appendix
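
As a rough illustration of how the weighted combination above could be evaluated, here is a minimal sketch. It is not the Index's implementation: the weight values are placeholders (the real weights are in the methodology document), the three positive inputs are assumed to be scaled to 0..1, and clamping before scaling to 0–100 is an assumed normalization. The names `DebateOutcome`, `survival_score`, and `publishable` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebateOutcome:
    initial_agreement: float    # assumed to be scaled to 0..1
    position_stability: float   # assumed to be scaled to 0..1
    evidence_strength: float    # assumed to be scaled to 0..1
    rounds_required: int
    concessions_made: int


# PLACEHOLDER weights for illustration only; not the published values.
WEIGHTS = {"w1": 0.4, "w2": 0.3, "w3": 0.3, "w4": 0.02, "w5": 0.05}

# Minimum n per (model, domain, quarter), taken from the formula notes.
MIN_SAMPLE = 100


def survival_score(outcome, weights=WEIGHTS):
    """Weighted sum minus penalties, clamped to 0..1 and scaled to 0..100."""
    raw = (
        weights["w1"] * outcome.initial_agreement
        + weights["w2"] * outcome.position_stability
        + weights["w3"] * outcome.evidence_strength
        - weights["w4"] * outcome.rounds_required
        - weights["w5"] * outcome.concessions_made
    )
    return 100.0 * min(1.0, max(0.0, raw))


def publishable(n):
    """Apply the publication threshold n >= 100."""
    return n >= MIN_SAMPLE


example = DebateOutcome(0.9, 0.8, 0.85, rounds_required=2, concessions_made=1)
print(f"survival = {survival_score(example):.1f}, publishable(142) = {publishable(142)}")
```

Under these assumptions, every extra debate round or concession lowers the score, and a cell with fewer than 100 observations would be withheld from publication.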

The Index is free. The data underneath is licensable.

The quarterly summary on this page is published openly. Disaggregated performance data, historical trends, and custom analytical queries are available for research institutions, AI labs, and enterprise teams.

RESEARCH INSTITUTIONS

Research Access

Full disaggregated data, historical trends, anonymized query patterns, co-publication opportunities. For universities, AI safety labs, policy think tanks, government offices.

Apply for Research Access →

ENTERPRISE

Enterprise Intelligence

Custom analytics dashboards. Real-time monitoring. Alerting when model accuracy drops below threshold. Industry-specific intelligence.

See Enterprise Intelligence →

AI LABS · FRONTIER

Lab Partnership

Real-time data feeds. Custom benchmark design. Pre-publication review. Joint research programs with frontier labs. High-touch engagement.

Open conversation →

How to cite the Index.

BRIDGE Protocol. (2026). The BRIDGE Index Q2 2026: Frontier AI Performance Measured by Consensus. Retrieved from https://getbridge.dev/index

Licensed under CC BY-NC 4.0 for academic and non-commercial use. Commercial reuse requires written permission. DOI assignment pending — provisional citation above.