Claude Sonnet 4.5 leads legal analysis for the third consecutive quarter.
97.2 survival score · 3.0 pp lead over GPT-4o · n = 142K
The only benchmark computed from real production workloads across real model panels with real adversarial verification. Not synthetic. Not vendor-supplied. Not gameable.
Largest single-quarter gain across the panel · n = 192K
Twice the rate of legal · 3× the rate of code · n = 68K
Up from #6 in Q1 · +8.4 points over 12 weeks · n = 104K
Up 1.2 pp from Q1 · stable disagreement rate · n = 847K
32 pages. Per-model per-domain breakdowns. Methodology appendix. Drift commentary. Cite-ready.
| RANK | MODEL | SURVIVAL | WIN RATE | DEBATES WON | DISSENT | LATENCY · p50 | TREND Q1→Q2 |
|---|---|---|---|---|---|---|---|
| 01 | Claude Sonnet 4.5 · ANTHROPIC | 96.1 | 71.4% | 184K | 12.4% | 2.3s | ▲ +1.2 |
| 02 | GPT-4o · OPENAI | 93.4 | 65.2% | 192K | 14.7% | 1.9s | ▲ +0.8 |
| 03 | Gemini 2.5 Pro · GOOGLE | 91.0 | 61.8% | 156K | 17.2% | 2.7s | ▼ −0.4 |
| 04 | DeepSeek V3 · DEEPSEEK | 88.7 | 57.9% | 104K | 22.4% | 1.6s | ▲ +2.1 |
| 05 | Grok 4 · xAI | 85.3 | 54.1% | 98K | 38.1% | 2.4s | ● 0.0 |
| 06 | Llama 4 Scout · META | 82.6 | 49.8% | 67K | 24.7% | 1.2s | ▲ +0.5 |
| 07 | Mistral Large 3 · MISTRAL | 79.4 | 46.2% | 34K | 21.6% | 1.8s | ▼ −1.1 |
| 08 | Cohere Command R+ · COHERE | 76.8 | 43.1% | 9K | 27.3% | 2.1s | ▲ +0.3 |
| 09 | Qwen 3 Max · ALIBABA | 73.2 | 38.7% | 0.1K | 31.4% | 1.4s | ▲ +4.7 |
A "best AI" doesn't exist. The best model varies by what you're asking it to do.
Cross-model disagreement rate by pair. Red cells are where you need verification most. This is the data no single model lab can produce — because they see only their own output.
| | CLAUDE | GPT-4o | GEMINI | GROK | DEEPSEEK |
|---|---|---|---|---|---|
| CLAUDE | — | 7% | 9% | 31% | 14% |
| GPT-4o | 7% | — | 10% | 28% | 16% |
| GEMINI | 9% | 10% | — | 19% | 17% |
| GROK | 31% | 28% | 19% | — | 24% |
| DEEPSEEK | 14% | 16% | 17% | 24% | — |
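The matrix above can also be queried directly. The following is a minimal sketch, not part of the published methodology: the rates are copied from the table, while the function names and the 20% cutoff for a "red" cell are assumptions made for illustration (the page does not state the actual cutoff).

```python
from itertools import combinations

MODELS = ["CLAUDE", "GPT-4o", "GEMINI", "GROK", "DEEPSEEK"]

# Pairwise disagreement rates in percent, copied from the matrix above.
# None marks the diagonal (a model is not compared with itself).
RATES = [
    [None, 7, 9, 31, 14],
    [7, None, 10, 28, 16],
    [9, 10, None, 19, 17],
    [31, 28, 19, None, 24],
    [14, 16, 17, 24, None],
]


def pairs_above(threshold_pct):
    """Return (model_a, model_b, rate) for every unordered pair whose
    disagreement rate is at or above threshold_pct, highest rate first."""
    hits = [
        (MODELS[i], MODELS[j], RATES[i][j])
        for i, j in combinations(range(len(MODELS)), 2)
        if RATES[i][j] >= threshold_pct
    ]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


def mean_disagreement(model):
    """Average disagreement rate between one model and the other four."""
    row = RATES[MODELS.index(model)]
    others = [rate for rate in row if rate is not None]
    return sum(others) / len(others)


if __name__ == "__main__":
    # Hypothetical cutoff: treat 20% and above as a "red" cell.
    for a, b, rate in pairs_above(20):
        print(f"{a} vs {b}: {rate}%")
    print(f"GROK mean disagreement: {mean_disagreement('GROK')}%")
```

Under that assumed cutoff, every flagged pair involves GROK, which matches the visual pattern in the table.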
A claim "survives" when it withstands adversarial challenge from the four other panel members across one or more debate rounds. The survival score is a weighted combination of initial agreement, position stability through debate, and strength of cited evidence at convergence, minus penalties for rounds required and concessions made.
Normalized 0–100. The score is computed only from observed panel runs. No fixed test set. No model can submit its own score. Full methodology →
    # Adversarial survival score · weighted combination
    survival = w1 · initial_agreement
             + w2 · position_stability_through_debate
             + w3 · evidence_strength_at_convergence
             − w4 · rounds_required
             − w5 · concessions_made

    # Minimum sample size for publication: n >= 100 per (model, domain, quarter)
    # Weights published in full methodology document
    # Independence enforced by structural anonymization at debate time
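As a rough illustration of the formula above, here is a minimal Python sketch. It rests on several assumptions that this page does not confirm: the weight values are placeholders (the real ones are in the methodology document), all five inputs are taken to be pre-normalized to the range 0 to 1, and the 0–100 normalization is approximated by scaling and clamping. The names `Weights`, `survival_score`, and `publishable` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    # Placeholder values for illustration only; the published weights
    # live in the full methodology document, not on this page.
    w1: float = 0.35  # initial agreement
    w2: float = 0.35  # position stability through debate
    w3: float = 0.30  # evidence strength at convergence
    w4: float = 0.05  # penalty: rounds required
    w5: float = 0.05  # penalty: concessions made


# From the page: n >= 100 per (model, domain, quarter).
MIN_SAMPLE_SIZE = 100


def survival_score(initial_agreement, position_stability, evidence_strength,
                   rounds_required, concessions_made, weights=Weights()):
    """Weighted combination of three positive terms minus two penalties.

    Assumes every input is already normalized to [0, 1]. The result is
    scaled to 0-100 and clamped, which is an assumed normalization.
    """
    raw = (
        weights.w1 * initial_agreement
        + weights.w2 * position_stability
        + weights.w3 * evidence_strength
        - weights.w4 * rounds_required
        - weights.w5 * concessions_made
    )
    return max(0.0, min(100.0, 100.0 * raw))


def publishable(n):
    """True when a (model, domain, quarter) cell meets the minimum sample size."""
    return n >= MIN_SAMPLE_SIZE


if __name__ == "__main__":
    score = survival_score(0.9, 0.8, 0.85, rounds_required=0.2, concessions_made=0.1)
    print(f"survival = {score:.1f}, publishable at n=142: {publishable(142)}")
```

Whatever the real weights are, the structure means a model that wins only after many rounds or many concessions scores lower than one that holds the same position from the start.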
The quarterly summary on this page is published openly. Disaggregated performance data, historical trends, and custom analytical queries are available for research institutions, AI labs, and enterprise teams.
Full disaggregated data, historical trends, anonymized query patterns, co-publication opportunities. For universities, AI safety labs, policy think tanks, government offices.
Apply for Research Access →
Custom analytics dashboards. Real-time monitoring. Alerting when model accuracy drops below threshold. Industry-specific intelligence.
See Enterprise Intelligence →
Real-time data feeds. Custom benchmark design. Pre-publication review. Joint research programs with frontier labs. High-touch engagement.
Open conversation →
BRIDGE Protocol. (2026). The BRIDGE Index Q2 2026: Frontier AI Performance Measured by Consensus. Retrieved from https://getbridge.dev/index
Licensed under CC BY-NC 4.0 for academic and non-commercial use. Commercial reuse requires written permission. DOI assignment pending — provisional citation above.