Where AI fails. Where it doesn't.

Real-world performance data from adversarial verification. No marketing benchmarks. No vendor-supplied numbers. Just every model's score against every other model in the field.

VERIFICATIONS · 7-DAY 847,201
MODELS TRACKED 9
PROVIDERS 9
UPDATED 14s ago

The score is how much a claim survives when four other models try to break it.

RANK | MODEL | PROVIDER | SURVIVAL | WIN RATE | LATENCY (p50) | VERIFICATIONS | TREND
01 | Claude Sonnet 4.5 | ANTHROPIC | 96.1 | 71.4% | 2.3s | 184,032 | ▲ +1.2
02 | GPT-4o | OPENAI | 93.4 | 65.2% | 1.9s | 192,471 | ▲ +0.8
03 | Gemini 2.5 Pro | GOOGLE | 91.0 | 61.8% | 2.7s | 156,089 | ▼ −0.4
04 | DeepSeek V3 | DEEPSEEK | 88.7 | 57.9% | 1.6s | 104,233 | ▲ +2.1
05 | Grok 4 | xAI | 85.3 | 54.1% | 2.4s | 98,447 | ● 0.0
06 | Llama 4 Scout | META | 82.6 | 49.8% | 1.2s | 67,891 | ▲ +0.5
07 | Mistral Large 3 | MISTRAL | 79.4 | 46.2% | 1.8s | 34,182 | ▼ −1.1
08 | Cohere Command R+ | COHERE | 76.8 | 43.1% | 2.1s | 9,734 | ▲ +0.3
09 | Qwen 3 Max | ALIBABA | 73.2 | 38.7% | 1.4s | 122 | ▲ +4.7
Survival score = adversarial agreement weighted by position stability through debate.
Win rate = how often this model's initial position is sustained through debate.
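The win-rate definition above can be illustrated with a minimal sketch. The `Debate` record, its field names, and the sample data are hypothetical; the page does not publish the underlying record format, so this only shows the counting logic implied by "initial position is sustained through debate".

```python
from dataclasses import dataclass


@dataclass
class Debate:
    """One verification, as seen from a single model (hypothetical shape)."""
    model: str
    initial_position: str  # the model's position before debate
    final_position: str    # the position the panel settled on after debate


def win_rate(debates, model):
    """Fraction of a model's debates where its initial position was sustained."""
    own = [d for d in debates if d.model == model]
    if not own:
        return 0.0
    sustained = sum(d.initial_position == d.final_position for d in own)
    return sustained / len(own)


# Illustrative data only, not taken from the leaderboard.
debates = [
    Debate("model-a", "valid", "valid"),
    Debate("model-a", "valid", "invalid"),
    Debate("model-a", "invalid", "invalid"),
    Debate("model-b", "valid", "invalid"),
]

print(f"{win_rate(debates, 'model-a'):.1%}")  # 2 of 3 positions sustained
```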

Different models. Different competencies.

Contracts

142K verifications
  • Claude 4.5: 97.2
  • GPT-4o: 94.1
  • Gemini 2.5: 88.4
  • DeepSeek V3: 87.1
  • Grok 4: 79.6
Insight: Claude leads on US case-law citations; GPT-4o leads on EU framework alignment.

Code Review

214K verifications
  • DeepSeek V3: 94.8
  • Claude 4.5: 93.2
  • GPT-4o: 90.7
  • Gemini 2.5: 84.3
  • Grok 4: 81.5
Insight: DeepSeek dominates on security findings; Claude best on architecture reasoning.

Claims

87K verifications
  • Claude 4.5: 93.7
  • Gemini 2.5: 91.6
  • GPT-4o: 88.9
  • DeepSeek V3: 81.2
  • Grok 4: 77.4
Insight: Grok is the most willing to challenge claims — high dissent rate, lower agreement.

Business

156K verifications
  • GPT-4o: 92.0
  • Claude 4.5: 90.4
  • Gemini 2.5: 88.1
  • Grok 4: 85.7
  • DeepSeek V3: 82.8
Insight: All models within 10 points — business analysis is the most consensus-friendly domain.

Healthcare

68K verifications
  • Claude 4.5: 95.1
  • Gemini 2.5: 90.8
  • GPT-4o: 88.2
  • DeepSeek V3: 78.1
  • Grok 4: 73.9
Insight: Largest model gap of any domain (21 points). Claude is the runaway leader.

Compliance

180K verifications
  • Gemini 2.5: 96.4
  • Claude 4.5: 94.8
  • GPT-4o: 90.3
  • DeepSeek V3: 86.7
  • Grok 4: 82.4
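
The spread claims in the domain insights (all Business scores within 10 points, a 21-point gap in Healthcare) can be rechecked from the listed scores. The sketch below is only a sanity check: the dictionary and function names are illustrative, and the numbers are copied from the six domain lists above.

```python
# Per-domain survival scores as listed on this page.
DOMAIN_SCORES = {
    "Contracts":   {"Claude 4.5": 97.2, "GPT-4o": 94.1, "Gemini 2.5": 88.4,
                    "DeepSeek V3": 87.1, "Grok 4": 79.6},
    "Code Review": {"DeepSeek V3": 94.8, "Claude 4.5": 93.2, "GPT-4o": 90.7,
                    "Gemini 2.5": 84.3, "Grok 4": 81.5},
    "Claims":      {"Claude 4.5": 93.7, "Gemini 2.5": 91.6, "GPT-4o": 88.9,
                    "DeepSeek V3": 81.2, "Grok 4": 77.4},
    "Business":    {"GPT-4o": 92.0, "Claude 4.5": 90.4, "Gemini 2.5": 88.1,
                    "Grok 4": 85.7, "DeepSeek V3": 82.8},
    "Healthcare":  {"Claude 4.5": 95.1, "Gemini 2.5": 90.8, "GPT-4o": 88.2,
                    "DeepSeek V3": 78.1, "Grok 4": 73.9},
    "Compliance":  {"Gemini 2.5": 96.4, "Claude 4.5": 94.8, "GPT-4o": 90.3,
                    "DeepSeek V3": 86.7, "Grok 4": 82.4},
}


def spread(scores):
    """Gap between the best and worst model in one domain, in points."""
    return round(max(scores.values()) - min(scores.values()), 1)


spreads = {domain: spread(scores) for domain, scores in DOMAIN_SCORES.items()}
widest = max(spreads, key=spreads.get)
narrowest = min(spreads, key=spreads.get)

print(widest, spreads[widest])        # Healthcare 21.2
print(narrowest, spreads[narrowest])  # Business 9.2
```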

Where models split. Where they don't.

Red = high disagreement, green = consistent. This is where the value of multi-model verification lives.

         | CLAUDE | GPT-4o | GEMINI | GROK | DEEPSEEK
CLAUDE   | -      | 7%     | 9%     | 31%  | 14%
GPT-4o   | 7%     | -      | 10%    | 28%  | 16%
GEMINI   | 9%     | 10%    | -      | 19%  | 17%
GROK     | 31%    | 28%    | 19%    | -    | 24%
DEEPSEEK | 14%    | 16%    | 17%    | 24%  | -
Legend: Low < 12% · Moderate 12–24% · High > 24%
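
The legend thresholds can be applied to the pairwise matrix mechanically. This is a minimal sketch, assuming the matrix is symmetric (as the figures above suggest) so each pair is stored once; the names `DISAGREEMENT` and `band` are illustrative.

```python
# Pairwise disagreement rates (percent), one entry per unordered pair,
# copied from the matrix above.
DISAGREEMENT = {
    ("Claude", "GPT-4o"): 7,
    ("Claude", "Gemini"): 9,
    ("Claude", "Grok"): 31,
    ("Claude", "DeepSeek"): 14,
    ("GPT-4o", "Gemini"): 10,
    ("GPT-4o", "Grok"): 28,
    ("GPT-4o", "DeepSeek"): 16,
    ("Gemini", "Grok"): 19,
    ("Gemini", "DeepSeek"): 17,
    ("Grok", "DeepSeek"): 24,
}


def band(pct):
    """Map a disagreement percentage to the legend's bands."""
    if pct < 12:
        return "Low"
    if pct <= 24:
        return "Moderate"
    return "High"


high_pairs = [pair for pair, pct in DISAGREEMENT.items() if band(pct) == "High"]
print(high_pairs)  # [('Claude', 'Grok'), ('GPT-4o', 'Grok')]
```

Under these thresholds both high-disagreement pairs involve Grok, which is consistent with the dissent figures reported elsewhere on this page.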

Who's getting better. Who isn't.

SURVIVAL SCORE · WEEKLY · 12W
[Chart: weekly survival score over 12 weeks (W1 to W12) for Claude, GPT-4o, Gemini, Grok, and DeepSeek; y-axis ticks from 77 to 97.]

Patterns the data revealed this week.

REGRESSION · May 24, 2026

Gemini's compliance accuracy dropped 4 points this week.

Following a vendor model refresh, Gemini 2.5 Pro's score on EU framework citations fell from 96.8 to 92.4. Disagreement rate with Claude on GDPR Art. 35 nearly doubled. We've flagged this with Google's API team.

n = 12,847 verifications · confidence 0.97
CONSENSUS · May 22, 2026

Claude and DeepSeek agree 97% on code — but only 81% on medical.

Cross-domain agreement varies more than aggregate scores suggest. The strongest correlation: both models trained heavily on open-source code. The weakest: medical training data diverges by training cohort.

n = 38,492 cross-domain verifications
CHALLENGER · May 19, 2026

Grok challenges consensus more than any other model.

Grok 4 has a 38% dissent rate — twice the panel average. Of those dissents, it concedes 64% of the time within two rounds. This makes it a structurally valuable panel member despite its lower aggregate score.

n = 98,447 verifications · debate rate 38%
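
The two Grok figures above combine into shares of all verifications with simple arithmetic. A small worked sketch, assuming the 64% concession rate applies to dissents only and that "twice the panel average" is meant literally; the variable names are illustrative.

```python
dissent_rate = 0.38            # share of verifications where Grok dissents
concede_given_dissent = 0.64   # share of those dissents conceded within two rounds

# Share of ALL verifications that end in a Grok concession within two rounds.
conceded = dissent_rate * concede_given_dissent

# Share of ALL verifications where Grok has not conceded after two rounds.
not_conceded = dissent_rate * (1 - concede_given_dissent)

# "Twice the panel average" implies a panel-average dissent rate of about half.
implied_panel_average = dissent_rate / 2

print(f"{conceded:.1%} conceded, {not_conceded:.1%} not conceded, "
      f"panel average about {implied_panel_average:.0%}")
```

On these assumptions, roughly 14% of Grok's verifications involve a dissent that it still holds after two rounds.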
EMERGENCE · May 16, 2026

DeepSeek V3 is closing the gap on Western models faster than any other entrant.

Survival score gain of 8.4 points over 12 weeks — the largest improvement in our tracking history. Notable strength: security-focused code review, where it now leads the panel.

n = 104,233 verifications · latest weekly trend +2.1
FAILURE MODE · May 13, 2026

All five models fail similarly on novel jurisdiction interactions.

When contract clauses span multiple unfamiliar regulatory regimes (e.g., crypto + EU + APAC), agreement increases while accuracy falls. This is the most dangerous panel state — high consensus, low survival under expert audit.

n = 2,309 multi-regime cases · flagged for manual review
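
One way to act on this failure mode is to route high-consensus, multi-regime cases to manual review instead of trusting the agreement. The sketch below is an assumption-heavy illustration, not the BRIDGE flagging rule: the `PanelResult` shape, the regime-count proxy for "unfamiliar" regimes, and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PanelResult:
    """Outcome of one panel run (hypothetical shape)."""
    case_id: str
    regimes: tuple    # regulatory regimes the clause touches
    agreement: float  # fraction of the panel in agreement, 0..1


def needs_manual_review(result, min_regimes=3, agreement_threshold=0.9):
    """Flag cases where consensus is high but several regimes interact.

    Both thresholds are placeholders; high agreement is treated as a
    warning sign here because it coincides with falling accuracy.
    """
    multi_regime = len(result.regimes) >= min_regimes
    high_consensus = result.agreement >= agreement_threshold
    return multi_regime and high_consensus


cases = [
    PanelResult("c-1", ("crypto", "EU", "APAC"), 0.96),
    PanelResult("c-2", ("EU",), 0.97),
    PanelResult("c-3", ("crypto", "EU", "APAC"), 0.60),
]
flagged = [c.case_id for c in cases if needs_manual_review(c)]
print(flagged)  # ['c-1']
```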
LATENCY · May 10, 2026

Latency gap between fastest and slowest model has narrowed to 1.5s.

Six months ago this was over 4s. Today, p50 latency across the panel ranges from 1.2s (Llama 4 Scout) to 2.7s (Gemini 2.5 Pro). Tier 2 verifications consistently complete under 5s end-to-end.

How we compute the score. In one paragraph.

The score

Adversarial survival is computed as a weighted combination of (1) initial panel agreement on a finding, (2) position stability through debate, (3) strength of cited evidence at convergence, minus penalties for (4) rounds required to reach consensus and (5) concessions made under adversarial pressure.

The result is normalized 0–100. A score of 95+ means the model produces claims that survive aggressive challenge from four other frontier models with minimal concession. A score of 80 means the model produces useful claims but defends them less robustly.
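
The shape of that computation can be sketched as follows. This is not the production formula: the weights, penalty sizes, the choice to leave the first debate round unpenalized, and the clamping step are all assumptions made for illustration, since the page lists the five components but not their coefficients.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Inputs for one model, each positive signal scaled to 0..1 (hypothetical)."""
    agreement: float  # (1) initial panel agreement on a finding
    stability: float  # (2) position stability through debate
    evidence: float   # (3) strength of cited evidence at convergence
    rounds: int       # (4) rounds required to reach consensus
    concessions: int  # (5) concessions made under adversarial pressure


# Hypothetical coefficients; the positive weights sum to 1.
WEIGHTS = {"agreement": 0.40, "stability": 0.35, "evidence": 0.25}
ROUND_PENALTY = 0.02       # per round beyond the first (assumption)
CONCESSION_PENALTY = 0.05  # per concession (assumption)


def survival_score(s):
    """Weighted positives minus penalties, clamped and scaled to 0..100."""
    positive = (WEIGHTS["agreement"] * s.agreement
                + WEIGHTS["stability"] * s.stability
                + WEIGHTS["evidence"] * s.evidence)
    penalty = (ROUND_PENALTY * max(s.rounds - 1, 0)
               + CONCESSION_PENALTY * s.concessions)
    raw = positive - penalty
    return round(100 * min(max(raw, 0.0), 1.0), 1)


print(survival_score(Signals(1.0, 1.0, 1.0, rounds=1, concessions=0)))  # 100.0
print(survival_score(Signals(0.9, 0.8, 0.7, rounds=3, concessions=1)))  # 72.5
```

Clamping before scaling keeps heavily penalized runs at 0 rather than negative, which is one simple way to satisfy the stated 0–100 normalization.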

What it isn't

  • Not a softmax probability. Self-reported confidence numbers from models are excluded entirely.
  • Not a static benchmark. No fixed test set. The score reflects production verifications, weighted to match the public distribution of BRIDGE workloads.
  • Not vendor-supplied. No model provider can submit scores. Every datapoint comes from observed BRIDGE panel runs.
  • Not gameable. Models can't see the panel context they're competing against, and debate evidence is structured, not free-form.