Real-world performance data from adversarial verification. No marketing benchmarks. No vendor-supplied numbers. Just every model's score against every other model in the field.
| RANK | MODEL | SURVIVAL | WIN RATE | LATENCY · p50 | VERIFICATIONS | TREND |
|---|---|---|---|---|---|---|
| 01 | Claude Sonnet 4.5 (Anthropic) | 96.1 | 71.4% | 2.3s | 184,032 | ▲ +1.2 |
| 02 | GPT-4o (OpenAI) | 93.4 | 65.2% | 1.9s | 192,471 | ▲ +0.8 |
| 03 | Gemini 2.5 Pro (Google) | 91.0 | 61.8% | 2.7s | 156,089 | ▼ −0.4 |
| 04 | DeepSeek V3 (DeepSeek) | 88.7 | 57.9% | 1.6s | 104,233 | ▲ +2.1 |
| 05 | Grok 4 (xAI) | 85.3 | 54.1% | 2.4s | 98,447 | ● 0.0 |
| 06 | Llama 4 Scout (Meta) | 82.6 | 49.8% | 1.2s | 67,891 | ▲ +0.5 |
| 07 | Mistral Large 3 (Mistral) | 79.4 | 46.2% | 1.8s | 34,182 | ▼ −1.1 |
| 08 | Cohere Command R+ (Cohere) | 76.8 | 43.1% | 2.1s | 9,734 | ▲ +0.3 |
| 09 | Qwen 3 Max (Alibaba) | 73.2 | 38.7% | 1.4s | 122 | ▲ +4.7 |
Red = high disagreement, green = consistent. This is where the value of multi-model verification lives.
| | CLAUDE | GPT-4o | GEMINI | GROK | DEEPSEEK |
|---|---|---|---|---|---|
| CLAUDE | — | 7% | 9% | 31% | 14% |
| GPT-4o | 7% | — | 10% | 28% | 16% |
| GEMINI | 9% | 10% | — | 19% | 17% |
| GROK | 31% | 28% | 19% | — | 24% |
| DEEPSEEK | 14% | 16% | 17% | 24% | — |
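One way to read the matrix above is to average each model's row, which gives a single disagreement figure per model against the rest of the panel. The sketch below is illustrative only: it assumes each cell is a pairwise disagreement rate in percent, and the names `MODELS`, `DISAGREEMENT`, `mean_disagreement`, and `is_symmetric` are hypothetical helpers, not part of any published tooling.

```python
# Pairwise disagreement rates (percent), copied from the table above.
# None marks the diagonal, where a model is compared with itself.
MODELS = ["CLAUDE", "GPT-4o", "GEMINI", "GROK", "DEEPSEEK"]
DISAGREEMENT = [
    [None, 7, 9, 31, 14],
    [7, None, 10, 28, 16],
    [9, 10, None, 19, 17],
    [31, 28, 19, None, 24],
    [14, 16, 17, 24, None],
]


def mean_disagreement(matrix):
    """Return each row's mean disagreement with the other panel members."""
    means = []
    for row in matrix:
        peers = [value for value in row if value is not None]
        means.append(sum(peers) / len(peers))
    return means


def is_symmetric(matrix):
    """Check that cell (i, j) equals cell (j, i) for every pair."""
    size = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(size) for j in range(size))


# Print models from most to least contrarian.
ranked = sorted(zip(MODELS, mean_disagreement(DISAGREEMENT)), key=lambda pair: pair[1], reverse=True)
for model, mean in ranked:
    print(f"{model:<9}{mean:6.2f}%")
```

On the figures in the table, Grok's row averages 25.5%, the highest on the panel, and Gemini's averages 13.75%, the lowest. A plain row mean weights every pairing equally; weighting by the number of shared verifications would need data the table does not show.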
Following a vendor model refresh, Gemini 2.5 Pro's score on EU framework citations fell from 96.8 to 92.4. Disagreement rate with Claude on GDPR Art. 35 nearly doubled. We've flagged this with Google's API team.
Cross-domain agreement varies more than aggregate scores suggest. The strongest correlation: both models trained heavily on open-source code. The weakest: medical training data diverges by training cohort.
Grok 4 has a 38% dissent rate — twice the panel average. Of those dissents, it concedes 64% of the time within two rounds. This makes it a structurally valuable panel member despite its lower aggregate score.
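The dissent figures can be combined with simple arithmetic. The sketch below assumes the 38% dissent rate is measured per verification and that the 64% concession share applies uniformly across all dissents; neither is stated outright in the text, and `dissent_breakdown` is a hypothetical helper used only for illustration.

```python
def dissent_breakdown(dissent_rate, concede_share):
    """Split a dissent rate into dissents conceded within two rounds and the rest.

    Both arguments are fractions in [0, 1]. The second returned value covers
    dissents that were not conceded within two rounds, whether they were held
    to the end or conceded later.
    """
    conceded = dissent_rate * concede_share
    not_conceded = dissent_rate - conceded
    return conceded, not_conceded


GROK_DISSENT_RATE = 0.38   # from the text
GROK_CONCEDE_SHARE = 0.64  # from the text

conceded, not_conceded = dissent_breakdown(GROK_DISSENT_RATE, GROK_CONCEDE_SHARE)
# "Twice the panel average" implies a panel average of about half Grok's rate.
implied_panel_average = GROK_DISSENT_RATE / 2

print(f"conceded within two rounds: {conceded:.1%}")
print(f"not conceded in two rounds: {not_conceded:.1%}")
print(f"implied panel average:      {implied_panel_average:.1%}")
```

Under these assumptions, roughly 24% of verifications involve a Grok dissent that is withdrawn within two rounds, and roughly 14% involve a dissent that persists past round two.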
Survival score gain of 8.4 points over 12 weeks — the largest improvement in our tracking history. Notable strength: security-focused code review, where it now leads the panel.
When contract clauses span multiple unfamiliar regulatory regimes (e.g., crypto + EU + APAC), agreement increases while accuracy falls. This is the most dangerous panel state — high consensus, low survival under expert audit.
Six months ago this was over 4s. Today, p50 latency across the panel ranges from 1.2s (Llama 4 Scout) to 2.7s (Gemini 2.5 Pro). Tier 2 verifications consistently complete under 5s end-to-end.
Adversarial survival is computed as a weighted combination of (1) initial panel agreement on a finding, (2) position stability through debate, and (3) strength of cited evidence at convergence, minus penalties for (4) rounds required to reach consensus and (5) concessions made under adversarial pressure.
The result is normalized 0–100. A score of 95+ means the model produces claims that survive aggressive challenge from four other frontier models with minimal concession. A score of 80 means the model produces useful claims but defends them less robustly.
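The scoring description above can be sketched in code. This is a minimal illustration, not the production formula: the text names the five components but gives no weights, so the values in `WEIGHTS` and `PENALTIES` are placeholders, each component is assumed to be pre-scaled to the range 0 to 1, and min-max scaling is only one plausible reading of "normalized 0–100".

```python
from dataclasses import dataclass

# Placeholder weights: the source does not publish the real values.
WEIGHTS = {"agreement": 0.40, "stability": 0.35, "evidence": 0.25}
PENALTIES = {"rounds": 0.10, "concessions": 0.15}


@dataclass
class Components:
    """Per-model inputs, each assumed to be pre-scaled to [0, 1]."""
    agreement: float    # (1) initial panel agreement on a finding
    stability: float    # (2) position stability through debate
    evidence: float     # (3) strength of cited evidence at convergence
    rounds: float       # (4) rounds required to reach consensus
    concessions: float  # (5) concessions made under adversarial pressure


def survival_score(c: Components) -> float:
    """Weighted rewards minus weighted penalties, min-max scaled to 0-100."""
    for name, value in vars(c).items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    reward = sum(weight * getattr(c, name) for name, weight in WEIGHTS.items())
    penalty = sum(weight * getattr(c, name) for name, weight in PENALTIES.items())
    raw = reward - penalty
    lowest = -sum(PENALTIES.values())  # all rewards 0, all penalties 1
    highest = sum(WEIGHTS.values())    # all rewards 1, all penalties 0
    return 100.0 * (raw - lowest) / (highest - lowest)


example = Components(agreement=0.9, stability=0.85, evidence=0.8, rounds=0.2, concessions=0.1)
print(f"survival score: {survival_score(example):.1f}")
```

With min-max scaling, a model that scores perfectly on the three positive components and incurs no penalties reaches 100, and the reverse case reaches 0. A different normalization, such as clipping the raw value, would change where intermediate scores land.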