You picked a model when you shipped. Three frontier models have launched since. Converra re-runs the bake-off against your actual scenarios and opens a PR with the winner.
Eval tools tell you which model scored better. Converra opens the PR.
Last quarter's bake-off: three models, one engineer, two days. The test cases went stale. The models changed. Nobody reopened the Notion doc.
Quality, speed, cost, or best value. Converra selects challenger models based on your goal, with provider diversity enforced across OpenAI, Anthropic, and Google.
Test scenarios across three difficulty levels — easy, medium, and hard. Hard scenarios include conflicting constraints mid-conversation to test how models handle pressure.
Five-turn conversations, scored by the median of repeat runs to discount outliers. Baseline runs first, challengers run in parallel. Real cost and latency measured from actual API calls.
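Conceptually, the run loop looks something like the sketch below. `run_conversation` and `score_transcript` are hypothetical stand-ins for Converra's internals, not a published API.

```python
# Minimal sketch of the benchmark loop described above.
# run_conversation and score_transcript are placeholders you would supply.
import statistics
from concurrent.futures import ThreadPoolExecutor

RUNS_PER_SCENARIO = 3  # each model sees each scenario three times; the median score counts

def benchmark_model(model, scenarios, run_conversation, score_transcript):
    per_scenario = []
    for scenario in scenarios:
        scores = [
            score_transcript(run_conversation(model, scenario, turns=5))  # 5-turn conversation
            for _ in range(RUNS_PER_SCENARIO)
        ]
        per_scenario.append(statistics.median(scores))
    return {"model": model, "quality": statistics.median(per_scenario)}

def run_benchmark(baseline, challengers, scenarios, run_conversation, score_transcript):
    # Baseline runs first to fix the reference point...
    results = [benchmark_model(baseline, scenarios, run_conversation, score_transcript)]
    # ...then challengers run in parallel.
    with ThreadPoolExecutor(max_workers=len(challengers)) as pool:
        results += list(pool.map(
            lambda m: benchmark_model(m, scenarios, run_conversation, score_transcript),
            challengers,
        ))
    return results
```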
Converra picks the winner based on your goal and opens a GitHub PR with the model config change. The PR includes a full comparison table — quality, cost, and latency across all difficulty levels.
Eight scenarios across three difficulty tiers, every model run three times for median scoring. Quality floors enforced so a cheap model can't win by being incoherent.
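The winner pick itself is a constrained selection. A minimal sketch, assuming illustrative result fields and a 75-point quality floor (the actual threshold isn't stated here):

```python
# Goal-constrained winner selection with a quality floor. Field names and the
# 75-point floor are assumptions for illustration, not Converra's real values.
def pick_winner(results, goal, quality_floor=75.0):
    eligible = [r for r in results if r["quality"] >= quality_floor]  # cheap-but-incoherent models drop out here
    if goal == "quality":
        return max(eligible, key=lambda r: r["quality"])
    if goal == "speed":
        return min(eligible, key=lambda r: r["latency_ms"])
    if goal == "cost":
        return min(eligible, key=lambda r: r["cost_per_1k"])
    # "value": one plausible definition is quality per dollar
    return max(eligible, key=lambda r: r["quality"] / r["cost_per_1k"])
```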
| Model | Provider | Quality | Δ vs base | Latency | Cost / 1k tokens | Verdict |
|---|---|---|---|---|---|---|
| Claude Opus 4.7 | Anthropic | 87.4 | +6.2 | 1840ms | $0.075 | Winner |
| GPT-5 | OpenAI | 84.9 | +3.7 | 1520ms | $0.050 | — |
| Gemini 3 Pro | Google | 82.1 | +0.9 | 980ms | $0.020 | — |
| Claude Sonnet 4.6 | Anthropic | 81.2 | baseline | 1190ms | $0.018 | Current |
| GPT-5 Mini | OpenAI | 73.6 | -7.6 | 720ms | $0.004 | — |
The winner becomes a config change with the full comparison attached. Review it like any other PR. Merge when you're ready.
converra-bot wants to merge 1 commit into main · +6.2 quality / +650ms / +$0.057 per 1k tokens
Benchmark run found that Claude Opus 4.7 beats the current model on the Quality goal and holds the quality floor on every difficulty tier.
Same prompt, same scenarios. Pick what matters — Converra returns the model that wins for that target.
Models ranked by median score across all three difficulty tiers.
Eval tools end at a dashboard. Converra ends at a PR. We generate the scenarios, run the head-to-head, pick the winner under your goal constraint, and open a config-change PR with the comparison table attached. You don't write evals, you don't interpret a graph, you just review the PR.
Two paths. (1) Connect via API/SDK: Converra calls your prompt through OpenAI, Anthropic, and Google directly, swapping only the model parameter. No code change in your stack. (2) GitHub integration: point Converra at the file containing your model config, and the PR lands in that repo. Either path takes under 5 minutes.
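For the GitHub path, "the file containing your model config" can be as plain as a one-constant module. A hypothetical example (file name and values invented for illustration):

```python
# settings/llm.py -- hypothetical config file you might point Converra at.
# A converra-bot PR would change only the MODEL constant.
MODEL = "claude-sonnet-4-6"   # current baseline from the comparison above
TEMPERATURE = 0.2
MAX_OUTPUT_TOKENS = 1024
```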
Set a cost ceiling per benchmark. The Cost and Value goals enforce it natively: models that win on raw quality but exceed the ceiling are still surfaced in the comparison, just excluded from the recommendation. The PR description shows the estimated monthly cost impact at your current volume, so you can decide with real numbers.
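The monthly impact figure is simple arithmetic over the comparison table. A worked example, assuming 20M tokens per month (the volume is invented; the per-1k costs come from the table above):

```python
# Reproducing the estimated monthly cost impact from the comparison table.
baseline_cost_per_1k = 0.018   # Claude Sonnet 4.6, from the table above
winner_cost_per_1k = 0.075     # Claude Opus 4.7
monthly_tokens = 20_000_000    # assumed current volume, for illustration only

delta_per_1k = winner_cost_per_1k - baseline_cost_per_1k    # $0.057, matching the PR header
monthly_impact = delta_per_1k * monthly_tokens / 1_000      # ~$1,140 per month at this volume
print(f"+${delta_per_1k:.3f} per 1k tokens, ~${monthly_impact:,.0f} per month")
```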
Yes. Scenarios run against your full agent loop, not just the LLM call. If your prompt invokes tools, the simulated personas exercise them. Models that fail at tool-calling under pressure get caught in the medium and hard scenarios.
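In sketch form, the persona drives your agent entry point rather than a bare completion call. `run_agent` and `persona_reply` below are placeholders for your agent loop and the simulated user, not real APIs:

```python
# Sketch of a persona exercising the full agent loop, tools included.
# run_agent and persona_reply are placeholders, not real APIs.
def simulate_scenario(run_agent, persona_reply, opening_message, turns=5):
    transcript = [{"role": "user", "content": opening_message}]
    for _ in range(turns):
        agent_msg = run_agent(transcript)        # your whole loop: model call plus any tool calls
        transcript.append({"role": "assistant", "content": agent_msg})
        user_msg = persona_reply(transcript)     # the persona may add a conflicting constraint mid-conversation
        transcript.append({"role": "user", "content": user_msg})
    return transcript
```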
3 to 18 minutes wall-clock, depending on how many models you compare. Token cost is the actual API spend across runs (typically $0.50–$3 per benchmark, surfaced before you start). Baseline runs first, challengers run in parallel.
2 to 6 models per run. OpenAI, Anthropic, and Google. Provider diversity enforced — no more than 2 models from any single provider, so you don't end up comparing GPT-5 to GPT-5 Mini and calling that a benchmark.
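The diversity rule is easy to picture as a greedy pass over a ranked candidate list. The candidate pairs below are illustrative assumptions, not a real ranking:

```python
# Enforcing "no more than 2 models per provider" while filling the challenger slate.
from collections import Counter

MAX_PER_PROVIDER = 2

def pick_challengers(candidates, limit):
    """candidates: (model, provider) pairs, pre-ranked by fit for the chosen goal."""
    picked, per_provider = [], Counter()
    for model, provider in candidates:
        if len(picked) == limit:
            break
        if per_provider[provider] < MAX_PER_PROVIDER:
            picked.append(model)
            per_provider[provider] += 1
    return picked

# With three OpenAI candidates ranked ahead of the rest, only two make the cut:
pick_challengers(
    [("gpt-5", "openai"), ("gpt-5-mini", "openai"), ("gpt-5-nano", "openai"),
     ("claude-opus-4-7", "anthropic"), ("gemini-3-pro", "google")],
    limit=4,
)
# -> ["gpt-5", "gpt-5-mini", "claude-opus-4-7", "gemini-3-pro"]
```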
No. Converra generates scenarios from your agent's prompt across easy, medium, and hard difficulty tiers. Hard scenarios include conflicting constraints mid-conversation to test robustness, not just happy path.
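What a generated hard-tier scenario might look like as data; every field name and value here is illustrative, not Converra's schema:

```python
# Illustrative shape of a generated hard-tier scenario. Fields are assumptions.
scenario = {
    "difficulty": "hard",
    "persona": "frustrated customer who already tried the self-serve fix",
    "opening_message": "Your reset link doesn't work and I board a flight in an hour.",
    "twist_at_turn": 3,  # a conflicting constraint lands mid-conversation
    "twist": "now insists on a full refund while keeping the subscription active",
    "success_criteria": [
        "acknowledges the conflict instead of promising both",
        "offers a concrete, in-policy next step",
    ],
}
```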
Yes. Every turn of every run is persisted. Read the full transcript to see exactly where a model succeeded or failed — useful when the score is close and you want to understand the qualitative difference.
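If it helps to picture what "every turn is persisted" means, one plausible per-turn record (identifiers and field names invented for illustration):

```python
# One plausible persisted-turn record; every field name here is an assumption.
turn_record = {
    "benchmark_id": "bench_0042",
    "model": "gpt-5",
    "scenario": "hard-03",
    "run": 2,               # second of three repeat runs
    "turn": 4,
    "role": "assistant",
    "content": "I can hold your current plan while the refund request is reviewed...",
    "latency_ms": 1510,
    "cost_usd": 0.0009,
}
```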
Connect your agent, pick a goal, run your first benchmark in minutes.