Model Benchmarks

Is your agent still on the right model?

You picked a model when you shipped. Three frontier models have launched since. Converra re-runs the bake-off against your actual scenarios and opens a PR with the winner.

Eval tools tell you which model scored better. Converra opens the PR.

Quality · Speed · Cost · Value

You've done this once. In a spreadsheet.

Three models, last quarter, one engineer for two days. The test cases went stale. The models changed. The Notion doc nobody reopened.

From prompt to PR in under 20 minutes

Step 1

Pick a goal

Quality, speed, cost, or best value. Converra selects challenger models based on your goal, with provider diversity enforced across OpenAI, Anthropic, and Google.
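A rough sketch of what goal-based challenger selection with a provider-diversity cap could look like. The two-per-provider limit matches the FAQ below; the catalog structure, field names, and prior scores (copied from the comparison table further down purely for illustration) are assumptions, not Converra's actual internals.

```python
from collections import Counter

# Illustrative candidate catalog; quality/latency/cost priors are placeholders
# (here reusing the comparison-table numbers from this page).
CATALOG = [
    {"model": "claude-opus-4-7",   "provider": "anthropic", "quality": 87.4, "cost_per_1k": 0.075, "latency_ms": 1840},
    {"model": "gpt-5",             "provider": "openai",    "quality": 84.9, "cost_per_1k": 0.050, "latency_ms": 1520},
    {"model": "gemini-3-pro",      "provider": "google",    "quality": 82.1, "cost_per_1k": 0.020, "latency_ms": 980},
    {"model": "claude-sonnet-4-6", "provider": "anthropic", "quality": 81.2, "cost_per_1k": 0.018, "latency_ms": 1190},
    {"model": "gpt-5-mini",        "provider": "openai",    "quality": 73.6, "cost_per_1k": 0.004, "latency_ms": 720},
]

# Each goal ranks candidates differently.
GOAL_KEYS = {
    "quality": lambda m: -m["quality"],
    "speed":   lambda m: m["latency_ms"],
    "cost":    lambda m: m["cost_per_1k"],
    "value":   lambda m: m["cost_per_1k"] / m["quality"],  # cheapest per unit of quality wins
}

def pick_challengers(goal: str, baseline: str, n: int = 4, max_per_provider: int = 2):
    """Pick up to n challengers for the goal, never more than max_per_provider per vendor."""
    picked, per_provider = [], Counter()
    for cand in sorted(CATALOG, key=GOAL_KEYS[goal]):
        if cand["model"] == baseline:
            continue
        if per_provider[cand["provider"]] >= max_per_provider:
            continue
        picked.append(cand)
        per_provider[cand["provider"]] += 1
        if len(picked) == n:
            break
    return picked

challengers = pick_challengers("quality", baseline="claude-sonnet-4-6")
```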

Step 2

Scenarios generated from your prompt

Test scenarios across three difficulty levels — easy, medium, and hard. Hard scenarios include conflicting constraints mid-conversation to test how models handle pressure.
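A sketch of the kind of scenario object such a generator could produce. The fields, persona wording, and example objectives are illustrative assumptions; a real generator would derive them from your agent's prompt.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    difficulty: str                 # "easy" | "medium" | "hard"
    persona: str                    # simulated user the agent will talk to
    objective: str                  # what the persona is trying to get done
    mid_conversation_twist: str | None = None  # hard scenarios flip a constraint partway through

def build_scenarios(agent_prompt: str) -> list[Scenario]:
    """Illustrative only: a real generator would produce these from agent_prompt with an LLM."""
    return [
        Scenario("easy",   "polite traveler", "book a one-way flight, flexible on dates"),
        Scenario("medium", "budget traveler", "book under $400 with one checked bag"),
        Scenario("hard",   "rushed traveler", "book the earliest available flight",
                 mid_conversation_twist="after turn 3, insists an earlier quoted price must still apply"),
    ]
```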

Step 3

Every model runs every scenario three times

5-turn conversations, median scoring to eliminate outliers. Baseline runs first, challengers run in parallel. Real cost and latency measured from actual API calls.
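In code, the run loop could look roughly like this: baseline first, challengers fanned out in parallel, three runs per scenario reduced to a median. The helper names and the scoring placeholder are assumptions, not Converra's implementation.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

RUNS_PER_MODEL = 3
TURNS_PER_CONVERSATION = 5

def run_scenario(model: str, scenario) -> dict:
    """Placeholder: drive a 5-turn conversation, then judge the transcript."""
    start = time.monotonic()
    # ... call the provider API for each turn, score the finished conversation ...
    return {"quality": 0.0, "billed_tokens": 0, "latency_ms": (time.monotonic() - start) * 1000}

def benchmark_model(model: str, scenarios) -> dict:
    per_scenario = []
    for scenario in scenarios:
        runs = [run_scenario(model, scenario) for _ in range(RUNS_PER_MODEL)]
        per_scenario.append({
            "quality": statistics.median(r["quality"] for r in runs),      # median kills outlier runs
            "latency_ms": statistics.median(r["latency_ms"] for r in runs),
            "billed_tokens": sum(r["billed_tokens"] for r in runs),        # real billed usage, not estimates
        })
    return {"model": model, "scenarios": per_scenario}

def run_benchmark(baseline: str, challengers: list[str], scenarios):
    baseline_result = benchmark_model(baseline, scenarios)          # baseline runs first
    with ThreadPoolExecutor(max_workers=len(challengers)) as pool:  # challengers run in parallel
        challenger_results = list(pool.map(lambda m: benchmark_model(m, scenarios), challengers))
    return baseline_result, challenger_results
```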

Step 4

Winner selected, PR opened

Converra picks the winner based on your goal and opens a GitHub PR with the model config change. The PR includes a full comparison table — quality, cost, and latency across all difficulty levels.
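For a sense of scale, the PR step itself is small. A hedged sketch using PyGithub of what opening such a config-change PR could look like; the branch naming, token handling, and string replace are hypothetical, and only the config path is taken from the example further down this page.

```python
from github import Github  # PyGithub; illustrative, not Converra's actual integration

def open_model_swap_pr(token: str, repo_name: str, old_model: str, new_model: str, comparison_md: str):
    repo = Github(token).get_repo(repo_name)
    base = repo.get_branch("main")
    branch = f"converra/switch-to-{new_model}"
    repo.create_git_ref(ref=f"refs/heads/{branch}", sha=base.commit.sha)

    path = "config/agents/booking.yaml"  # path from the example PR below
    current = repo.get_contents(path, ref=branch)
    updated = current.decoded_content.decode().replace(f"model: {old_model}", f"model: {new_model}")
    repo.update_file(path, f"Switch booking agent model to {new_model}", updated, current.sha, branch=branch)

    return repo.create_pull(
        title=f"Switch booking agent model: {old_model} → {new_model}",
        body=comparison_md,  # full quality / cost / latency comparison table
        head=branch,
        base="main",
    )
```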

Five-turn conversations. Billed tokens. Wall-clock latency.

Eight scenarios across three difficulty tiers, every model run three times for median scoring. Quality floors enforced so a cheap model can't win by being incoherent.

Booking Agent · Benchmark
5 models · 8 scenarios · 3 runs each
Goal: Quality
Model | Provider | Quality | Δ vs base | Latency | Cost / 1k | Verdict
Claude Opus 4.7 | Anthropic | 87.4 | +6.2 | 1840 ms | $0.075 | Winner
GPT-5 | OpenAI | 84.9 | +3.7 | 1520 ms | $0.050 |
Gemini 3 Pro | Google | 82.1 | +0.9 | 980 ms | $0.020 |
Claude Sonnet 4.6 | Anthropic | 81.2 | baseline | 1190 ms | $0.018 | Current
GPT-5 Mini | OpenAI | 73.6 | -7.6 | 720 ms | $0.004 |
Quality floors enforced (easy 75 / med 55 / hard 35) · Median scoring across 3 runs · Real cost & latency from API calls
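The floor rule in that footnote could be enforced with something as simple as the check below. The thresholds come from the card above; the result fields and helper names are illustrative assumptions.

```python
# Floors from the benchmark card: a model is disqualified if its median
# quality on any tier drops below the floor for that tier.
QUALITY_FLOORS = {"easy": 75, "medium": 55, "hard": 35}

def clears_floors(tier_scores: dict[str, float]) -> bool:
    return all(tier_scores[tier] >= floor for tier, floor in QUALITY_FLOORS.items())

def eligible_models(results: list[dict]) -> list[dict]:
    """Drop any candidate that scored incoherently on a tier, however cheap or fast it is."""
    return [r for r in results if clears_floors(r["tier_scores"])]

# e.g. a model scoring {"easy": 81.0, "medium": 58.2, "hard": 29.4} is out:
# it fails the hard-tier floor even if it is the cheapest model in the run.
```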

The dashboard isn't the deliverable. The PR is.

The winner becomes a config change with the full comparison attached. Review it like any other PR. Merge when you're ready.

Open

Switch booking agent model: Sonnet 4.6 → Opus 4.7

converra-bot wants to merge 1 commit into main · +6.2 quality / +650ms / +$0.057 per 1k tokens

config/agents/booking.yaml
-model: claude-sonnet-4-6
+model: claude-opus-4-7

Benchmark run found Claude Opus 4.7 beats the current model on the Quality goal and cleared the quality floor on every difficulty tier.

Easy: 93.1 → 96.8 (+3.7)
Medium: 78.4 → 86.1 (+7.7)
Hard: 62.7 → 70.5 (+7.8)

All 3 checks passed · regression / cost-floor / quality-floor · opened by converra-bot · 2m ago

Different goal, different winner

Same prompt, same scenarios. Pick what matters — Converra returns the model that wins for that target.

Winner · Quality
Claude Opus 4.7
Anthropic
+6.2
quality vs. baseline

The premium pick, ranked by median score across all three difficulty tiers.

Easy: 96.8
Medium: 86.1
Hard: 70.5
Same prompt, same scenarios — different goal, different winner.

Frequently asked questions

How is this different from Braintrust, LangSmith, or Galileo?

Eval tools end at a dashboard. Converra ends at a PR. We generate the scenarios, run the head-to-head, pick the winner under your goal constraint, and open a config-change PR with the comparison table attached. You don't write evals, you don't interpret a graph, you just review the PR.

How do you connect to my agent?

Two paths. (1) Connect via API/SDK — Converra calls your prompt through OpenAI/Anthropic/Google directly, swapping the model header. No code change in your stack. (2) GitHub integration — point Converra at the file containing your model config, and the PR lands in that repo. Either takes under 5 minutes.
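For the API/SDK path, swapping the model mostly means routing the same prompt through a different provider client. A minimal illustrative dispatcher, assuming API keys are set in the environment; it says nothing about Converra's real plumbing.

```python
import os
from openai import OpenAI
import anthropic
import google.generativeai as genai

def ask(provider: str, model: str, system_prompt: str, user_msg: str) -> str:
    """Same system prompt and user message, different model string per provider."""
    if provider == "openai":
        resp = OpenAI().chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": system_prompt},
                      {"role": "user", "content": user_msg}],
        )
        return resp.choices[0].message.content
    if provider == "anthropic":
        resp = anthropic.Anthropic().messages.create(
            model=model, max_tokens=1024, system=system_prompt,
            messages=[{"role": "user", "content": user_msg}],
        )
        return resp.content[0].text
    if provider == "google":
        genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
        resp = genai.GenerativeModel(model, system_instruction=system_prompt).generate_content(user_msg)
        return resp.text
    raise ValueError(f"unknown provider: {provider}")
```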

What if the winner is more expensive?

Set a cost ceiling per benchmark. The Cost and Value goals enforce it natively: models that win on quality but exceed the ceiling are still surfaced, just excluded from the recommendation. The PR description shows estimated monthly cost impact at your current volume so you can decide with real numbers.
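One way a ceiling could gate the recommendation, sketched with hypothetical result fields; none of this is Converra's actual scoring code.

```python
def recommend(results: list[dict], goal: str, cost_ceiling_per_1k: float | None = None) -> dict:
    """Pick the winner for the goal; candidates over the cost ceiling stay visible but cannot win."""
    affordable = [
        r for r in results
        if cost_ceiling_per_1k is None or r["cost_per_1k"] <= cost_ceiling_per_1k
    ]
    if goal == "cost":
        return min(affordable, key=lambda r: r["cost_per_1k"])
    if goal == "value":
        return max(affordable, key=lambda r: r["quality"] / r["cost_per_1k"])
    return max(affordable, key=lambda r: r["quality"])  # quality (and speed) handled analogously

def estimated_monthly_cost(cost_per_1k: float, tokens_per_month: int) -> float:
    """The kind of number the PR description could surface at your current volume."""
    return cost_per_1k * tokens_per_month / 1000
```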

Does this work with custom tools, function calls, or RAG?

Yes. Scenarios run against your full agent loop, not just the LLM call. If your prompt invokes tools, the simulated personas exercise them. Models that fail at tool-calling under pressure get caught in the medium and hard scenarios.

How long does a benchmark take, and what does it cost?

3 to 18 minutes wall-clock, depending on how many models you compare. Token cost is the actual API spend across runs (typically $0.50–$3 per benchmark, surfaced before you start). Baseline runs first, challengers run in parallel.

How many models, which providers?

2 to 6 models per run. OpenAI, Anthropic, and Google. Provider diversity enforced — no more than 2 models from any single provider, so you don't end up comparing GPT-5 to GPT-5 Mini and calling that a benchmark.

Do I need to write test cases?

No. Converra generates scenarios from your agent's prompt across easy, medium, and hard difficulty tiers. Hard scenarios include conflicting constraints mid-conversation to test robustness, not just happy path.

Can I inspect individual benchmark conversations?

Yes. Every turn of every run is persisted. Read the full transcript to see exactly where a model succeeded or failed — useful when the score is close and you want to understand the qualitative difference.
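The persisted record could be as simple as a per-run transcript object; the field names below are illustrative, not Converra's schema.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str          # "persona" or "agent"
    content: str
    tool_calls: list   # any function calls the agent made on this turn

@dataclass
class RunTranscript:
    model: str
    scenario_id: str
    run_index: int     # 1..3, since every scenario runs three times
    turns: list[Turn]  # all five turns, stored verbatim
    quality: float     # judge score for this single run
```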

If a better model exists for your prompt, the PR will be waiting.

Connect your agent, pick a goal, run your first benchmark in minutes.