Model Benchmarks

Model benchmarking that ends with a pull request

Every team builds a model comparison spreadsheet. Nobody maintains it. Converra generates scenarios from your prompt, runs every model head-to-head, and opens a PR when it finds a better option.

How model benchmarking works

1. Pick a goal

Quality, speed, cost, or best value. Converra selects challenger models based on your goal, with provider diversity enforced across OpenAI, Anthropic, and Google.

2. Scenarios generated from your prompt

Converra generates test scenarios at three difficulty levels: easy, medium, and hard. Hard scenarios introduce conflicting constraints mid-conversation to test how models handle pressure.

3. Every model runs every scenario three times

Each run is a 5-turn conversation, and the median of the three scores eliminates outliers. The baseline runs first; challengers run in parallel. Real cost and latency are measured from actual API calls.
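The median-of-three step can be sketched as follows. The function and scenario names here are illustrative, not Converra's actual API; the point is that one bad run cannot drag a model's score down on its own.

```python
from statistics import median

def score_model(run_scenario, model, scenario, runs=3):
    """Run a scenario several times and keep the median quality score,
    so a single outlier run can't skew the result."""
    scores = [run_scenario(model, scenario) for _ in range(runs)]
    return median(scores)

# Stubbed scenario runner: the outlier run (0.20) is discarded
# by taking the median of [0.82, 0.20, 0.85].
results = iter([0.82, 0.20, 0.85])
stub = lambda model, scenario: next(results)
print(score_model(stub, "challenger-a", "hard-1"))  # → 0.82
```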

4. Winner selected, PR opened

Converra picks the winner based on your goal and opens a GitHub PR with the model config change. The PR includes a full comparison table — quality, cost, and latency across all difficulty levels.

You've built this before

An internal script, a spreadsheet, maybe a notebook that compared three models last quarter. The test cases went stale. The models changed. The results sat in a Notion doc nobody reopened.

Converra keeps the comparison current and ships the result as a PR. Scenarios stay fresh because they're generated from your prompt. Results stay actionable because they end in a deployable config change.

What you get from each benchmark

Per-difficulty breakdown

See how each model handles easy, medium, and hard scenarios separately. A model that aces simple questions might fall apart on complex ones.

Measured cost and latency

Real numbers from real API calls. Input and output token pricing, average response time — measured during the benchmark run.
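Per-call cost from token counts follows the standard formula. The prices below are placeholders for illustration, not real provider rates:

```python
def call_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Cost of one API call given per-million-token prices in dollars."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000

# 1,200 input tokens and 300 output tokens at hypothetical
# $3 / $15 per million tokens:
print(f"${call_cost(1200, 300, 3.00, 15.00):.4f}")  # → $0.0081
```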

Auto-PR with comparison table

When a winner is found, Converra opens a GitHub PR with the model switch and a full results table. One-click merge.

Quality floor protection

Cost-optimized models still have to clear performance thresholds. Easy scenarios require 75%+ quality, medium 55%+, hard 35%+.
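The floor check amounts to an all-levels comparison against those thresholds. A minimal sketch, with scores normalized to 0–1:

```python
# Minimum median quality per difficulty level, from the thresholds above.
QUALITY_FLOORS = {"easy": 0.75, "medium": 0.55, "hard": 0.35}

def clears_floor(median_scores):
    """A cost-optimized candidate only qualifies if its median quality
    meets the floor at every difficulty level."""
    return all(median_scores.get(level, 0.0) >= floor
               for level, floor in QUALITY_FLOORS.items())

print(clears_floor({"easy": 0.91, "medium": 0.62, "hard": 0.41}))  # → True
print(clears_floor({"easy": 0.88, "medium": 0.51, "hard": 0.40}))  # → False (medium below 55%)
```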

Four ways to benchmark

Best Quality

Premium models ranked by median quality score. When agent performance is the only thing that matters.

Fastest

Models ranked by measured latency. For agents where response time drives user experience.

Lowest Cost

Cheapest model that clears quality thresholds across all difficulty levels. Same results, lower bill.

Best Value

One model from each tier — premium, balanced, economy. Best quality-to-cost ratio wins.
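Goal-based winner selection can be sketched like this. Field names and numbers are illustrative, and this simplification skips the quality-floor filtering that applies to cost-optimized picks:

```python
def pick_winner(goal, candidates):
    """candidates: dicts with 'quality' (0-1), 'cost' ($/run), 'latency' (s)."""
    if goal == "best_quality":
        return max(candidates, key=lambda c: c["quality"])
    if goal == "fastest":
        return min(candidates, key=lambda c: c["latency"])
    if goal == "lowest_cost":
        return min(candidates, key=lambda c: c["cost"])
    # best_value: highest quality-to-cost ratio wins
    return max(candidates, key=lambda c: c["quality"] / c["cost"])

models = [
    {"name": "premium",  "quality": 0.90, "cost": 0.030, "latency": 2.4},
    {"name": "balanced", "quality": 0.84, "cost": 0.012, "latency": 1.6},
    {"name": "economy",  "quality": 0.70, "cost": 0.004, "latency": 1.1},
]
print(pick_winner("best_quality", models)["name"])  # → premium
print(pick_winner("best_value", models)["name"])    # → economy
```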

Public benchmarks test public prompts

A model that ranks third on a leaderboard might rank first for your booking agent, your triage bot, your intake flow. Converra runs your scenarios, with your personas, at your difficulty levels. The ranking that matters is the one built from your actual use case.

Frequently asked questions

How long does a benchmark take?

3 to 18 minutes, depending on how many models you compare. Baseline runs first (~1 minute), then challengers run in parallel.

How many models can I compare?

2 to 6. Converra auto-selects models based on your goal, or you can choose manually. Provider diversity is enforced — no more than 2 models from any single provider.
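The diversity rule is a greedy walk over a goal-ranked model list, capping picks per provider. The model names and ranking below are made up for illustration:

```python
from collections import Counter

MAX_PER_PROVIDER = 2

def select_challengers(ranked_models, limit):
    """Walk a goal-ranked list, skipping any model whose provider already
    has MAX_PER_PROVIDER picks, until `limit` challengers are chosen."""
    picks, per_provider = [], Counter()
    for model in ranked_models:
        if per_provider[model["provider"]] < MAX_PER_PROVIDER:
            picks.append(model["name"])
            per_provider[model["provider"]] += 1
        if len(picks) == limit:
            break
    return picks

ranked = [
    {"name": "a1", "provider": "openai"},
    {"name": "a2", "provider": "openai"},
    {"name": "a3", "provider": "openai"},    # skipped: third OpenAI model
    {"name": "b1", "provider": "anthropic"},
    {"name": "c1", "provider": "google"},
]
print(select_challengers(ranked, limit=4))  # → ['a1', 'a2', 'b1', 'c1']
```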

Which providers are supported?

OpenAI, Anthropic, and Google. Converra selects across providers to give you a balanced comparison.

Do I need to write test cases?

No. Converra generates scenarios from your agent's prompt, covering easy, medium, and hard difficulty levels. Hard scenarios include conflicting constraints that test robustness.

Can I inspect individual benchmark conversations?

Yes. Every conversation from every model run is persisted. You can read the full transcript to see exactly where a model succeeded or failed.

What happens when a winner is found?

Converra opens a GitHub PR with the model config change and a comparison table. The PR includes quality scores, cost, and latency across all difficulty levels. Merge when you're ready.

Find the right model for your agent

Connect your agent and run your first benchmark in minutes. Converra handles the scenarios, the scoring, and the PR.

Start for free