Model Benchmarks

Is your agent still on the right model?

You picked a model when you shipped. Three frontier models have launched since. Converra re-runs the bake-off against your actual scenarios and opens a PR with the winner.

Eval tools tell you which model scored better. Converra opens the PR.

Quality · Speed · Cost · Value

You've done this once. In a spreadsheet.

Three models, last quarter, one engineer for two days. The test cases went stale. The models changed. The Notion doc nobody reopened.

From prompt to PR in under 20 minutes

Step 1

Pick a goal

Quality, speed, cost, or best value. Converra selects challenger models based on your goal, with provider diversity enforced across OpenAI, Anthropic, and Google.
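A rough sketch of what goal-based challenger selection with a provider-diversity cap could look like. The two-per-provider limit matches the FAQ below; the catalog structure, field names, and prior scores (copied from the comparison table further down purely for illustration) are assumptions, not Converra's actual internals.

```python
from collections import Counter

# Illustrative candidate catalog; quality/latency/cost priors are placeholders
# (here reusing the comparison-table numbers from this page).
CATALOG = [
    {"model": "claude-opus-4-7",   "provider": "anthropic", "quality": 87.4, "cost_per_1k": 0.075, "latency_ms": 1840},
    {"model": "gpt-5",             "provider": "openai",    "quality": 84.9, "cost_per_1k": 0.050, "latency_ms": 1520},
    {"model": "gemini-3-pro",      "provider": "google",    "quality": 82.1, "cost_per_1k": 0.020, "latency_ms": 980},
    {"model": "claude-sonnet-4-6", "provider": "anthropic", "quality": 81.2, "cost_per_1k": 0.018, "latency_ms": 1190},
    {"model": "gpt-5-mini",        "provider": "openai",    "quality": 73.6, "cost_per_1k": 0.004, "latency_ms": 720},
]

# Each goal ranks candidates differently.
GOAL_KEYS = {
    "quality": lambda m: -m["quality"],
    "speed":   lambda m: m["latency_ms"],
    "cost":    lambda m: m["cost_per_1k"],
    "value":   lambda m: m["cost_per_1k"] / m["quality"],  # cheapest per unit of quality wins
}

def pick_challengers(goal: str, baseline: str, n: int = 4, max_per_provider: int = 2):
    """Pick up to n challengers for the goal, never more than max_per_provider per vendor."""
    picked, per_provider = [], Counter()
    for cand in sorted(CATALOG, key=GOAL_KEYS[goal]):
        if cand["model"] == baseline:
            continue
        if per_provider[cand["provider"]] >= max_per_provider:
            continue
        picked.append(cand)
        per_provider[cand["provider"]] += 1
        if len(picked) == n:
            break
    return picked

challengers = pick_challengers("quality", baseline="claude-sonnet-4-6")
```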

Step 2

Scenarios generated from your prompt

Test scenarios across three difficulty levels — easy, medium, and hard. Hard scenarios include conflicting constraints mid-conversation to test how models handle pressure.
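A sketch of the kind of scenario object such a generator could produce. The fields, persona wording, and example objectives are illustrative assumptions; a real generator would derive them from your agent's prompt.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    difficulty: str                 # "easy" | "medium" | "hard"
    persona: str                    # simulated user the agent will talk to
    objective: str                  # what the persona is trying to get done
    mid_conversation_twist: str | None = None  # hard scenarios flip a constraint partway through

def build_scenarios(agent_prompt: str) -> list[Scenario]:
    """Illustrative only: a real generator would produce these from agent_prompt with an LLM."""
    return [
        Scenario("easy",   "polite traveler", "book a one-way flight, flexible on dates"),
        Scenario("medium", "budget traveler", "book under $400 with one checked bag"),
        Scenario("hard",   "rushed traveler", "book the earliest available flight",
                 mid_conversation_twist="after turn 3, insists an earlier quoted price must still apply"),
    ]
```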

Step 3

Every model runs every scenario three times

5-turn conversations, median scoring to eliminate outliers. Baseline runs first, challengers run in parallel. Real cost and latency measured from actual API calls.
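In code, the run loop could look roughly like this: baseline first, challengers fanned out in parallel, three runs per scenario reduced to a median. The helper names and the scoring placeholder are assumptions, not Converra's implementation.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

RUNS_PER_MODEL = 3
TURNS_PER_CONVERSATION = 5

def run_scenario(model: str, scenario) -> dict:
    """Placeholder: drive a 5-turn conversation, then judge the transcript."""
    start = time.monotonic()
    # ... call the provider API for each turn, score the finished conversation ...
    return {"quality": 0.0, "billed_tokens": 0, "latency_ms": (time.monotonic() - start) * 1000}

def benchmark_model(model: str, scenarios) -> dict:
    per_scenario = []
    for scenario in scenarios:
        runs = [run_scenario(model, scenario) for _ in range(RUNS_PER_MODEL)]
        per_scenario.append({
            "quality": statistics.median(r["quality"] for r in runs),      # median kills outlier runs
            "latency_ms": statistics.median(r["latency_ms"] for r in runs),
            "billed_tokens": sum(r["billed_tokens"] for r in runs),        # real billed usage, not estimates
        })
    return {"model": model, "scenarios": per_scenario}

def run_benchmark(baseline: str, challengers: list[str], scenarios):
    baseline_result = benchmark_model(baseline, scenarios)          # baseline runs first
    with ThreadPoolExecutor(max_workers=len(challengers)) as pool:  # challengers run in parallel
        challenger_results = list(pool.map(lambda m: benchmark_model(m, scenarios), challengers))
    return baseline_result, challenger_results
```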

Step 4

Winner selected, PR opened

Converra picks the winner based on your goal and opens a GitHub PR with the model config change. The PR includes a full comparison table — quality, cost, and latency across all difficulty levels.
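For a sense of scale, the PR step itself is small. A hedged sketch using PyGithub of what opening such a config-change PR could look like; the branch naming, token handling, and string replace are hypothetical, and only the config path is taken from the example further down this page.

```python
from github import Github  # PyGithub; illustrative, not Converra's actual integration

def open_model_swap_pr(token: str, repo_name: str, old_model: str, new_model: str, comparison_md: str):
    repo = Github(token).get_repo(repo_name)
    base = repo.get_branch("main")
    branch = f"converra/switch-to-{new_model}"
    repo.create_git_ref(ref=f"refs/heads/{branch}", sha=base.commit.sha)

    path = "config/agents/booking.yaml"  # path from the example PR below
    current = repo.get_contents(path, ref=branch)
    updated = current.decoded_content.decode().replace(f"model: {old_model}", f"model: {new_model}")
    repo.update_file(path, f"Switch booking agent model to {new_model}", updated, current.sha, branch=branch)

    return repo.create_pull(
        title=f"Switch booking agent model: {old_model} → {new_model}",
        body=comparison_md,  # full quality / cost / latency comparison table
        head=branch,
        base="main",
    )
```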

Five-turn conversations. Billed tokens. Wall-clock latency.

Eight scenarios across three difficulty tiers, every model run three times for median scoring. Quality floors enforced so a cheap model can't win by being incoherent.

Booking Agent · Benchmark
5 models · 8 scenarios · 3 runs each
Goal: Quality
Model | Provider | Quality | Δ vs base | Latency | Cost / 1k | Verdict
Claude Opus 4.7 | Anthropic | 87.4 | +6.2 | 1840 ms | $0.075 | Winner
GPT-5 | OpenAI | 84.9 | +3.7 | 1520 ms | $0.050 |
Gemini 3 Pro | Google | 82.1 | +0.9 | 980 ms | $0.020 |
Claude Sonnet 4.6 | Anthropic | 81.2 | baseline | 1190 ms | $0.018 | Current
GPT-5 Mini | OpenAI | 73.6 | -7.6 | 720 ms | $0.004 |
Quality floors enforced (easy 75 / med 55 / hard 35) · Median scoring across 3 runs · Real cost & latency from API calls
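The floor rule in that footnote could be enforced with something as simple as the check below. The thresholds come from the card above; the result fields and helper names are illustrative assumptions.

```python
# Floors from the benchmark card: a model is disqualified if its median
# quality on any tier drops below the floor for that tier.
QUALITY_FLOORS = {"easy": 75, "medium": 55, "hard": 35}

def clears_floors(tier_scores: dict[str, float]) -> bool:
    return all(tier_scores[tier] >= floor for tier, floor in QUALITY_FLOORS.items())

def eligible_models(results: list[dict]) -> list[dict]:
    """Drop any candidate that scored incoherently on a tier, however cheap or fast it is."""
    return [r for r in results if clears_floors(r["tier_scores"])]

# e.g. a model scoring {"easy": 81.0, "medium": 58.2, "hard": 29.4} is out:
# it fails the hard-tier floor even if it is the cheapest model in the run.
```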

The dashboard isn't the deliverable. The PR is.

The winner becomes a config change with the full comparison attached. Review it like any other PR. Merge when you're ready.

Open

Switch booking agent model: Sonnet 4.6 → Opus 4.7

converra-bot wants to merge 1 commit into main · +6.2 quality / +650ms / +$0.057 per 1k tokens

config/agents/booking.yaml
-model: claude-sonnet-4-6
+model: claude-opus-4-7

Benchmark run found Claude Opus 4.7 beats the current model on the Quality goal and cleared the quality floor on every difficulty tier.

Easy: 93.1 → 96.8 (+3.7)
Medium: 78.4 → 86.1 (+7.7)
Hard: 62.7 → 70.5 (+7.8)

All 3 checks passed · regression / cost-floor / quality-floor · opened by converra-bot · 2m ago

Different goal, different winner

Same prompt, same scenarios. Pick what matters — Converra returns the model that wins for that target.

Winner · Quality
Claude Opus 4.7
Anthropic
+6.2
quality vs. baseline

The premium pick, ranked by median score across all three difficulty tiers.

Easy: 96.8
Medium: 86.1
Hard: 70.5
Same prompt, same scenarios — different goal, different winner.

Frequently asked questions

How is this different from Braintrust, LangSmith, or Galileo?

Eval tools end at a dashboard. Converra ends at a PR. We generate the scenarios, run the head-to-head, pick the winner under your goal constraint, and open a config-change PR with the comparison table attached. You don't write evals, you don't interpret a graph, you just review the PR.

How do you connect to my agent?

Two paths. (1) Connect via API/SDK — Converra calls your prompt through OpenAI/Anthropic/Google directly, swapping the model header. No code change in your stack. (2) GitHub integration — point Converra at the file containing your model config, and the PR lands in that repo. Either takes under 5 minutes.
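For the API/SDK path, swapping the model mostly means routing the same prompt through a different provider client. A minimal illustrative dispatcher, assuming API keys are set in the environment; it says nothing about Converra's real plumbing.

```python
import os
from openai import OpenAI
import anthropic
import google.generativeai as genai

def ask(provider: str, model: str, system_prompt: str, user_msg: str) -> str:
    """Same system prompt and user message, different model string per provider."""
    if provider == "openai":
        resp = OpenAI().chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": system_prompt},
                      {"role": "user", "content": user_msg}],
        )
        return resp.choices[0].message.content
    if provider == "anthropic":
        resp = anthropic.Anthropic().messages.create(
            model=model, max_tokens=1024, system=system_prompt,
            messages=[{"role": "user", "content": user_msg}],
        )
        return resp.content[0].text
    if provider == "google":
        genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
        resp = genai.GenerativeModel(model, system_instruction=system_prompt).generate_content(user_msg)
        return resp.text
    raise ValueError(f"unknown provider: {provider}")
```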

What if the winner is more expensive?

Set a cost ceiling per benchmark. The Cost and Value goals enforce it natively: models that win on quality but exceed the ceiling are still surfaced, just excluded from the recommendation. The PR description shows estimated monthly cost impact at your current volume so you can decide with real numbers.
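One way a ceiling could gate the recommendation, sketched with hypothetical result fields; none of this is Converra's actual scoring code.

```python
def recommend(results: list[dict], goal: str, cost_ceiling_per_1k: float | None = None) -> dict:
    """Pick the winner for the goal; candidates over the cost ceiling stay visible but cannot win."""
    affordable = [
        r for r in results
        if cost_ceiling_per_1k is None or r["cost_per_1k"] <= cost_ceiling_per_1k
    ]
    if goal == "cost":
        return min(affordable, key=lambda r: r["cost_per_1k"])
    if goal == "value":
        return max(affordable, key=lambda r: r["quality"] / r["cost_per_1k"])
    return max(affordable, key=lambda r: r["quality"])  # quality (and speed) handled analogously

def estimated_monthly_cost(cost_per_1k: float, tokens_per_month: int) -> float:
    """The kind of number the PR description could surface at your current volume."""
    return cost_per_1k * tokens_per_month / 1000
```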

Does this work with custom tools, function calls, or RAG?

Yes. Scenarios run against your full agent loop, not just the LLM call. If your prompt invokes tools, the simulated personas exercise them. Models that fail at tool-calling under pressure get caught in the medium and hard scenarios.

How long does a benchmark take, and what does it cost?

3 to 18 minutes wall-clock, depending on how many models you compare. Token cost is the actual API spend across runs (typically $0.50–$3 per benchmark, surfaced before you start). Baseline runs first, challengers run in parallel.

How many models, which providers?

2 to 6 models per run. OpenAI, Anthropic, and Google. Provider diversity enforced — no more than 2 models from any single provider, so you don't end up comparing GPT-5 to GPT-5 Mini and calling that a benchmark.

Do I need to write test cases?

No. Converra generates scenarios from your agent's prompt across easy, medium, and hard difficulty tiers. Hard scenarios include conflicting constraints mid-conversation to test robustness, not just happy path.

Can I inspect individual benchmark conversations?

Yes. Every turn of every run is persisted. Read the full transcript to see exactly where a model succeeded or failed — useful when the score is close and you want to understand the qualitative difference.
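The persisted record could be as simple as a per-run transcript object; the field names below are illustrative, not Converra's schema.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str          # "persona" or "agent"
    content: str
    tool_calls: list   # any function calls the agent made on this turn

@dataclass
class RunTranscript:
    model: str
    scenario_id: str
    run_index: int     # 1..3, since every scenario runs three times
    turns: list[Turn]  # all five turns, stored verbatim
    quality: float     # judge score for this single run
```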

If a better model exists for your prompt, the PR will be waiting.

Connect your agent, pick a goal, run your first benchmark in minutes.