No customer ever sees a worse experience

Every agent improvement is regression-tested before it ships. Converra checks every change against the scenarios your agent already handles well. If a variant breaks something, you see the tradeoff before it reaches production. If something slips through, it rolls back automatically.

Every prompt change is a risk

You improve the agent's handling of refund requests and break its ability to transfer calls. You adjust the tone to be more empathetic and it starts missing booking confirmations. Agent prompts are interconnected. Changing one behavior can silently degrade others.

Without regression testing, you find out from customers. With it, you find out before deployment.

How regression testing works in Converra

1. Golden sets are generated from your prompt analysis

Converra identifies the scenarios your current agent handles well and creates a "golden set" of test cases. These represent the behaviors you need to protect when making changes.

2. Every winning variant runs against the golden set

When a prompt variant shows improvement in head-to-head simulation, it automatically runs against your golden set. Short 2-3 turn exchanges validate that existing capabilities are preserved.
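The core of this step is a simple comparison: each golden-set scenario has a baseline score the current agent achieves, and a variant passes only if it stays at or near that baseline. The sketch below illustrates the idea in Python; the names (`Scenario`, `check_golden_set`, the tolerance value) are illustrative assumptions, not Converra's actual API.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    metric: str
    baseline_score: float  # how the current prompt scores on this scenario

def check_golden_set(variant_scores: dict[str, float],
                     golden_set: list[Scenario],
                     tolerance: float = 0.02) -> list[str]:
    """Return the scenarios where the variant scores worse than baseline."""
    regressions = []
    for scenario in golden_set:
        score = variant_scores.get(scenario.name, 0.0)
        if score < scenario.baseline_score - tolerance:
            regressions.append(scenario.name)
    return regressions

golden = [
    Scenario("refund_request", "task_completion", 0.95),
    Scenario("call_transfer", "task_completion", 0.90),
]
scores = {"refund_request": 0.97, "call_transfer": 0.81}
print(check_golden_set(scores, golden))  # the transfer scenario regressed
```

The variant improved refund handling but dropped call transfers below baseline, so it is flagged before deployment rather than after.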

3. Regressions are surfaced with full context

If a variant degrades performance on any golden set scenario, Converra shows you exactly what changed: which scenario, which metric, and the full conversation side-by-side. You see the tradeoff clearly.

4. Post-deployment rollback catches what simulation missed

If any production metric degrades after a deployment, Converra rolls back automatically before your next customer conversation. Simulation catches most regressions. Rollback catches the rest.
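The rollback trigger reduces to one check: did any tracked production metric drop below its pre-deployment baseline by more than an allowed margin? A minimal sketch of that decision logic, with hypothetical metric names and an assumed threshold:

```python
def should_roll_back(baseline: dict[str, float],
                     current: dict[str, float],
                     threshold: float = 0.05) -> bool:
    """Roll back if any tracked production metric drops past the threshold."""
    return any(
        current.get(metric, 0.0) < value - threshold
        for metric, value in baseline.items()
    )

baseline = {"task_completion": 0.92, "csat": 0.88}
print(should_roll_back(baseline, {"task_completion": 0.91, "csat": 0.80}))  # True
print(should_roll_back(baseline, {"task_completion": 0.93, "csat": 0.89}))  # False
```

A single degraded metric is enough to trigger the rollback; the previous version is restored rather than waiting for every metric to fail.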

Why AI agents need regression testing

Agent improvements are fragile

A prompt change that improves booking completion might break cancellation handling. A tone adjustment that sounds better for frustrated users might confuse new ones. Every improvement has the potential to break something else.

Traditional testing doesn't catch it

Unit tests and eval suites test what you thought of. Regressions happen in the gaps between test cases, in the combinations and edge cases that were working fine until someone changed a paragraph in the system prompt.

Manual review can't scale

Reading through conversation logs after every prompt change works for the first few updates. At production volume, with multiple agents and weekly iterations, manual regression checking falls apart.

The cost of a regression in production is high

A regression found in simulation costs nothing. A regression found in production costs customer trust, support tickets, and engineering time to investigate and roll back. The math is straightforward.

Two layers of regression protection

Simulation catches regressions before deployment. Production monitoring catches anything that slips through.

Pre-deployment: Simulation regression testing

  • Golden set generated automatically from prompt analysis
  • Every variant is tested against golden set before deployment
  • Regressions surfaced with full side-by-side comparison
  • Fluke detection prevents false positives from noisy LLM outputs

Post-deployment: Production metric monitoring

  • Production metrics tracked after every deployment
  • Automatic rollback if any metric regresses
  • Rollback happens before your next customer conversation
  • Full audit trail of what changed and why it was rolled back

Regression testing enables faster iteration

Teams without regression testing ship changes slowly because they're afraid of breaking things. Teams with regression testing ship with confidence because every change is validated. The faster you iterate, the faster your agent improves. Regression testing removes the fear that slows teams down.

Frequently asked questions

What is a golden set?

A golden set is a collection of test scenarios that represent the capabilities your agent already handles well. Converra generates these automatically by analyzing your prompt and identifying the key behaviors it should preserve. Think of it as the "do no harm" check for every change.

How does fluke detection work?

LLM outputs are non-deterministic. A single bad response doesn't necessarily mean a regression. Converra runs multiple simulations per scenario and uses statistical methods to distinguish real regressions from random noise in model outputs.
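One standard way to separate signal from noise here is a two-proportion z-test over repeated runs: only flag a regression when the variant's failure rate is significantly higher than the baseline's. This is a sketch of that general statistical technique, not Converra's actual implementation; the function name and threshold are assumptions.

```python
from math import sqrt

def is_real_regression(baseline_failures: int, baseline_runs: int,
                       variant_failures: int, variant_runs: int,
                       z_threshold: float = 1.96) -> bool:
    """Two-proportion z-test: flag a regression only when the variant's
    failure rate is significantly above baseline, not a one-off fluke."""
    p1 = baseline_failures / baseline_runs
    p2 = variant_failures / variant_runs
    pooled = (baseline_failures + variant_failures) / (baseline_runs + variant_runs)
    se = sqrt(pooled * (1 - pooled) * (1 / baseline_runs + 1 / variant_runs))
    if se == 0:
        return p2 > p1
    return (p2 - p1) / se > z_threshold

# One extra bad run out of ten is noise; seven out of ten is a real regression.
print(is_real_regression(1, 10, 2, 10))  # False
print(is_real_regression(1, 10, 7, 10))  # True
```

With only a handful of runs per scenario, a single failure rarely clears the significance bar, which is exactly the behavior you want from fluke detection.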

Can I add custom scenarios to the golden set?

Yes. Converra generates golden sets automatically, and you can add specific scenarios that matter to your business. Both auto-generated and custom scenarios are included in every regression check.

How fast is the rollback?

Instant. Converra monitors production metrics after deployment and triggers rollback automatically when regressions are detected. The previous version is restored before the next customer conversation reaches the agent.

Does regression testing slow down the optimization loop?

Regression tests use short 2-3 turn exchanges designed for fast validation. A typical regression check adds 1-2 minutes to the optimization cycle. The alternative is finding the regression in production, which takes much longer to detect and fix.

What if I want to ship a change that has a known regression tradeoff?

Converra surfaces tradeoffs clearly. If a variant improves task completion by 15% but slightly degrades tone in one edge case, you see that tradeoff and make the call. Regression testing informs your decision; it doesn't block it.

Ship agent improvements without the risk

Connect your agent and see regression testing in action. Every change is validated before it reaches your customers.

Start for free