Production A/B testing

Simulation picks the winner. Production A/B testing confirms it on real traffic — head-to-head against baseline, with automatic rollback the moment the challenger slips.

The gate between “simulation winner” and 100% rollout

Manual A/B testing is broken as a way to find improvements for agents: it's slow, it exposes real users to your guesses, and most agents don't have the traffic volume to reach significance. Converra runs its search in simulation instead, against scored personas.

Production A/B testing is the narrow, time-bounded confirmation step on top — only the variant that won in simulation gets a slice of live traffic, and only long enough to verify the win holds up against real users before promoting to 100%.

How production A/B testing works

Simulation picks the winner

Before any real traffic sees a change, Converra has already run the variant against your scored simulation suite and confirmed head-to-head lift over baseline on the same personas.

Traffic splits between baseline and challenger

The challenger is exposed to a configurable slice of production traffic (default 10–20%). Baseline serves the rest. The same conversation only ever sees one variant — no leakage.
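
A sticky split like this is usually implemented by hashing the conversation ID rather than flipping a coin per message. A minimal sketch, with illustrative names and a hypothetical 15% share; Converra's actual mechanism isn't shown here:

```python
import hashlib

CHALLENGER_SHARE = 0.15  # illustrative; the slice is configurable (default 10-20%)

def assign_arm(conversation_id: str) -> str:
    """Deterministically map a conversation to one arm.

    Hashing the conversation ID (instead of sampling per message)
    guarantees the same conversation always sees the same variant,
    so there is no leakage between arms.
    """
    digest = hashlib.sha256(conversation_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "challenger" if bucket < CHALLENGER_SHARE else "baseline"
```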

Same detection, same scoring, side by side

Both arms are evaluated against the exact failure pattern the fix targets — routing errors, hallucinations, escalations — using identical detectors. Apples-to-apples by construction.
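
In code terms, "identical detectors" means one detector function applied to both arms. A minimal sketch, assuming a hypothetical detector interface (the names are illustrative, not Converra's API):

```python
from typing import Callable, Iterable

# A detector flags whether a single conversation exhibits the targeted
# failure pattern (e.g. a routing error). Applying the same function to
# both arms makes the comparison apples-to-apples by construction.
Detector = Callable[[dict], bool]

def failure_rate(conversations: Iterable[dict], detector: Detector) -> float:
    convs = list(conversations)
    if not convs:
        return 0.0
    return sum(detector(c) for c in convs) / len(convs)

# Hypothetical usage: one detector, two arms.
# baseline_rate = failure_rate(baseline_convs, detects_routing_error)
# challenger_rate = failure_rate(challenger_convs, detects_routing_error)
```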

Promote on lift, auto-rollback on regression

When the challenger clears the head-to-head threshold, Converra promotes it to 100% and hands off to production verification. If it underperforms baseline at any point during the gate, traffic snaps back automatically.
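
The promote/rollback decision reduces to a small piece of logic evaluated continuously during the gate. A sketch under assumed thresholds; the `min_lift` value and function names are illustrative, not Converra's API:

```python
def gate_decision(baseline_rate: float, challenger_rate: float,
                  guardrails_ok: bool, min_lift: float = 0.25) -> str:
    """Decide what to do with the challenger at each evaluation tick.

    Illustrative only: a real gate would also check sample size before
    promoting (see the verdict sketch further down).
    """
    if not guardrails_ok or challenger_rate > baseline_rate:
        return "rollback"   # traffic snaps back to baseline
    lift = (baseline_rate - challenger_rate) / baseline_rate if baseline_rate else 0.0
    if lift >= min_lift:
        return "promote"    # hand off to production verification
    return "hold"           # keep the split running
```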

Three verdicts, no silent promotions

Every gated rollout ends in one of three places. Nothing reaches 100% of users without a verdict — and nothing stays at 100% if it regresses.

Promoted

Challenger beat baseline head-to-head on the targeted failure pattern with enough paired traffic to be confident. Promoted to 100% and queued for production verification.

Routing failure rate: baseline 11.4%, challenger 3.8% across 1,840 paired conversations. Promoted on day 2.

Rolled Back

Challenger underperformed or regressed on a guardrail metric. Traffic snapped back to baseline within minutes. The variant is re-queued for re-diagnosis with the live evidence.

Hallucination rate held flat, but escalation rate jumped from 2.1% to 4.7%. Rolled back at hour 6.

Inconclusive

Not enough paired traffic to call a winner before the gate window closed. Converra holds baseline, extends the window, or rotates a different variant in — never silently promotes.

Low-volume agent saw 142 paired conversations in 7 days, below the minimum needed to call a head-to-head winner. Gate extended, no promotion.
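
Putting the three verdicts together: a sketch of how a gate window might resolve, using a standard two-proportion z-test and a minimum paired-traffic floor as stand-ins for "enough paired traffic to be confident". The thresholds are assumptions, not Converra's actual statistics:

```python
import math

def resolve_gate(base_fail: int, base_n: int,
                 chal_fail: int, chal_n: int,
                 min_pairs: int = 500, z_crit: float = 1.96) -> str:
    """Classify a finished gate window into one of the three verdicts."""
    if min(base_n, chal_n) < min_pairs:
        return "inconclusive"          # extend the window or rotate a variant
    p1, p2 = base_fail / base_n, chal_fail / chal_n
    pooled = (base_fail + chal_fail) / (base_n + chal_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / chal_n))
    z = (p1 - p2) / se if se else 0.0   # positive z: challenger fails less often
    if z >= z_crit:
        return "promoted"              # challenger beat baseline head-to-head
    if z <= -z_crit:
        return "rolled_back"           # challenger regressed
    return "inconclusive"

# With numbers like the Promoted example above (11.4% vs 3.8% failure
# across roughly 1,840 paired conversations), the gap clears z_crit by
# a wide margin. The Inconclusive example (142 pairs) never reaches the
# statistical test at all: it fails the min_pairs floor first.
```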

A/B testing vs. production verification

Two different jobs in the rollout. Most teams confuse them; both matter.

Production A/B testing

Concurrent split traffic. Baseline and challenger run side by side on different conversations during the rollout gate.

Answers: does the challenger beat baseline on real users, right now?

Production verification

Before vs. after on 100% of traffic, once the variant is fully rolled out. Confirms the lift persists over time.

Answers: did the fix actually move the metric in production after we shipped it?

Frequently asked questions

What is production A/B testing for AI agents?

Production A/B testing splits live traffic between your current agent (baseline) and a candidate variant (challenger), then measures which one performs better on the specific failure pattern the change targets. In Converra, it runs as a gated rollout step on variants that simulation has already picked as winners — not as the primary search mechanism.

How is this different from production verification?

Production verification compares before vs. after on 100% of traffic once a variant is fully rolled out. Production A/B testing runs concurrently — baseline and challenger serve real traffic at the same time, on different conversations, for an apples-to-apples comparison without time-based confounders like a model update or a Monday-vs-Tuesday traffic shift.

Doesn't Converra argue against A/B testing? How does this fit?

Converra argues against using manual A/B testing as the search mechanism — writing variants by hand, exposing real users to your guesses, and waiting weeks for significance. Production A/B testing in Converra is the opposite: simulation does the search, picks a winner against your scored personas, and the live A/B is a short, narrow gate to confirm the win on real traffic with auto-rollback if it doesn't hold up.

What happens when the challenger underperforms?

Traffic snaps back to baseline automatically — no manual intervention, no overnight regression. The failed challenger is re-queued for diagnosis with the live evidence attached: what it tried, where it broke, and which guardrail tripped. The next variant generated targets the actual failure mode, not the simulated one.
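
The "live evidence" attached to a failed challenger might look something like this. A hedged sketch: the field names are illustrative, not Converra's schema:

```python
from dataclasses import dataclass, field

@dataclass
class RollbackEvidence:
    """Illustrative shape of the evidence re-queued with a failed challenger."""
    variant_id: str
    metric: str                  # headline metric the fix targeted
    baseline_rate: float
    challenger_rate: float
    tripped_guardrail: str | None = None            # e.g. "escalation_rate"
    sample_conversations: list[str] = field(default_factory=list)  # where it broke
```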

How long does a production A/B test run?

It runs until one of three things happens: the head-to-head threshold for lift is cleared (promote), a guardrail metric regresses (rollback), or the gate window expires without enough paired traffic (extend or rotate). High-volume agents typically resolve within 1–3 days. Low-volume agents may need to lean on simulation evidence and skip straight to production verification.

What metrics decide the winner?

The same failure pattern the fix targeted — routing errors, hallucinations, escalation rate, goal completion — measured with identical detectors on both arms. A small set of guardrail metrics (latency, escalation, cost) can independently trigger rollback even if the headline metric is improving.
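
Conceptually, the guardrails are independent tripwires layered on top of the headline metric. A sketch with assumed metric names and thresholds, not Converra's configuration:

```python
# Any one of these can trigger rollback on its own, even while the
# headline metric is improving. Thresholds here are illustrative.
GUARDRAILS = {
    "p95_latency_ms":        {"max_regression_pct": 20},
    "escalation_rate":       {"max_regression_pct": 50},  # 2.1% -> 4.7% would trip this
    "cost_per_conversation": {"max_regression_pct": 25},
}

def guardrails_ok(baseline: dict, challenger: dict) -> bool:
    """Return False the moment any guardrail metric regresses past its limit."""
    for metric, rule in GUARDRAILS.items():
        base, chal = baseline[metric], challenger[metric]
        if base and (chal - base) / base * 100 > rule["max_regression_pct"]:
            return False
    return True
```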

Ship with a safety net

Connect your agent and gate every rollout with a live A/B against baseline — auto-rollback included.

Start for free