Simulation picks the winner. Production A/B testing confirms it on real traffic — head-to-head against baseline, with automatic rollback the moment the challenger slips.
Manual A/B testing is broken as the way you find improvements for agents: it's slow, it exposes real users to your guesses, and most agents don't have the traffic volume to reach significance. Converra's search runs in simulation against scored personas instead.
Production A/B testing is the narrow, time-bounded confirmation step on top — only the variant that won in simulation gets a slice of live traffic, and only long enough to verify the win holds up against real users before promoting to 100%.
Before any real traffic sees a change, Converra has already run the variant against your scored simulation suite and confirmed head-to-head lift over baseline on the same personas.
The challenger is exposed to a configurable slice of production traffic (default 10–20%). Baseline serves the rest. The same conversation only ever sees one variant — no leakage.
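The "same conversation only ever sees one variant" guarantee is typically achieved with deterministic bucketing: hash a stable conversation ID into the split, so repeated turns of one conversation always land on the same arm. A minimal sketch of the idea — function and parameter names are hypothetical, not Converra's API:

```python
import hashlib

def assign_arm(conversation_id: str, challenger_share: float = 0.15) -> str:
    """Deterministically bucket a conversation into 'challenger' or 'baseline'.

    The same conversation_id always maps to the same arm, so a conversation
    never sees both variants mid-flight (no leakage). challenger_share is
    the slice of live traffic the challenger receives (here 15%, inside the
    10-20% default range; illustrative only).
    """
    digest = hashlib.sha256(conversation_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    return "challenger" if bucket < challenger_share else "baseline"
```

Because the assignment is a pure function of the conversation ID, every service in the stack can compute it independently with no shared state, and replaying a conversation reproduces the same arm.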
Both arms are evaluated against the exact failure pattern the fix targets — routing errors, hallucinations, escalations — using identical detectors. Apples-to-apples by construction.
When the challenger clears the head-to-head threshold, Converra promotes it to 100% and hands off to production verification. If it underperforms baseline at any point during the gate, traffic snaps back automatically.
Every gated rollout ends in one of three places. Nothing reaches 100% of users without a verdict — and nothing stays at 100% if it regresses.
Promoted. The challenger beat baseline head-to-head on the targeted failure pattern with enough paired traffic to be confident. It goes to 100% and is queued for production verification.
Rolled back. The challenger underperformed or regressed on a guardrail metric. Traffic snapped back to baseline within minutes, and the variant is re-queued for re-diagnosis with the live evidence.
Inconclusive. Not enough paired traffic to call a winner before the gate window closed. Converra holds baseline, extends the window, or rotates a different variant in. It never silently promotes.
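The three verdicts map onto a simple decision rule. A minimal sketch, assuming a two-proportion z-test on per-arm success rates — the test choice and thresholds are illustrative assumptions, not Converra's actual statistics:

```python
from math import sqrt

def gate_verdict(base_ok, base_n, chal_ok, chal_n,
                 min_n=500, z_crit=1.96, guardrail_tripped=False):
    """Resolve a gated rollout into 'promote', 'rollback', or 'inconclusive'.

    Compares the challenger's success rate against baseline with a
    two-proportion z-test. min_n and z_crit are made-up thresholds for
    illustration, not Converra's.
    """
    if guardrail_tripped:
        return "rollback"              # guardrails override the headline metric
    if min(base_n, chal_n) < min_n:
        return "inconclusive"          # not enough paired traffic yet
    p_base, p_chal = base_ok / base_n, chal_ok / chal_n
    pooled = (base_ok + chal_ok) / (base_n + chal_n)
    se = sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / chal_n))
    z = (p_chal - p_base) / se
    if z >= z_crit:
        return "promote"               # challenger cleared the threshold
    if z <= -z_crit:
        return "rollback"              # challenger regressed vs. baseline
    return "inconclusive"              # extend the window or rotate
```

Note that the guardrail check comes first: a challenger that wins on the headline metric but trips a guardrail still rolls back.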
Two different jobs in the rollout. Most teams confuse them; both matter.
Production A/B testing: concurrent split traffic. Baseline and challenger run side by side on different conversations during the rollout gate.
Answers: does the challenger beat baseline on real users, right now?
Production verification: before vs. after on 100% of traffic, once the variant is fully rolled out. Confirms the lift persists over time.
Answers: did the fix actually move the metric in production after we shipped it?
Production A/B testing splits live traffic between your current agent (baseline) and a candidate variant (challenger), then measures which one performs better on the specific failure pattern the change targets. In Converra, it runs as a gated rollout step on variants that simulation has already picked as winners — not as the primary search mechanism.
Production verification compares before vs. after on 100% of traffic once a variant is fully rolled out. Production A/B testing runs concurrently — baseline and challenger serve real traffic at the same time, on different conversations, for an apples-to-apples comparison without time-based confounders like a model update or a Monday-vs-Tuesday traffic shift.
Converra argues against using manual A/B testing as the search mechanism — writing variants by hand, exposing real users to your guesses, and waiting weeks for significance. Production A/B testing in Converra is the opposite: simulation does the search, picks a winner against your scored personas, and the live A/B is a short, narrow gate to confirm the win on real traffic with auto-rollback if it doesn't hold up.
Traffic snaps back to baseline automatically — no manual intervention, no overnight regression. The failed challenger is re-queued for diagnosis with the live evidence attached: what it tried, where it broke, and which guardrail tripped. The next variant generated targets the actual failure mode, not the simulated one.
It runs until one of three things happens: the head-to-head threshold for lift is cleared (promote), a guardrail metric regresses (rollback), or the gate window expires without enough paired traffic (extend or rotate). High-volume agents typically resolve within 1–3 days. Low-volume agents may need to lean on simulation evidence and skip straight to production verification.
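The volume constraint is plain arithmetic: detecting a modest lift needs a minimum number of conversations per arm. A back-of-envelope sketch using the standard two-proportion sample-size formula — the rates, confidence, and power here are illustrative, not Converra's defaults:

```python
from math import ceil

def conversations_per_arm(p_base, p_target, z_alpha=1.96, z_beta=0.84):
    """Rough sample size per arm to detect a lift from p_base to p_target
    with a two-proportion test at ~95% confidence and ~80% power.
    Back-of-envelope only."""
    var = p_base * (1 - p_base) + p_target * (1 - p_target)
    return ceil((z_alpha + z_beta) ** 2 * var / (p_target - p_base) ** 2)

# Lifting goal completion from 70% to 75% needs roughly 1,250
# conversations per arm. An agent at ~1,000 conversations/day with a
# 15% challenger slice collects ~150 challenger conversations/day, so
# the gate resolves within days; at 50 conversations/day it would take
# months, which is why low-volume agents lean on simulation evidence.
n = conversations_per_arm(0.70, 0.75)
```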
The same failure pattern the fix targeted — routing errors, hallucinations, escalation rate, goal completion — measured with identical detectors on both arms. A small set of guardrail metrics (latency, escalation, cost) can independently trigger rollback even if the headline metric is improving.
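Guardrail-triggered rollback can be pictured as a set of hard bounds checked independently of the headline metric. A hypothetical sketch — the metric names and bounds below are invented for illustration, not Converra's configuration:

```python
# Hypothetical guardrail config: each metric gets a worst-acceptable bound.
# Any breach forces rollback even while the headline metric is improving.
GUARDRAILS = {
    "p95_latency_ms":    {"bound": 2500, "direction": "max"},
    "escalation_rate":   {"bound": 0.08, "direction": "max"},
    "cost_per_conv_usd": {"bound": 0.40, "direction": "max"},
}

def guardrail_breached(challenger_metrics: dict) -> list:
    """Return the names of guardrails the challenger tripped.

    A non-empty result triggers automatic rollback regardless of how the
    headline metric is trending.
    """
    breached = []
    for name, rule in GUARDRAILS.items():
        value = challenger_metrics[name]
        if rule["direction"] == "max" and value > rule["bound"]:
            breached.append(name)
    return breached
```

Keeping guardrails separate from the headline comparison is the design point: the head-to-head test asks "is the fix working?", while the guardrails ask "is anything else breaking?", and either answer can end the rollout.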
Connect your agent and gate every rollout with a live A/B against baseline — auto-rollback included.
Start for free