Guide

How to test AI agents before deploying prompt changes

You can't unit test a prompt. Simulation testing with synthetic personas lets you run hundreds of conversations against your change and deploy only what wins.

Why traditional testing doesn't work for prompts

Prompts aren't functions

A function has defined inputs and outputs. A prompt produces different responses to the same input depending on context, conversation history, and sampling randomness. Traditional test assertions don't work.

Edge cases are combinatorial

A 5-turn conversation with 3 possible user intents per turn creates 243 paths. You can't write test cases for all of them. You need to simulate the distribution.
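The combinatorics above can be checked directly. This is a minimal sketch: the three intent names are made up for illustration.

```python
from itertools import product

# Each turn the user can take one of 3 intents; over 5 turns the number
# of distinct conversation paths grows exponentially (3^5 = 243).
INTENTS = ["ask_question", "change_topic", "express_frustration"]
TURNS = 5

paths = list(product(INTENTS, repeat=TURNS))
print(len(paths))  # 243 distinct conversation paths
```

Writing 243 hand-authored test cases is impractical; sampling conversations from this space is what simulation does instead.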

Regressions are silent

Fixing one failure pattern can subtly degrade another. Without testing against the full range of scenarios, you won't know until production users hit it.

How simulation testing works

Replay real failure scenarios

Synthetic personas recreate the exact user behaviors that caused failures in production - same intents, same edge cases, same conversation patterns. Not random test inputs.

Run both versions head-to-head

The current prompt and the changed prompt both handle the same scenarios. Same personas, same conversation flows - the only variable is the prompt change.
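A head-to-head run can be sketched as follows. Everything here is a stand-in: `simulate` replaces a real agent-plus-persona conversation loop, and the persona names and outcomes are invented for illustration.

```python
# Both prompt versions face the same personas under the same conditions,
# so the prompt is the only variable in the comparison.
def simulate(prompt_version: str, persona: str) -> bool:
    """Return True if the simulated conversation reached its goal."""
    # Stand-in outcomes; a real system would run the LLM conversation.
    outcomes = {
        ("v1", "refund_seeker"): True,
        ("v1", "angry_repeat_caller"): False,
        ("v1", "confused_new_user"): True,
        ("v2", "refund_seeker"): True,
        ("v2", "angry_repeat_caller"): True,
        ("v2", "confused_new_user"): True,
    }
    return outcomes[(prompt_version, persona)]

PERSONAS = ["refund_seeker", "angry_repeat_caller", "confused_new_user"]

def pass_rate(version: str) -> float:
    return sum(simulate(version, p) for p in PERSONAS) / len(PERSONAS)

print(pass_rate("v1"), pass_rate("v2"))  # the changed prompt wins head-to-head
```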

Regression-check across all scenarios

The change is tested not just against the failure it fixes, but against the full scenario set. Improvements in one area can't silently degrade another.
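A regression gate over the full scenario set can be sketched like this. The scenario names, pass rates, and tolerance value are illustrative assumptions, not real measurements.

```python
# Per-scenario pass rates for the current prompt and the candidate.
baseline  = {"refund_flow": 0.92, "cancellation": 0.88, "billing_dispute": 0.61}
candidate = {"refund_flow": 0.91, "cancellation": 0.90, "billing_dispute": 0.85}

TOLERANCE = 0.02  # allow small noise before calling a drop a regression
TARGET = "billing_dispute"  # the failure pattern this change is meant to fix

regressions = [
    name for name in baseline
    if candidate[name] < baseline[name] - TOLERANCE
]
deploy = not regressions and candidate[TARGET] > baseline[TARGET]
print(deploy)  # True: target improved, nothing regressed beyond tolerance
```

The key property is that every scenario is checked, not just the one the change targets.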

Deploy winners with confidence

Only changes that improve the target metric without regressing others get deployed. Every deployment is backed by simulation evidence, not hope.

Manual testing vs. simulation testing

What you test. Manual: a few hand-picked scenarios. Simulation: hundreds of scenarios from real production patterns.
Regression detection. Manual: hope nothing else broke. Simulation: every scenario is re-tested against the change.
Time per change. Manual: hours of manual testing. Simulation: minutes, via automated simulation runs.
Statistical confidence. Manual: gut feeling from a few examples. Simulation: head-to-head comparison across the full scenario set.
Production risk. Manual: ship and monitor for complaints. Simulation: validated before deployment, verified after.

How many test conversations do you need to trust a prompt change?

It depends on the diversity of your user base and failure patterns. A narrow fix for a specific edge case might validate in 20-30 simulated conversations. A broad prompt restructuring that affects all conversations needs 100+ to surface regressions. The key is coverage of scenario types, not raw conversation count.
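As a rough way to build intuition for those numbers, a standard two-proportion sample-size estimate (normal approximation, ~95% confidence, ~80% power) shows why big lifts validate quickly and small lifts need many conversations. The pass rates below are illustrative.

```python
from math import ceil, sqrt

def conversations_needed(p_base: float, p_new: float,
                         z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Approximate conversations per version to detect the given lift."""
    p_bar = (p_base + p_new) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_base * (1 - p_base)
                                 + p_new * (1 - p_new))) ** 2
    return ceil(numerator / (p_new - p_base) ** 2)

# A large lift (60% -> 85% pass rate) validates in a few dozen runs;
# a small lift (80% -> 85%) needs hundreds per version.
print(conversations_needed(0.60, 0.85))
print(conversations_needed(0.80, 0.85))
```

This is a sketch for intuition only; coverage of distinct scenario types still matters more than raw count.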

Frequently asked questions

Can you unit test AI agent prompts?

Not meaningfully. Prompts produce variable outputs - the same input can generate different responses. Traditional assertions (expected output = actual output) don't work. Instead, you need behavioral testing: does the agent achieve its goals across a diverse set of conversations? Simulation testing with synthetic personas is the practical alternative.

What are synthetic personas?

Synthetic personas are AI-generated user profiles that simulate real user behaviors during testing. They're built from patterns observed in production conversations - real intents, real edge cases, real conversation flows. Unlike scripted test inputs, they can respond naturally and create realistic multi-turn interactions.
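A persona might be represented as a small structured profile like the sketch below. The field names and example values are assumptions for illustration, not a specific product's schema.

```python
from dataclasses import dataclass, field

@dataclass
class SyntheticPersona:
    name: str
    intent: str                 # what the user is trying to accomplish
    tone: str                   # e.g. "patient", "frustrated"
    # Behaviors observed in real production conversations.
    edge_cases: list = field(default_factory=list)

    def opening_message(self) -> str:
        return f"[{self.tone}] I need help with: {self.intent}"

persona = SyntheticPersona(
    name="angry_repeat_caller",
    intent="a refund for a duplicate charge",
    tone="frustrated",
    edge_cases=["interrupts mid-answer", "references a prior ticket"],
)
print(persona.opening_message())
```

In a real run, a profile like this would condition an LLM that plays the user, rather than emitting a fixed script.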

How is simulation testing different from running evals?

Evals grade agent outputs against expected results on static test sets. Simulation testing creates full multi-turn conversations with dynamic user behavior, then compares the changed version against the current version head-to-head. It tests the conversation flow, not just individual outputs.
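The structural difference can be shown in a few lines. Both the agent and the persona below are stubs standing in for LLM calls; the point is the shape of what gets graded.

```python
def agent_reply(prompt: str, history: list) -> str:
    return f"agent-turn-{len(history)}"   # stand-in for the agent's LLM

def persona_reply(history: list) -> str:
    return f"user-turn-{len(history)}"    # stand-in for a synthetic persona

def static_eval(prompt: str) -> bool:
    # An eval: one input, one output, one assertion.
    return agent_reply(prompt, ["user-turn-0"]) == "agent-turn-1"

def run_simulation(prompt: str, max_turns: int = 3) -> list:
    # A simulation: the persona keeps responding, producing a full flow.
    history = ["user-turn-0"]
    for _ in range(max_turns):
        history.append(agent_reply(prompt, history))
        history.append(persona_reply(history))
    return history  # the whole transcript is graded, not one output

transcript = run_simulation("prompt_v2")
print(len(transcript))  # 7 entries: opening message plus 3 exchanges
```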

How long does simulation testing take?

A typical simulation run with 50-100 conversations completes in minutes. The bottleneck is LLM inference, not setup. Converra parallelizes conversations and runs both the current and changed versions simultaneously for direct comparison.
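Because conversations are independent, they parallelize well when the bottleneck is waiting on LLM inference. A generic sketch (the sleep stands in for an API call; this is not Converra's implementation):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_conversation(scenario_id: int) -> str:
    time.sleep(0.01)  # stand-in for LLM inference latency
    return f"scenario-{scenario_id}: pass"

start = time.perf_counter()
# 100 conversations across 16 workers finish in a fraction of the
# serial time, since each thread mostly waits on I/O.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(run_conversation, range(100)))
elapsed = time.perf_counter() - start
print(len(results), round(elapsed, 2))
```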

What if my prompt change passes simulation but fails in production?

Production verification catches this. After deployment, Converra measures whether the fix actually improved the target metric using real production conversations. If it didn't, the fix is marked as 'not fixed' and the failure pattern is re-queued for diagnosis with additional evidence from the failed fix.

Ship prompt changes with confidence

Connect your agent, run simulations against your changes, and deploy only what wins.

Start for free