The rate of meaningful, verified movement on any tracked quality metric — per agent, per unit time. Counts an improvement only if it is statistically real, non-regressive, production-verified, and above a published minimum lift threshold.
A proposed industry-standard metric for how fast an agent is actually getting better. Covers both reactive improvements (fixing failures) and proactive ones (raising the ceiling on working behavior). Can't be computed from observability or eval tools alone — requires the full closed loop from diagnosis through production verification.
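The four gates above can be sketched as a small scoring function. This is a minimal illustration, not Converra's implementation: the `Improvement` type, field names, and the `MIN_LIFT` value are all assumptions.

```python
from dataclasses import dataclass

MIN_LIFT = 0.02  # assumed published minimum lift threshold, not a real published value

@dataclass
class Improvement:
    agent_id: str
    lift: float                # head-to-head delta on the tracked metric
    statistically_real: bool   # passed a significance test
    non_regressive: bool       # broke no golden scenario
    production_verified: bool  # confirmed on real post-rollout traffic

def counts_toward_velocity(imp: Improvement) -> bool:
    """An improvement counts only if it clears all four gates."""
    return (imp.statistically_real
            and imp.non_regressive
            and imp.production_verified
            and imp.lift >= MIN_LIFT)

def velocity(improvements: list[Improvement], weeks: float) -> float:
    """Verified, counted improvements per agent per unit time (here: per week)."""
    return sum(counts_toward_velocity(i) for i in improvements) / weeks

imps = [
    Improvement("support-bot", 0.05, True, True, True),   # counts
    Improvement("support-bot", 0.01, True, True, True),   # below threshold
    Improvement("support-bot", 0.08, True, False, True),  # regressive
]
print(velocity(imps, weeks=4))  # 0.25
```

Only the first improvement clears every gate, so four weeks of work yields a velocity of 0.25 verified improvements per week.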
Agent drift
The gradual degradation of an AI agent's performance in production over time, caused by accumulating edge cases, model updates, stale context, and shifting user expectations.
Agent drift is broader than model drift. A prompt can drift even when the underlying model hasn't changed — new customers introduce patterns the prompt wasn't tuned for, provider models shift behavior silently, and product context goes stale. Drift compounds: small per-week degradations are invisible in any single conversation but add up to large drops over months.
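The compounding claim is simple arithmetic. The 0.5% weekly degradation below is an illustrative number, not a measured rate:

```python
# A 0.5% weekly quality loss looks like noise in any single week,
# but compounds into a large drop over months.
weekly_retention = 1 - 0.005

def quality_after(weeks: int) -> float:
    """Remaining quality (out of 100) after compounding weekly drift."""
    return 100 * weekly_retention ** weeks

print(round(quality_after(1), 1))   # 99.5: indistinguishable from noise
print(round(quality_after(26), 1))  # 87.8: a visible half-year drop
```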
Step-level diagnosis
Identifying the exact step and turn in a multi-turn agent conversation where a failure originated, with a root-cause category attached (prompt, model, config, orchestration).
Conversation-level scoring tells you a conversation failed; step-level diagnosis tells you where and why. Useful when the same visible failure (e.g. "agent didn't book") can have five different underlying causes — each requiring a different fix.
Production verification
Measuring whether a deployed fix actually reduced failures using real production data after rollout — not just simulation scores.
Simulation tells you a variant is likely better; production verification proves it was. After deployment, failure rates on real traffic are compared before vs. after. Outcomes are marked verified (it worked), not fixed (it didn't), or confounded (too many simultaneous changes to attribute the result).
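The before/after comparison and three-way outcome can be sketched as follows. The function signature and the rule for flagging confounds are illustrative assumptions, not Converra's actual logic:

```python
def verify_fix(failures_before: int, total_before: int,
               failures_after: int, total_after: int,
               other_changes: int) -> str:
    """Classify a deployed fix from real pre/post-rollout traffic.

    other_changes counts unrelated changes that shipped in the same
    window; any such change makes attribution unreliable.
    """
    if other_changes > 0:
        return "confounded"  # can't attribute the delta to this fix
    before_rate = failures_before / total_before
    after_rate = failures_after / total_after
    return "verified" if after_rate < before_rate else "not fixed"

print(verify_fix(40, 500, 12, 480, other_changes=0))  # verified
print(verify_fix(40, 500, 45, 510, other_changes=0))  # not fixed
print(verify_fix(40, 500, 10, 480, other_changes=2))  # confounded
```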
Simulation testing
Running an agent against synthetic personas and scenarios offline to evaluate behavior without touching production traffic.
Simulation testing validates prompt or config changes before they reach users. High-quality simulation depends on personas derived from real production patterns, not generic templates — otherwise tests pass while real failures persist.
Regression testing
Validating every proposed change against scenarios the current agent already handles correctly, to guarantee no customer sees a worse experience.
Traditional software regression testing compares code output against expected strings. Agent regression testing compares behavior against a golden set of conversations — typically generated from production cases the baseline handles well. A change that improves one metric but breaks a golden case is blocked.
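The blocking rule can be expressed in a few lines. Scenario names and the results format are hypothetical:

```python
def passes_regression(variant_results: dict[str, bool],
                      golden_ids: set[str]) -> bool:
    """Block any variant that fails a golden case, regardless of
    how much it improves other metrics. A golden case missing from
    the results counts as a failure."""
    return all(variant_results.get(g, False) for g in golden_ids)

golden = {"refund-happy-path", "booking-with-date-change"}
results = {
    "refund-happy-path": True,
    "booking-with-date-change": False,  # regression: baseline handled this
    "new-edge-case": True,              # improvement elsewhere doesn't matter
}
print(passes_regression(results, golden))  # False: the change is blocked
```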
Persona
Generated user profiles that drive simulated conversations with an agent under test, derived from real production patterns rather than imagined archetypes.
A persona encodes who the user is, what they want, and how they communicate. Persona quality determines simulation quality — personas sampled from actual production traffic surface the same failure modes production does, while generic personas miss them.
Golden scenario
A fixed set of conversations the current agent handles well, used as the regression suite for any proposed change.
Golden scenarios act as a floor: a variant may improve headline metrics, but if it fails a golden scenario, it ships a regression. Maintained automatically from production conversations that pass all quality gates.
Head-to-head comparison
Scoring a variant against the baseline only on personas and scenarios both actually ran — never on baseline-only data.
Head-to-head is the only statistically honest way to measure variant lift. Comparing a variant's average score against the baseline's overall average introduces sampling bias; head-to-head pairs eliminate it. Converra uses head-to-head pairs as the single source of truth for winner selection.
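The sampling bias is easy to demonstrate with toy data (scenario names and scores below are invented): a naive comparison of overall averages credits the variant for an easy scenario the baseline never ran, while the head-to-head lift on shared scenarios is zero.

```python
from statistics import mean

# Per-scenario scores; "greeting" is an easy scenario only the variant ran.
baseline = {"checkout": 0.9, "refund": 0.8}
variant  = {"refund": 0.8, "greeting": 0.95}

# Naive: compare overall averages across different scenario sets.
naive_lift = mean(variant.values()) - mean(baseline.values())

# Head-to-head: only scenarios both baseline and variant actually ran.
paired = baseline.keys() & variant.keys()
h2h_lift = mean(variant[s] - baseline[s] for s in paired)

print(round(naive_lift, 3))  # 0.025: flattered by the easy draw
print(round(h2h_lift, 3))    # 0.0: no real improvement
```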
Lift
The head-to-head performance delta of a variant over the baseline, expressed per metric (success, sentiment, clarity, relevancy).
A variant only beats the baseline if its head-to-head lift is strictly positive. Non-positive lift means baseline remains the effective winner — even if the variant's absolute scores look higher, because absolute scores can be driven by easier scenario draws.
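One way to encode the winner rule, under the assumption that a variant must show strictly positive head-to-head lift on every tracked metric (the page does not specify whether one leading metric suffices):

```python
def effective_winner(lift_by_metric: dict[str, float]) -> str:
    """Baseline stays the effective winner unless head-to-head lift
    is strictly positive; requiring every metric is an assumption."""
    all_positive = all(lift > 0 for lift in lift_by_metric.values())
    return "variant" if all_positive else "baseline"

print(effective_winner({"success": 0.04, "sentiment": 0.01,
                        "clarity": 0.02, "relevancy": 0.0}))   # baseline
print(effective_winner({"success": 0.04, "sentiment": 0.01,
                        "clarity": 0.02, "relevancy": 0.03}))  # variant
```

Note the zero-lift relevancy metric alone keeps baseline as the winner: non-positive is not a tie.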
Evidence level
Insufficient, low, medium, or high — a qualitative rating of how many head-to-head pairs support a winner decision.
Evidence is counted in pairs of conversations where baseline and variant both ran. Baseline-only or variant-only conversations don't count. Low evidence means the measured lift is directionally useful but not decision-grade; high evidence means the result is robust to noise.
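A minimal sketch of the rating, where the numeric cutoffs are illustrative assumptions, not published thresholds:

```python
def evidence_level(pair_count: int) -> str:
    """Rate winner-decision evidence by head-to-head pair count.
    Cutoffs here are illustrative, not real published thresholds."""
    if pair_count < 5:
        return "insufficient"
    if pair_count < 20:
        return "low"       # directionally useful, not decision-grade
    if pair_count < 50:
        return "medium"
    return "high"          # robust to noise

# Baseline-only or variant-only conversations never enter pair_count.
print(evidence_level(3), evidence_level(12), evidence_level(80))
```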
Prompt variant
A candidate replacement for an agent's system prompt, generated from a diagnosed failure pattern and tested against baseline in simulation.
Variants are evolutionary — small, targeted changes tied to a specific failure mode — rather than rewrites. This keeps the change set reviewable and the regression surface bounded.
Agent failure mode
A named, diagnosable way an AI agent breaks in production — e.g. skipped clarifying question, hallucinated tool call, context loss at handoff.
Failure modes are stable, citable IDs (AFM-01 through AFM-16) covering behavior gaps, orchestration errors, grounding errors, safety/security issues, and resource reliability. They make diagnoses comparable across agents and teams.
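A partial mapping of the IDs to names, limited to the failure modes this page itself cites; the full AFM-01 through AFM-16 taxonomy is not reproduced here.

```python
# Only the failure-mode IDs mentioned on this page; the taxonomy
# also covers behavior gaps, safety/security, and resource issues.
AFM = {
    "AFM-06": "intent misclassification",
    "AFM-09": "hallucinated tool call",
    "AFM-10": "context loss at handoff",
    "AFM-12": "unverified factual claim",
    "AFM-14": "prompt injection",
}

def cite(afm_id: str) -> str:
    """Render a stable, citable diagnosis label."""
    return f"{afm_id}: {AFM[afm_id]}"

print(cite("AFM-10"))  # AFM-10: context loss at handoff
```

Because the IDs are stable, two teams diagnosing different agents can compare "AFM-10 at turn 7" directly, which free-text diagnoses can't do.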
Hallucination
An agent producing a fluent, confident output that isn't grounded in retrieved context, tool results, or verifiable source material.
In agent systems, hallucinations usually surface as unverified factual claims (AFM-12) or hallucinated tool calls (AFM-09). Grounding defenses — retrieval, citation checks, and tool-output verification — prevent most of them at the prompt layer.
Multi-agent system
A system in which multiple specialized agents hand off tasks to each other, coordinated by a router or planner agent.
The orchestrator is usually the weakest link — failures often show up as context loss at handoff (AFM-10) or intent misclassification (AFM-06) at the routing layer, not in any individual agent. System-level diagnosis is required; single-agent evaluation misses these failures.
Prompt injection
A security failure in which attacker-controlled input overrides an agent's instructions via conversational or retrieved content.
Tracked as AFM-14 in the agent failure mode taxonomy. Defense is architectural (trust boundaries, output validation, constrained tools) more than prompt-based — instructions to "ignore injections" are themselves injectable.
Fleet intelligence
Aggregated view of failure patterns, business impact, and step-level failure rates across every agent an organization runs.
Single-agent views miss patterns that repeat across agents — e.g. the same tool argument format issue affecting three products. Fleet-level aggregation surfaces systemic issues and prioritizes fixes by total business impact, not per-agent severity.
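The aggregation can be sketched as a group-by over per-agent diagnoses. The agent names, impact figures, and record shape are invented for illustration:

```python
from collections import defaultdict

# Hypothetical per-agent diagnoses: (agent, failure_mode, impact_usd).
diagnoses = [
    ("support-bot", "AFM-09", 1200.0),
    ("booking-bot", "AFM-09", 3400.0),
    ("billing-bot", "AFM-09", 900.0),
    ("support-bot", "AFM-06", 5000.0),
]

# Fleet view: total business impact and agent spread per failure mode.
impact_by_mode: dict[str, float] = defaultdict(float)
agents_hit: dict[str, set[str]] = defaultdict(set)
for agent, mode, impact in diagnoses:
    impact_by_mode[mode] += impact
    agents_hit[mode].add(agent)

# AFM-06 is the worst single diagnosis, but AFM-09 repeats across
# three agents: the systemic pattern a single-agent view would miss.
for mode, total in sorted(impact_by_mode.items(), key=lambda kv: -kv[1]):
    print(mode, total, len(agents_hit[mode]))
```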
See these concepts in action
Converra applies every term on this page — diagnosis, simulation, regression, verification — to your agents automatically.