The rate of meaningful, verified movement on any tracked quality metric — per agent, per unit time. Counts an improvement only if it is statistically real, non-regressive, production-verified, and above a published minimum lift threshold.
A proposed industry-standard metric for how fast an agent is actually getting better. Covers both reactive improvements (fixing failures) and proactive ones (raising the ceiling on working behavior). Can't be computed from observability or eval tools alone — requires the full closed loop from diagnosis through production verification.
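The four gates above can be sketched as a small scoring function. This is a minimal illustration, not Converra's implementation: the `Improvement` type, field names, and the `MIN_LIFT` value are all assumptions.

```python
from dataclasses import dataclass

MIN_LIFT = 0.02  # assumed published minimum lift threshold, not a real published value

@dataclass
class Improvement:
    agent_id: str
    lift: float                # head-to-head delta on the tracked metric
    statistically_real: bool   # passed a significance test
    non_regressive: bool       # broke no golden scenario
    production_verified: bool  # confirmed on real post-rollout traffic

def counts_toward_velocity(imp: Improvement) -> bool:
    """An improvement counts only if it clears all four gates."""
    return (imp.statistically_real
            and imp.non_regressive
            and imp.production_verified
            and imp.lift >= MIN_LIFT)

def velocity(improvements: list[Improvement], weeks: float) -> float:
    """Verified, counted improvements per agent per unit time (here: per week)."""
    return sum(counts_toward_velocity(i) for i in improvements) / weeks

imps = [
    Improvement("support-bot", 0.05, True, True, True),   # counts
    Improvement("support-bot", 0.01, True, True, True),   # below threshold
    Improvement("support-bot", 0.08, True, False, True),  # regressive
]
print(velocity(imps, weeks=4))  # 0.25
```

Only the first improvement clears every gate, so four weeks of work yields a velocity of 0.25 verified improvements per week.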
Agent drift
The gradual degradation of an AI agent's performance in production over time, caused by accumulating edge cases, model updates, stale context, and shifting user expectations.
Agent drift is broader than model drift. A prompt can drift even when the underlying model hasn't changed — new customers introduce patterns the prompt wasn't tuned for, provider models shift behavior silently, and product context goes stale. Drift compounds: small per-week degradations are invisible in any single conversation but add up to large drops over months.
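The compounding claim is simple arithmetic. The 0.5% weekly degradation below is an illustrative number, not a measured rate:

```python
# A 0.5% weekly quality loss looks like noise in any single week,
# but compounds into a large drop over months.
weekly_retention = 1 - 0.005

def quality_after(weeks: int) -> float:
    """Remaining quality (out of 100) after compounding weekly drift."""
    return 100 * weekly_retention ** weeks

print(round(quality_after(1), 1))   # 99.5: indistinguishable from noise
print(round(quality_after(26), 1))  # 87.8: a visible half-year drop
```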
Step-level diagnosis
Identifying the exact step and turn in a multi-turn agent conversation where a failure originated, with a root-cause category attached (prompt, model, config, orchestration).
Conversation-level scoring tells you a conversation failed; step-level diagnosis tells you where and why. Useful when the same visible failure (e.g. "agent didn't book") can have five different underlying causes — each requiring a different fix.
Production verification
Measuring whether a deployed fix actually reduced failures using real production data after rollout — not just simulation scores.
Simulation tells you a variant is likely better; production verification proves it was. After deployment, failure rates on real traffic are compared before vs. after. Outcomes are marked verified (it worked), not fixed (it didn't), or confounded (too many simultaneous changes to attribute the result).
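The before/after comparison and three-way outcome can be sketched as follows. The function signature and the rule for flagging confounds are illustrative assumptions, not Converra's actual logic:

```python
def verify_fix(failures_before: int, total_before: int,
               failures_after: int, total_after: int,
               other_changes: int) -> str:
    """Classify a deployed fix from real pre/post-rollout traffic.

    other_changes counts unrelated changes that shipped in the same
    window; any such change makes attribution unreliable.
    """
    if other_changes > 0:
        return "confounded"  # can't attribute the delta to this fix
    before_rate = failures_before / total_before
    after_rate = failures_after / total_after
    return "verified" if after_rate < before_rate else "not fixed"

print(verify_fix(40, 500, 12, 480, other_changes=0))  # verified
print(verify_fix(40, 500, 45, 510, other_changes=0))  # not fixed
print(verify_fix(40, 500, 10, 480, other_changes=2))  # confounded
```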
Simulation testing
Running an agent against synthetic personas and scenarios offline to evaluate behavior without touching production traffic.
Simulation testing validates prompt or config changes before they reach users. High-quality simulation depends on personas derived from real production patterns, not generic templates — otherwise tests pass while real failures persist.
Regression testing
Validating every proposed change against scenarios the current agent already handles correctly, to guarantee no customer sees a worse experience.
Traditional software regression testing compares code output against expected strings. Agent regression testing compares behavior against a golden set of conversations — typically generated from production cases the baseline handles well. A change that improves one metric but breaks a golden case is blocked.
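The blocking rule can be expressed in a few lines. Scenario names and the results format are hypothetical:

```python
def passes_regression(variant_results: dict[str, bool],
                      golden_ids: set[str]) -> bool:
    """Block any variant that fails a golden case, regardless of
    how much it improves other metrics. A golden case missing from
    the results counts as a failure."""
    return all(variant_results.get(g, False) for g in golden_ids)

golden = {"refund-happy-path", "booking-with-date-change"}
results = {
    "refund-happy-path": True,
    "booking-with-date-change": False,  # regression: baseline handled this
    "new-edge-case": True,              # improvement elsewhere doesn't matter
}
print(passes_regression(results, golden))  # False: the change is blocked
```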
Persona
Generated user profiles that drive simulated conversations with an agent under test, derived from real production patterns rather than imagined archetypes.
A persona encodes who the user is, what they want, and how they communicate. Persona quality determines simulation quality — personas sampled from actual production traffic surface the same failure modes production does, while generic personas miss them.
Golden scenario
A fixed set of conversations the current agent handles well, used as the regression suite for any proposed change.
Golden scenarios act as a floor: a variant may improve headline metrics, but if it fails a golden scenario, it ships a regression. Maintained automatically from production conversations that pass all quality gates.
Head-to-head comparison
Scoring a variant against the baseline only on personas and scenarios both actually ran — never on baseline-only data.
Head-to-head is the only statistically honest way to measure variant lift. Comparing a variant's average score against the baseline's overall average introduces sampling bias; head-to-head pairs eliminate it. Converra uses head-to-head pairs as the single source of truth for winner selection.
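The sampling bias is easy to demonstrate with toy data (scenario names and scores below are invented): a naive comparison of overall averages credits the variant for an easy scenario the baseline never ran, while the head-to-head lift on shared scenarios is zero.

```python
from statistics import mean

# Per-scenario scores; "greeting" is an easy scenario only the variant ran.
baseline = {"checkout": 0.9, "refund": 0.8}
variant  = {"refund": 0.8, "greeting": 0.95}

# Naive: compare overall averages across different scenario sets.
naive_lift = mean(variant.values()) - mean(baseline.values())

# Head-to-head: only scenarios both baseline and variant actually ran.
paired = baseline.keys() & variant.keys()
h2h_lift = mean(variant[s] - baseline[s] for s in paired)

print(round(naive_lift, 3))  # 0.025: flattered by the easy draw
print(round(h2h_lift, 3))    # 0.0: no real improvement
```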
Lift
The head-to-head performance delta of a variant over the baseline, expressed per metric (success, sentiment, clarity, relevancy).
A variant only beats the baseline if its head-to-head lift is strictly positive. Non-positive lift means baseline remains the effective winner — even if the variant's absolute scores look higher, because absolute scores can be driven by easier scenario draws.
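One way to encode the winner rule, under the assumption that a variant must show strictly positive head-to-head lift on every tracked metric (the page does not specify whether one leading metric suffices):

```python
def effective_winner(lift_by_metric: dict[str, float]) -> str:
    """Baseline stays the effective winner unless head-to-head lift
    is strictly positive; requiring every metric is an assumption."""
    all_positive = all(lift > 0 for lift in lift_by_metric.values())
    return "variant" if all_positive else "baseline"

print(effective_winner({"success": 0.04, "sentiment": 0.01,
                        "clarity": 0.02, "relevancy": 0.0}))   # baseline
print(effective_winner({"success": 0.04, "sentiment": 0.01,
                        "clarity": 0.02, "relevancy": 0.03}))  # variant
```

Note the zero-lift relevancy metric alone keeps baseline as the winner: non-positive is not a tie.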
Evidence level
Insufficient, low, medium, or high — a qualitative rating of how many head-to-head pairs support a winner decision.
Evidence is counted in pairs of conversations where baseline and variant both ran. Baseline-only or variant-only conversations don't count. Low evidence means the measured lift is directionally useful but not decision-grade; high evidence means the result is robust to noise.
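A minimal sketch of the rating, where the numeric cutoffs are illustrative assumptions, not published thresholds:

```python
def evidence_level(pair_count: int) -> str:
    """Rate winner-decision evidence by head-to-head pair count.
    Cutoffs here are illustrative, not real published thresholds."""
    if pair_count < 5:
        return "insufficient"
    if pair_count < 20:
        return "low"       # directionally useful, not decision-grade
    if pair_count < 50:
        return "medium"
    return "high"          # robust to noise

# Baseline-only or variant-only conversations never enter pair_count.
print(evidence_level(3), evidence_level(12), evidence_level(80))
```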
Prompt variant
A candidate replacement for an agent's system prompt, generated from a diagnosed failure pattern and tested against baseline in simulation.
Variants are evolutionary — small, targeted changes tied to a specific failure mode — rather than rewrites. This keeps the change set reviewable and the regression surface bounded.
Agent failure mode
A named, diagnosable way an AI agent breaks in production — e.g. skipped clarifying question, hallucinated tool call, context loss at handoff.
Failure modes are stable, citable IDs (AFM-01 through AFM-16) covering behavior gaps, orchestration errors, grounding errors, safety/security issues, and resource reliability. They make diagnoses comparable across agents and teams.
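A partial mapping of the IDs to names, limited to the failure modes this page itself cites; the full AFM-01 through AFM-16 taxonomy is not reproduced here.

```python
# Only the failure-mode IDs mentioned on this page; the taxonomy
# also covers behavior gaps, safety/security, and resource issues.
AFM = {
    "AFM-06": "intent misclassification",
    "AFM-09": "hallucinated tool call",
    "AFM-10": "context loss at handoff",
    "AFM-12": "unverified factual claim",
    "AFM-14": "prompt injection",
}

def cite(afm_id: str) -> str:
    """Render a stable, citable diagnosis label."""
    return f"{afm_id}: {AFM[afm_id]}"

print(cite("AFM-10"))  # AFM-10: context loss at handoff
```

Because the IDs are stable, two teams diagnosing different agents can compare "AFM-10 at turn 7" directly, which free-text diagnoses can't do.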
Hallucination
An agent producing a fluent, confident output that isn't grounded in retrieved context, tool results, or verifiable source material.
In agent systems, hallucinations usually surface as unverified factual claims (AFM-12) or hallucinated tool calls (AFM-09). Grounding defenses — retrieval, citation checks, and tool-output verification — prevent most of them at the prompt layer.
Multi-agent system
A system in which multiple specialized agents hand off tasks to each other, coordinated by a router or planner agent.
The orchestrator is usually the weakest link — failures often show up as context loss at handoff (AFM-10) or intent misclassification (AFM-06) at the routing layer, not in any individual agent. System-level diagnosis is required; single-agent evaluation misses these failures.
Prompt injection
A security failure in which attacker-controlled input overrides an agent's instructions via conversational or retrieved content.
Tracked as AFM-14 in the agent failure mode taxonomy. Defense is architectural (trust boundaries, output validation, constrained tools) more than prompt-based — instructions to "ignore injections" are themselves injectable.
Fleet intelligence
Aggregated view of failure patterns, business impact, and step-level failure rates across every agent an organization runs.
Single-agent views miss patterns that repeat across agents — e.g. the same tool argument format issue affecting three products. Fleet-level aggregation surfaces systemic issues and prioritizes fixes by total business impact, not per-agent severity.
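The aggregation can be sketched as a group-by over per-agent diagnoses. The agent names, impact figures, and record shape are invented for illustration:

```python
from collections import defaultdict

# Hypothetical per-agent diagnoses: (agent, failure_mode, impact_usd).
diagnoses = [
    ("support-bot", "AFM-09", 1200.0),
    ("booking-bot", "AFM-09", 3400.0),
    ("billing-bot", "AFM-09", 900.0),
    ("support-bot", "AFM-06", 5000.0),
]

# Fleet view: total business impact and agent spread per failure mode.
impact_by_mode: dict[str, float] = defaultdict(float)
agents_hit: dict[str, set[str]] = defaultdict(set)
for agent, mode, impact in diagnoses:
    impact_by_mode[mode] += impact
    agents_hit[mode].add(agent)

# AFM-06 is the worst single diagnosis, but AFM-09 repeats across
# three agents: the systemic pattern a single-agent view would miss.
for mode, total in sorted(impact_by_mode.items(), key=lambda kv: -kv[1]):
    print(mode, total, len(agents_hit[mode]))
```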
See these concepts in action
Converra applies every term on this page — diagnosis, simulation, regression, verification — to your agents automatically.