Metric proposal

Your agent's improvement rate is a liability (or a moat)

Agent Improvement Rate (AIR) measures which one — the net pace of verified quality change on a production agent, improvements minus regressions per unit time, gated so the number means something when a team reports it.

Most teams can't tell improvement from noise

Most teams running agents in production ship changes constantly — prompts, orchestration code, routing, tool definitions, retrieval corpus, model config, the eval harness. Small teams ship daily. Larger teams ship weekly or more. The volume isn't the problem.

The problem is readiness. Most of these agents ship before they're battle-ready: tested on a few dozen scenarios, then exposed to production traffic that surfaces hundreds the team didn't model. Real usage finds edge cases faster than the improvement loop closes them. The agent falls behind on day one and stays behind.

What the loop keeps skipping is verification: matched head-to-head evaluation before deploy, regression checks against prior behavior, and measurement of real-traffic impact after deploy. These are the expensive human steps. They get dropped, and the shipped change becomes a coin flip instead of an improvement.

The data to compute AIR honestly doesn't exist at most teams. Not because the number would be embarrassing — because the verification steps AIR requires are the ones the loop is already skipping.

Existing metrics don't catch this

Existing metrics describe where an agent is, not how fast it's moving. Win rate and task success are snapshots. Regression rate is defensive. DORA deployment frequency counts deploys, not quality change.

LangSmith, Braintrust, and Galileo come closest: lift-over-baseline summed across experiments is a rate of sorts. Call that Eval Improvement Rate. It counts lift measured in eval — much of which doesn't survive the shift in distribution when it reaches production.

The rate that matters is the one measured on real traffic after deploy. Nobody names it.

Agent quality is the moat. Velocity keeps it.

Customers buy better agents. Quality is what you compete on. The question is whether your quality advantage is defensible, or whether a competitor with the same model access and a similar starting point can catch up.

At manual-loop speed, they can. What's observable about your agent — the prompts, the visible output patterns — leaks through traces, jailbreaks, churned employees, or reverse engineering. A competitor willing to grind can roughly match what you released this quarter.

At high enough improvement velocity, they can't. Each verified improvement raises the baseline the next is measured against, so the team with the faster loop pulls further ahead every cycle. And what compounds isn't only the observable layer: routing, retrieval corpus, tool definitions, and orchestration don't leak. By the time a competitor recreates what they can see, you're already several improvements later on the layers they can't.

Traffic amplifies this. A high-traffic agent reaches statistical significance in hours; a low-traffic one might need days of real traffic to verify the same change. So scale itself accelerates the loop: the agent with more customers today compounds improvements faster than the one with fewer. An early quality-plus-traffic lead compounds into a quality lead that new entrants can't close by working harder.

That's why the rate deserves to be named and measured. Quality is what the customer sees. Velocity is what determines whether anyone can take it from you.

Definition

AIR = how fast your agent actually gets better in production, net of regressions.

Formally: the net rate of verified quality change per agent per unit time — verified improvements minus verified regressions, each event gated below. Signed, so a team shipping more regressions than improvements runs at negative AIR.

  1. Not noise: statistically real

    The new version has to actually beat the old one in a matched comparison — not just score high on its own. (Technically: head-to-head pairs with evidence at medium or higher.)

  2. Not a trade: non-regressive

    You don't get to win on one dimension by quietly losing on another. The golden-scenario regression suite passes, and no other tracked metric drops more than the team's published tolerance.

  3. Not fake: production-verified

    The improvement has to show up on real traffic, not just in eval. A meaningful share of eval lift doesn't survive contact with production, and lift that doesn't survive doesn't count. (Technically: the target metric moves in the intended direction at p < 0.05 with statistical power ≥ 80% at the team's published minimum detectable effect, or MDE. Verification window: hours for high-traffic agents, days or weeks for low-traffic ones. Publish the MDE: p < 0.05 without power is a lottery ticket.) See production verification.

  4. Not trivial: above floor

    The lift has to be big enough to care about. Default floor: 1% absolute or 5% relative, whichever is higher for the metric. Publishing the floor matters more than the specific number — a team with a 0.1% floor is telling you to ignore their AIR.
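
The four gates above can be sketched as a single check on a candidate change. This is a minimal illustration, not a reference implementation: the `Candidate` fields, the `low`/`medium`/`high` ordering of evidence labels, and the reading of the default floor as "the larger of 0.01 absolute and 5% of the baseline" (which only makes sense for metrics expressed as fractions) are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed ordering of evidence labels; the spec only says "medium or higher".
EVIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


@dataclass
class Candidate:
    """One candidate change plus its verification results (hypothetical fields)."""
    evidence: str            # gate 1: evidence level from matched head-to-head pairs
    suite_passed: bool       # gate 2: golden-scenario regression suite result
    worst_side_drop: float   # gate 2: largest drop on any other tracked metric
    tolerance: float         # gate 2: the team's published tolerance
    p_value: float           # gate 3: production test on the target metric
    power: float             # gate 3: statistical power at the published MDE
    right_direction: bool    # gate 3: metric moved the intended way
    baseline: float          # gate 4: target metric before the change (as a fraction)
    lift: float              # gate 4: absolute change in the target metric


def required_lift(baseline, abs_floor=0.01, rel_floor=0.05):
    """Default floor, read as the larger of the absolute and relative floors."""
    return max(abs_floor, rel_floor * abs(baseline))


def gate_results(c):
    """Return a pass/fail flag for each of the four gates."""
    return {
        "statistically_real": EVIDENCE_RANK[c.evidence] >= EVIDENCE_RANK["medium"],
        "non_regressive": c.suite_passed and c.worst_side_drop <= c.tolerance,
        "production_verified": c.right_direction and c.p_value < 0.05 and c.power >= 0.80,
        "above_floor": abs(c.lift) >= required_lift(c.baseline),
    }


def counts_toward_air(c):
    """An event counts only if it clears all four gates."""
    return all(gate_results(c).values())


# Example: a success-rate change from 0.80 to 0.85 with clean verification.
example = Candidate("medium", True, 0.0, 0.02, 0.01, 0.85, True, 0.80, 0.05)
print(gate_results(example), counts_toward_air(example))
```

Returning the per-gate flags, rather than one boolean, keeps the audit trail readable: a rejected candidate shows which gate it failed.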

Any tracked metric qualifies: success rate, sentiment, cost per conversation, latency, tool-call precision, hallucination rate, containment. AIR is always expressed as a rate — if your support agent accumulates a net of 2.4 gate-passing events per month, its AIR is 2.4/month. A month with 3 improvements and 5 regressions runs at −2/month.
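
The arithmetic is small enough to write down. The sketch below assumes each gate-passing event has already been recorded as +1 (verified improvement) or -1 (verified regression); the function name and the event encoding are illustrative, not part of the spec.

```python
def air_per_month(events, window_months):
    """Net gate-passing events per month: improvements minus regressions."""
    improvements = sum(1 for e in events if e > 0)
    regressions = sum(1 for e in events if e < 0)
    return (improvements - regressions) / window_months


# The month described above: 3 verified improvements, 5 verified regressions.
month = [+1, +1, +1, -1, -1, -1, -1, -1]
print(air_per_month(month, window_months=1))  # -2.0
```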

A change can be any edit to the agent system that affects production behavior: prompts, orchestration code, routing logic, tool definitions, retrieval corpus, model config, evaluation harness, deployment infrastructure. The audit trail is your change log — git commits, deploy records, data-pipeline runs. Each candidate either passes the gates or doesn't count.

AIR counts both reactive improvements (fixing a regression) and proactive ones (raising the ceiling on something that wasn't broken). A support agent at 4.3 sentiment pushed to 4.6, regression suite clean, verified in production, is +1 on the sentiment axis — even though nothing was on fire.

Prerequisite: a quality curve

Computing AIR assumes you're already measuring agent quality over time. Without a continuous quality curve, there's nothing for verified improvements to step up on. Most teams that can't report AIR today can't because this substrate doesn't exist yet — not because the loop is failing, but because nobody has been scoring the agent continuously.

For the scoring mechanics: production verification covers before/after measurement on real traffic, and regression testing covers how head-to-head pairs produce the evidence levels gate 1 requires. What neither page covers in depth — and what's worth a separate piece — is how to maintain the continuous quality curve AIR sits on top of.

Gate 3 is the load-bearing one

All four gates are discipline. Gate 3 is where AIR separates from Eval Improvement Rate.

A meaningful share of eval lift doesn't survive contact with production distribution. Scenario coverage is partial, persona drift is real, and production mixes keep shifting. Eval-only numbers describe what teams hoped would happen.

AIR forces the verification onto real traffic. That's what makes the number expensive to produce and hard to game.
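
To make the cost concrete, here is a rough sketch of how much real traffic gate 3 demands. It assumes the target metric is a binary success rate, uses the common normal-approximation sample-size formula for comparing two proportions, and assumes traffic is split evenly between the old and new versions. The function names and the example traffic figures are hypothetical.

```python
import math
from statistics import NormalDist


def samples_per_arm(baseline, mde, alpha=0.05, power=0.80):
    """Approximate conversations needed per arm to detect an absolute lift of `mde`.

    Normal approximation for a two-sided, two-proportion comparison.
    """
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def hours_to_verify(baseline, mde, conversations_per_hour):
    """Hours of traffic needed when conversations are split 50/50 across two arms."""
    return 2 * samples_per_arm(baseline, mde) / conversations_per_hour


# Same change, same published MDE of 2 points on an 80% baseline, different traffic.
for name, rate in [("high-traffic", 2000), ("low-traffic", 20)]:
    print(name, round(hours_to_verify(0.80, 0.02, rate), 1), "hours")
```

Under these assumptions the high-traffic agent clears the window within hours while the low-traffic one needs weeks, which is the asymmetry the verification window in gate 3 allows for.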

Not all AIR is equal

The four gates make sure an improvement is real. They don't make sure it matters.

A team can run at 3 AIR per month with every unit spent on metrics that were already fine — sentiment lifts on an agent already at 95% task success, latency shaves on an agent within acceptable latency — while the actual business pain (hallucinated invite links, failed handoffs, cost blowup on a specific intent) goes untouched. The AIR number looks elite. The business doesn't get better.

So AIR should always be reported with attribution, broken down by the metric moved: “AIR in Q2: 2.4/month — 1.2 from sentiment, 0.8 from latency, 0.4 from hallucination.” Anyone reading can see whether improvements are distributed across metrics or concentrated on ones that weren't hurting.
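
A breakdown like this falls out of the event log directly. The sketch below assumes each gate-passing event is stored as a `(metric, sign)` pair; the five-month window and the event counts are invented so that the output matches the shape of the quoted report.

```python
from collections import defaultdict


def air_by_metric(events, window_months):
    """Net gate-passing events per month, attributed to the metric each one moved."""
    net = defaultdict(int)
    for metric, sign in events:
        net[metric] += sign
    return {metric: count / window_months for metric, count in net.items()}


def total_air(events, window_months):
    """Overall AIR for the same event log."""
    return sum(sign for _, sign in events) / window_months


# Hypothetical log: 12 net events over five months.
log = [("sentiment", +1)] * 6 + [("latency", +1)] * 4 + [("hallucination", +1)] * 2
print(total_air(log, 5))      # 2.4
print(air_by_metric(log, 5))  # {'sentiment': 1.2, 'latency': 0.8, 'hallucination': 0.4}
```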

One step further: pair AIR with Critical-AIR, the subset of AIR spent on metrics currently below an acceptable threshold or tied to failure modes ranked by aggregate business impact — frequency × severity, not severity alone. A frequent minor leak and a rare catastrophic failure can both be critical; a severe failure on a code path that runs twice a month usually isn't. A team running at AIR 2.4/month and Critical-AIR 0.2/month is improving the wrong things — motion without progress. A healthy loop moves both numbers; improvement theater moves only the first.

Operationally: Critical-AIR sums the AIR units attributed to (a) metrics whose current value is below the team's published acceptable threshold at the start of the window, or (b) failure modes ranked in the top-k by frequency × severity. Publish the list alongside the number — otherwise Critical-AIR reduces to AIR with extra words.
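
A minimal sketch of that rule follows. The record shapes (`metric`, `failure_mode`, `sign`, `frequency`, `severity`), the failure-mode names, and the scores are all hypothetical; only the two inclusion criteria and the frequency × severity ranking come from the definition above.

```python
def top_k_failure_modes(failure_modes, k):
    """Names of the k failure modes with the highest frequency x severity."""
    ranked = sorted(
        failure_modes,
        key=lambda f: f["frequency"] * f["severity"],
        reverse=True,
    )
    return {f["name"] for f in ranked[:k]}


def critical_air(events, window_months, below_threshold_metrics, critical_modes):
    """Net events per month that touched a below-threshold metric or a top-k failure mode."""
    net = 0
    for e in events:
        on_weak_metric = e["metric"] in below_threshold_metrics
        on_critical_mode = e.get("failure_mode") in critical_modes
        if on_weak_metric or on_critical_mode:
            net += e["sign"]
    return net / window_months


# Hypothetical inputs: monthly frequency and a 1-10 severity score.
modes = [
    {"name": "hallucinated_invite_link", "frequency": 400, "severity": 3},
    {"name": "failed_handoff", "frequency": 150, "severity": 5},
    {"name": "rare_crash", "frequency": 2, "severity": 10},
]
critical = top_k_failure_modes(modes, k=2)

events = [
    {"metric": "sentiment", "failure_mode": None, "sign": +1},
    {"metric": "latency", "failure_mode": None, "sign": +1},
    {"metric": "hallucination", "failure_mode": "hallucinated_invite_link", "sign": +1},
]
print(sorted(critical))
print(critical_air(events, 1, below_threshold_metrics=set(), critical_modes=critical))
```

In this example the severe but rare crash ranks last, so only one of the three events counts toward Critical-AIR even though all three count toward AIR.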

Frequency-weighting credit: Shonnah Hughes's framing of “Agent Improvement Velocity” on LinkedIn — “how fast your AI gets better at its highest-frequency decisions” — sharpened the frequency dimension above.

Provisional bands

Illustrative, not benchmarked. One improvement cycle per sprint gives a focused agent about two candidate changes per month; real AIR sits below that ceiling because not every candidate passes all four gates. Elite teams run two or three parallel improvement streams. Negative AIR means the loop is producing verified regressions faster than verified improvements — different diagnostic category, not a weak number. DORA recalibrated its bands as survey data accumulated. AIR should do the same.

Band            AIR / agent / month
Elite           3+
High            1–2
Medium          ~0.5
Low             0 to 0.25
Losing ground   < 0

What AIR deliberately doesn't measure

Step-change model swaps. A swap that passes all four gates but isn't tied to a specific improvement cycle collapses into the count as one unit, so its magnitude goes uncaptured.

Dimensions you don't instrument. If it's not a tracked metric, it can't be an AIR unit. This is a feature: AIR is only as honest as the instrumentation it sits on.

Slow-burn regressions that surface after the verification window closes. Whatever window the team publishes catches most production effects; agent drift over a quarter is a different problem, handled by a different metric.

Impact. AIR counts every gate-passing improvement as one unit regardless of how often the affected code path runs or how severe the underlying failure was. A 0.1% lift on a dominant flow and a 0.1% lift on a rarely-run edge case look identical in the count. Critical-AIR partially addresses this by ranking failure modes on aggregate impact (frequency × severity); teams that care about this should rank explicitly, not implicitly.

A good AIR number means the improvement loop is working. It does not mean the agent is good on dimensions no one measured, or that the loop is aimed where the business actually hurts.

What AIR lets you say

Our support agent runs at 2.4 AIR/month. Sales qualification runs at 0.6/month. Why, and what should sales borrow?
Eighty percent of our AIR came from proactive improvements, not firefighting.
Headcount doubled this year. AIR didn't. Something in the process isn't scaling.

None of these sentences are available today without a named, gated, comparable number.

Closing

AIR is a proposal. The spec above is open. Measure your own, publish your floor and your tolerances, count only what clears all four gates. If the number is moving, the loop is working. If it isn't, you know where to look.

Comments, corrections, or adoptions: oren@converra.ai.

FAQ

What is Agent Improvement Rate?

Agent Improvement Rate (AIR) is the net pace at which a production agent's quality actually improves: verified improvements minus verified regressions, per agent per unit time. Each event (up or down) has to pass four gates: statistically real (head-to-head evidence at medium or higher), non-regressive (regression suite passes, and no other tracked metric drops more than the team's published tolerance), production-verified on real traffic within a published window, and above a published minimum lift. Computing AIR presupposes continuous agent-quality measurement; without a quality curve over time there is nothing for AIR to sit on top of.

How is AIR different from win rate or task success?

Win rate and task success are point-in-time scores. AIR is a derivative — how fast the agent is getting better. A 92% success rate tells you where you are; AIR tells you whether the team running the agent is compounding improvements or standing still.

Isn't this just Eval Improvement Rate with extra steps?

Close. Eval platforms already sum lift-over-baseline across experiments, which is a rate. The difference is gate 3: production verification. A substantial share of eval lift doesn't survive contact with real production distribution — scenario coverage is partial, persona drift is real, production mixes shift. AIR only counts the improvements that actually moved the metric on real traffic after deploy.

Why does "meaningful" need a threshold?

Without one, AIR rewards noise. A 0.2% bump on a volatile metric counts the same as a 4% lift that held for two weeks in production. Thresholds also block the obvious gaming path: splitting one real improvement into twenty micro-PRs to inflate the rate.

Does AIR only count fixes to broken behavior?

No. AIR counts both reactive improvements (closing a regression, fixing a production failure mode) and proactive ones (raising the ceiling on behavior that was already working). An agent with no open failures that lifts tool-precision from 94% to 96% is still improving.

What's a good AIR number?

Nobody knows yet. The bands on this page (3+ elite, 1–2 high, ~0.5 medium, 0 to 0.25 low, below 0 losing ground) are a back-of-envelope estimate: one improvement cycle per sprint gives a focused agent about two candidate changes per month, real AIR sits below that ceiling because not every candidate passes all four gates, and elite teams run two or three parallel streams. They're placeholders until real industry data exists. DORA recalibrated its bands as survey data accumulated; AIR should follow the same pattern.

Can I have high AIR and still be losing?

Yes. The four gates verify improvements are real, not that they matter. An agent with 3 AIR per month entirely on metrics that were already fine is motion without progress — improvement theater. That's why AIR should be reported broken down by metric, and ideally paired with Critical-AIR: the subset of AIR spent on metrics currently below an acceptable threshold or tied to top-ranked failure modes. A healthy loop moves both; a theater loop moves only the first.

Why per-agent and not per-team?

Agents have wildly different baselines, surface areas, and traffic. A billing agent at 97% has almost no headroom; a new triage agent at 70% has enormous headroom. Team-level averaging hides which agents are moving and which are stuck, and lets one fast-moving agent mask five stagnant ones.

Who audits the thresholds? Can a team game AIR with a low floor?

Yes, and that's a known weakness. AIR is self-reported. Cross-team comparison requires teams to publish their thresholds alongside the number. A team reporting AIR 3.2/month with a 0.1% floor is measurably different from one reporting AIR 3.2/month with a 3% floor. The floor is part of the number, not separate from it. DORA handled the same issue by insisting methodology be published with results.

Why count events instead of summing magnitude (score-diff)?

Summing the size of each improvement would be more precise — a 4% lift clearly matters more than a 1% lift. But you can't add a 4% success-rate bump and a 50ms latency cut into one number, which means magnitude-based AIR stops being comparable across agents or teams. Counting gate-passing events (with gate 4 blocking the trivial ones) is the tradeoff that gives you one shared number. NPS, MRR, and each DORA metric made the same call. Teams that want the magnitude detail should report per-metric lifts as a diagnostic overlay alongside the count.

Does AIR credit model upgrades the provider shipped?

Edge case. If a model provider silently improves your agent's performance, the quality curve moves but no change you shipped triggered it. Two defensible positions: conservative teams don't count it (no gate-passing candidate change was attributable); permissive teams count it as +1 tied to the config bump that adopted the new model. Pick one and be explicit. Mixing both policies within one AIR number hides whether the improvement was the team's work or the vendor's.

Is AIR a Converra-proprietary metric?

No. The spec is open. Any team or vendor can compute and report AIR against it. Converra happens to build the loop that closes the four gates end-to-end, which is why we're publishing the term, but the point is a shared unit of progress for the field.