Most teams can't tell improvement from noise
Most teams running agents in production ship changes constantly — prompts, orchestration code, routing, tool definitions, retrieval corpus, model config, the eval harness. Small teams ship daily. Larger teams ship weekly or more. The volume isn't the problem.
The problem is that most of these agents aren't ready. They ship before they're battle-tested: checked against a few dozen scenarios, then exposed to production traffic that surfaces hundreds the team didn't model. Real usage finds edge cases faster than the improvement loop closes them. The agent falls behind on day one and stays behind.
What the loop keeps skipping is verification: matched head-to-head evaluation before deploy, regression checks against prior behavior, and measurement of real-traffic impact after deploy. These are the expensive human steps. They get dropped, and the shipped change becomes a coin flip instead of an improvement.
The data to compute AIR honestly doesn't exist at most teams. Not because the number would be embarrassing — because the verification steps AIR requires are the ones the loop is already skipping.
Existing metrics don't catch this
Existing metrics describe where an agent is, not how fast it's moving. Win rate and task success are snapshots. Regression rate is defensive. DORA deployment frequency counts deploys, not quality change.
LangSmith, Braintrust, and Galileo come closest: lift-over-baseline summed across experiments is a rate of sorts. Call that Eval Improvement Rate. It counts lift measured in eval — much of which doesn't survive the shift in distribution when it reaches production.
The rate that matters is the one measured on real traffic after deploy. Nobody names it.
Agent quality is the moat. Velocity keeps it.
Customers buy better agents. Quality is what you compete on. The question is whether your quality advantage is defensible, or whether a competitor with the same model access and a similar starting point can catch up.
At manual-loop speed, they can. What's observable about your agent — the prompts, the visible output patterns — leaks through traces, jailbreaks, churned employees, or reverse engineering. A competitor willing to grind can roughly match what you released this quarter.
At high enough improvement velocity, they can't. Each verified improvement raises the baseline the next is measured against, so the team with the faster loop pulls further ahead every cycle. And what compounds isn't only the observable layer: routing, retrieval corpus, tool definitions, and orchestration don't leak. By the time a competitor recreates what they can see, you're already several improvements later on the layers they can't.
Traffic amplifies this. A high-traffic agent reaches statistical significance in hours; a low-traffic one might need days of real traffic to verify the same change. So scale itself accelerates the loop: the agent with more customers today compounds improvements faster than the one with fewer. An early quality-plus-traffic lead compounds into a quality lead that new entrants can't close by working harder.
That's why the rate deserves to be named and measured. Quality is what the customer sees. Velocity is what determines whether anyone can take it from you.
Definition
AIR = how fast your agent actually gets better in production, net of regressions.
Formally: the net rate of verified quality change per agent per unit time, that is, verified improvements minus verified regressions, where an event counts only if it clears all four gates below. The rate is signed, so a team shipping more regressions than improvements runs at negative AIR.
1. Not noise: statistically real
   The new version has to actually beat the old one in a matched comparison, not just score high on its own. (Technically: head-to-head pairs with evidence at medium or higher.)
2. Not a trade: non-regressive
   You don't get to win on one dimension by quietly losing on another. The golden-scenario regression suite passes, and no other tracked metric drops more than the team's published tolerance.
3. Not fake: production-verified
   The improvement has to show up on real traffic, not just in eval. A meaningful share of eval lift doesn't survive contact with production, and lift that doesn't survive doesn't count. (Technically: the target metric moves in the intended direction at p < 0.05 with statistical power ≥ 80% at the team's published minimum detectable effect, or MDE. Verification window: hours for high-traffic agents, days or weeks for low-traffic ones. Publish the MDE: p < 0.05 without power is a lottery ticket.) See production verification.
4. Not trivial: above floor
   The lift has to be big enough to care about. Default floor: 1% absolute or 5% relative, whichever is higher for the metric. Publishing the floor matters more than the specific number: a team with a 0.1% floor is telling you to ignore their AIR.
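The default floor in gate 4 can be sketched as a small check. This is one possible reading, not a reference implementation: it assumes the metric lives on a 0 to 1 scale (so "1% absolute" means 0.01, one percentage point), and it reads "whichever is higher" as taking the larger of the two thresholds. The function names `lift_floor` and `clears_floor` are hypothetical.

```python
def lift_floor(baseline: float, abs_floor: float = 0.01, rel_floor: float = 0.05) -> float:
    """Smallest lift that counts: the larger of the absolute and relative floors.

    Assumption: the metric is on a 0-1 scale, so abs_floor=0.01 is one point.
    """
    return max(abs_floor, rel_floor * abs(baseline))


def clears_floor(baseline: float, candidate: float, higher_is_better: bool = True) -> bool:
    """True if the change from baseline to candidate meets the floor (gate 4 only)."""
    lift = candidate - baseline if higher_is_better else baseline - candidate
    return lift >= lift_floor(baseline)


# Task success 0.80 -> 0.83: the relative floor (0.04) dominates, so this does not count.
print(clears_floor(0.80, 0.83))   # False
# Hallucination rate 0.10 -> 0.085 (lower is better): the absolute floor (0.01) dominates.
print(clears_floor(0.10, 0.085, higher_is_better=False))  # True
```

Metrics that are not proportions (latency in milliseconds, cost per conversation) would need a team-specific absolute floor in their own units; the relative floor carries over unchanged.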
Any tracked metric qualifies: success rate, sentiment, cost per conversation, latency, tool-call precision, hallucination rate, containment. AIR is always expressed as a rate — if your support agent accumulates a net of 2.4 gate-passing events per month, its AIR is 2.4/month. A month with 3 improvements and 5 regressions runs at −2/month.
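The arithmetic is simple enough to sketch. The snippet below assumes every event in the list has already cleared all four gates; the `GatedEvent` type and the `air` function are illustrative names, not part of any published tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatedEvent:
    metric: str      # e.g. "sentiment", "latency"
    direction: int   # +1 = verified improvement, -1 = verified regression


def air(events: list[GatedEvent], months: float) -> float:
    """Net gate-passing events per month for one agent (signed)."""
    if months <= 0:
        raise ValueError("months must be positive")
    return sum(e.direction for e in events) / months


# The month described above: 3 improvements and 5 regressions.
month = [GatedEvent("sentiment", +1)] * 3 + [GatedEvent("latency", -1)] * 5
print(air(month, months=1))  # -2.0
```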
A change can be any edit to the agent system that affects production behavior: prompts, orchestration code, routing logic, tool definitions, retrieval corpus, model config, evaluation harness, deployment infrastructure. The audit trail is your change log — git commits, deploy records, data-pipeline runs. Each candidate either passes the gates or doesn't count.
AIR counts both reactive improvements (fixing a regression) and proactive ones (raising the ceiling on something that wasn't broken). A support agent at 4.3 sentiment pushed to 4.6, regression suite clean, verified in production, is +1 on the sentiment axis — even though nothing was on fire.
Prerequisite: a quality curve
Computing AIR assumes you're already measuring agent quality over time. Without a continuous quality curve, there's nothing for verified improvements to step up on. Most teams that can't report AIR today are blocked because this substrate doesn't exist yet: not because the loop is failing, but because nobody has been scoring the agent continuously.
For the scoring mechanics: production verification covers before/after measurement on real traffic, and regression testing covers how head-to-head pairs produce the evidence levels gate 1 requires. What neither page covers in depth — and what's worth a separate piece — is how to maintain the continuous quality curve AIR sits on top of.
Gate 3 is the load-bearing one
All four gates are discipline. Gate 3 is where AIR separates from Eval Improvement Rate.
A meaningful share of eval lift doesn't survive contact with production distribution. Scenario coverage is partial, persona drift is real, and production mixes keep shifting. Eval-only numbers describe what teams hoped would happen.
AIR forces the verification onto real traffic. That's what makes the number expensive to produce and hard to game.
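To make the cost concrete, here is a rough sketch of how the gate 3 thresholds (p < 0.05, power ≥ 80%, a published MDE) translate into traffic. It assumes a binary metric such as task success, a two-sided two-proportion z-test, and two equally sized samples (before and after, or control and treatment). Those are simplifying assumptions made for illustration, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def conversations_per_arm(baseline_rate: float, mde: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per arm to detect an absolute lift of `mde`."""
    p1, p2 = baseline_rate, baseline_rate + mde
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must lie strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / mde ** 2)


def verification_days(n_per_arm: int, daily_conversations: float, arms: int = 2) -> float:
    """Days of traffic needed to fill every arm at the given daily volume."""
    return n_per_arm * arms / daily_conversations


n = conversations_per_arm(baseline_rate=0.80, mde=0.02)
print(n)                                  # on the order of six thousand per arm
print(verification_days(n, 20_000))       # under a day for a high-traffic agent
print(verification_days(n, 500))          # several weeks for a low-traffic one
```

Halving the MDE roughly quadruples the required sample, which is why the published MDE says as much about a team's AIR as the p-value does.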
Not all AIR is equal
The four gates make sure an improvement is real. They don't make sure it matters.
A team can run at 3 AIR per month with every unit spent on metrics that were already fine — sentiment lifts on an agent already at 95% task success, latency shaves on an agent within acceptable latency — while the actual business pain (hallucinated invite links, failed handoffs, cost blowup on a specific intent) goes untouched. The AIR number looks elite. The business doesn't get better.
So AIR should always be reported with attribution, broken down by the metric moved: “AIR in Q2: 2.4/month — 1.2 from sentiment, 0.8 from latency, 0.4 from hallucination.” Anyone reading can see whether improvements are distributed across metrics or concentrated on ones that weren't hurting.
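A minimal sketch of that breakdown, assuming each gate-passing event is recorded as a `(metric, direction)` pair with direction +1 or -1. The five-month window and the event counts are hypothetical, chosen only so the per-metric rates match the figures quoted above.

```python
from collections import defaultdict


def air_by_metric(events, months: float) -> dict[str, float]:
    """Net gate-passing events per month, attributed to the metric each one moved."""
    net = defaultdict(int)
    for metric, direction in events:
        net[metric] += direction
    return {metric: count / months for metric, count in net.items()}


# Hypothetical log: regressions net against improvements on the same metric.
events = (
    [("sentiment", +1)] * 7 + [("sentiment", -1)]
    + [("latency", +1)] * 4
    + [("hallucination", +1)] * 2
)
breakdown = air_by_metric(events, months=5)
print(breakdown)  # {'sentiment': 1.2, 'latency': 0.8, 'hallucination': 0.4}
print(round(sum(breakdown.values()), 2))  # 2.4
```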
One step further: pair AIR with Critical-AIR, the subset of AIR spent on metrics currently below an acceptable threshold or tied to failure modes ranked by aggregate business impact — frequency × severity, not severity alone. A frequent minor leak and a rare catastrophic failure can both be critical; a severe failure on a code path that runs twice a month usually isn't. A team running at AIR 2.4/month and Critical-AIR 0.2/month is improving the wrong things — motion without progress. A healthy loop moves both numbers; improvement theater moves only the first.
Operationally: Critical-AIR sums the AIR units attributed to (a) metrics whose current value is below the team's published acceptable threshold at the start of the window, or (b) failure modes ranked in the top-k by frequency × severity. Publish the list alongside the number — otherwise Critical-AIR reduces to AIR with extra words.
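The operational definition can be sketched as follows. Everything here is illustrative: the `GatedEvent` fields, the failure-mode names, and the frequency and severity scores are made up for the example, and the below-threshold metric list is assumed to be fixed at the start of the window.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GatedEvent:
    metric: str
    direction: int                      # +1 improvement, -1 regression
    failure_mode: Optional[str] = None  # failure mode the change targeted, if any


def top_k_failure_modes(impact: dict[str, tuple[float, float]], k: int) -> set[str]:
    """Rank failure modes by frequency x severity and keep the top k."""
    ranked = sorted(impact, key=lambda m: impact[m][0] * impact[m][1], reverse=True)
    return set(ranked[:k])


def critical_air(events, months: float, below_threshold: set[str],
                 critical_modes: set[str]) -> float:
    """Net events per month on below-threshold metrics or top-k failure modes."""
    critical = [e for e in events
                if e.metric in below_threshold or e.failure_mode in critical_modes]
    return sum(e.direction for e in critical) / months


# (monthly frequency, severity on a 1-10 scale): hypothetical scores.
impact = {"hallucinated_invite_link": (400, 3), "failed_handoff": (120, 5), "rare_crash": (2, 9)}
modes = top_k_failure_modes(impact, k=2)   # the severe-but-rare crash does not make the cut

events = [
    GatedEvent("sentiment", +1),
    GatedEvent("latency", +1),
    GatedEvent("hallucination", +1, failure_mode="hallucinated_invite_link"),
    GatedEvent("containment", +1),
]
print(critical_air(events, months=1, below_threshold={"containment"}, critical_modes=modes))  # 2.0
```

Plain AIR for the same month would be 4.0, so half of the verified improvement landed where the list says it hurts.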
Frequency-weighting credit: Shonnah Hughes's framing of “Agent Improvement Velocity” on LinkedIn — “how fast your AI gets better at its highest-frequency decisions” — sharpened the frequency dimension above.
Provisional bands
Illustrative, not benchmarked. Assuming two-week sprints, one improvement cycle per sprint gives a focused agent about two candidate changes per month; real AIR sits below that ceiling because not every candidate passes all four gates. Elite teams run two or three parallel improvement streams. Negative AIR means the loop is producing verified regressions faster than verified improvements: a different diagnostic category, not a weak number. DORA recalibrated its bands as survey data accumulated. AIR should do the same.
| Band | AIR / agent / month |
|---|---|
| Elite | 3+ |
| High | 1–2 |
| Medium | ~0.5 |
| Low | 0 to 0.25 |
| Losing ground | < 0 |
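For dashboards, the table can be turned into a lookup. The table leaves gaps between bands (for example between 2 and 3), so the sketch below makes an assumption the table does not state: each band runs from its lower bound up to the next band's lower bound. Treat the cutoffs as provisional in the same way the bands are.

```python
# Lower bounds per band, highest first. Assumption: gaps in the table are
# assigned to the band below (e.g. 2.5/month is treated as "High").
BANDS = [(3.0, "Elite"), (1.0, "High"), (0.5, "Medium"), (0.0, "Low")]


def air_band(air_per_month: float) -> str:
    """Map a per-agent monthly AIR to a provisional band label."""
    if air_per_month < 0:
        return "Losing ground"
    for lower_bound, name in BANDS:
        if air_per_month >= lower_bound:
            return name
    raise ValueError("AIR must be a real number")  # reached only for NaN


print(air_band(2.4))   # High
print(air_band(0.6))   # Medium
print(air_band(-2.0))  # Losing ground
```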
What AIR deliberately doesn't measure
- Step-change model swaps. A swap that passes all four gates but isn't tied to a specific improvement cycle collapses into the count as one unit, so its magnitude goes uncaptured.
- Dimensions you don't instrument. If it's not a tracked metric, it can't be an AIR unit. This is a feature: AIR is only as honest as the instrumentation it sits on.
- Slow-burn regressions that surface after the verification window closes. Whatever window the team publishes catches most production effects; agent drift over a quarter is a different problem, handled by a different metric.
- Impact. AIR counts every gate-passing improvement as one unit regardless of how often the affected code path runs or how severe the underlying failure was. A 0.1% lift on a dominant flow and a 0.1% lift on a rarely-run edge case look identical in the count. Critical-AIR partially addresses this by ranking failure modes on aggregate impact (frequency × severity); teams that care about this should rank explicitly, not implicitly.
A good AIR number means the improvement loop is working. It does not mean the agent is good on dimensions no one measured, or that the loop is aimed where the business actually hurts.
What AIR lets you say
Our support agent runs at 2.4 AIR/month. Sales qualification runs at 0.6/month. Why, and what should sales borrow?
Eighty percent of our AIR came from proactive improvements, not firefighting.
Headcount doubled this year. AIR didn't. Something in the process isn't scaling.
None of these sentences are available today without a named, gated, comparable number.
Closing
AIR is a proposal. The spec above is open. Measure your own, publish your floor and your tolerances, count only what clears all four gates. If the number is moving, the loop is working. If it isn't, you know where to look.
Comments, corrections, or adoptions: oren@converra.ai.