Most teams can't tell improvement from noise
Most teams running agents in production ship occasional prompt changes. Some help. Some don't. Most never get measured against real traffic after deploy, so nobody really knows which. The agent's metrics drift a little over months — usually up, occasionally down, sometimes flat — and the team isn't sure how much of the change came from their work versus customers shifting versus the underlying model updating beneath them.
That isn't a quality problem. The agent is probably fine at any given moment. It's an improvement-loop problem. The team is shipping changes they can't verify, so their rate of real improvement is indistinguishable from random drift.
Most agent teams are in some version of this condition. The ones competing against them are in the same one.
The loop runs through engineering
The typical path for improving a production agent goes through product triage, engineering scoping, a sprint, a prompt change, manual testing, a deploy, and eventually someone checking whether it helped. Each meaningful change is weeks of calendar time. Most weeks, agent improvement loses the prioritization battle to feature work.
Every step of the loop is serialized through an engineer whose time is already committed. Diagnosis, fix generation, scenario design, deploy, post-deploy measurement — all of it human time. And the changes that do ship often aren't measured against production afterwards, so nobody knows which ones worked.
Nobody reports their AIR (Agent Improvement Rate) today. That's not because the number would be embarrassing. It's because computing it requires data the current stack doesn't join automatically, and the loop itself is too slow for the number to change much over short horizons.
Existing metrics don't catch this
Existing metrics describe where an agent is, not how fast it's moving. Win rate and task success are snapshots. Regression rate is defensive. DORA deployment frequency counts deploys, not quality change.
LangSmith, Braintrust, and Galileo come closest: lift-over-baseline summed across experiments is a rate of sorts. Call that Eval Improvement Rate. But it counts lift that may never reach production, and roughly half of it doesn't.
The rate that matters is the one measured on real traffic after deploy. Nobody names it.
Agent quality is the moat. Velocity keeps it.
Customers buy better agents. Quality is what you compete on. The question is whether your quality advantage is defensible, or whether a competitor with the same model access and a similar starting point can catch up.
At manual-loop speed, they can. The specifics of any version you ship leak through traces, jailbreaks, churned employees, or straightforward reverse engineering. A competitor willing to grind can roughly match what you released this quarter.
At high enough improvement velocity, they can't. Each verified improvement raises the baseline the next is measured against, so the team with the faster loop pulls further ahead every cycle. By the time a competitor recreates your current prompt, you're on a version six improvements later. The gap is structural, not effort-based.
Traffic amplifies this. A high-traffic agent reaches statistical significance in hours; a low-traffic one might need days of real traffic to verify the same change. So scale itself accelerates the loop: the agent with more customers today compounds improvements faster than the one with fewer. An early quality-plus-traffic lead compounds into a quality lead that new entrants can't close by working harder.
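The traffic effect can be made concrete with a standard two-proportion sample-size calculation (normal approximation). The baseline rate, lift, and traffic figures below are illustrative assumptions, not numbers from the text:

```python
import math

def samples_per_arm(p_base: float, lift: float) -> int:
    """Approximate samples per arm needed to detect an absolute lift in a rate
    with a two-sided z-test at alpha = 0.05 and 80% power (normal approximation)."""
    z_a = 1.96   # z for alpha = 0.05, two-sided
    z_b = 0.84   # z for 80% power
    p_new = p_base + lift
    p_bar = (p_base + p_new) / 2
    n = ((z_a * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_b * math.sqrt(p_base * (1 - p_base) + p_new * (1 - p_new))) ** 2) / lift ** 2
    return math.ceil(n)

# Detect an 80% -> 82% task-success lift; compare a high- and a low-traffic agent.
n = samples_per_arm(p_base=0.80, lift=0.02)
for daily_traffic in (50_000, 500):
    days = 2 * n / daily_traffic  # both arms drawn from production traffic
    print(f"{daily_traffic} conversations/day -> ~{days:.1f} days to verify")
```

With these assumptions the high-traffic agent verifies in a fraction of a day while the low-traffic one needs roughly a month of real traffic for the same change, which is the compounding asymmetry described above.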
That's why the rate deserves to be named and measured. Quality is what the customer sees. Velocity is what determines whether anyone can take it from you.
Definition
AIR (Agent Improvement Rate) = how many real improvements your agent ships to production per month.
Formally: count of verified production improvements per agent per unit time, where each improvement passes the four gates below.
1. Not noise — statistically real. Evidence at medium or higher, measured on head-to-head pairs between the new version and baseline. Absolute scores don't count.
2. Not a trade — non-regressive. Passes the golden-scenario regression suite. No other tracked metric drops by more than a team-published tolerance (default: 0.5 standard deviations on head-to-head pairs). You don't get to trade quality axes and call it improvement.
3. Not fake — production-verified. The target metric, measured on real production traffic, moves in the intended direction at p < 0.05. The verification window is whatever reaches that threshold at the team's published power target: hours for high-traffic agents, days or weeks for low-traffic ones. Publish it. See production verification.
4. Not trivial — above floor. The verified lift exceeds a published minimum (default: 1% absolute or 5% relative, whichever is higher for the metric). Publishing the floor matters more than the specific number; teams that set a 0.1% floor are telling you to ignore their AIR.
Any tracked metric qualifies: success rate, sentiment, cost per conversation, latency, tool-call precision, hallucination rate, containment. One improvement that passes all four gates counts as one AIR unit. “Our support agent shipped 2.4 AIR last month” means 2.4 gate-passing improvements landed in production over 30 days.
AIR counts both reactive improvements (fixing a regression) and proactive ones (raising the ceiling on something that wasn't broken). A support agent at 4.3 sentiment pushed to 4.6, regression suite clean, verified in production, is 1 AIR on the sentiment axis — even though nothing was on fire.
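The number itself is then just a windowed count normalized to 30 days; fractional values like 2.4 fall out of averaging over longer windows. A minimal sketch (dates are made up):

```python
from datetime import date, timedelta

def air(landed: list, as_of: date, window_days: int = 30) -> float:
    """Gate-passing improvements in the trailing window, normalized to per-30-days."""
    start = as_of - timedelta(days=window_days)
    count = sum(1 for d in landed if start < d <= as_of)
    return round(count * 30 / window_days, 1)

shipped = [date(2025, 4, 8), date(2025, 4, 21), date(2025, 5, 6),
           date(2025, 5, 19), date(2025, 6, 2), date(2025, 6, 16), date(2025, 6, 27)]
print(air(shipped, as_of=date(2025, 6, 30)))                  # -> 3.0 (trailing month)
print(air(shipped, as_of=date(2025, 6, 30), window_days=90))  # -> 2.3 (quarterly view)
```

Longer windows smooth out sprint-boundary lumpiness at the cost of responsiveness; publishing which window you use matters for the same reason publishing the floor does.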
Gate 3 is the load-bearing one
All four gates are discipline. Gate 3 is where AIR separates from Eval Improvement Rate.
Half of eval lift doesn't survive contact with production distribution. Scenario coverage is partial, persona drift is real, and production mixes keep shifting. Teams that report only eval lift are telling you what they hoped would happen.
AIR forces the verification onto real traffic. That's what makes the number expensive to produce and hard to game.
Not all AIR is equal
The four gates make sure an improvement is real. They don't make sure it matters.
A team can run at 3 AIR per month with every unit spent on metrics that were already fine — sentiment lifts on an agent already at 95% task success, latency shaves on an agent within acceptable latency — while the actual business pain (hallucinated invite links, failed handoffs, cost blowup on a specific intent) goes untouched. The AIR number looks elite. The business doesn't get better.
So AIR should always be reported with attribution, broken down by the metric moved: “Q2 AIR: 2.4 total — 1.2 sentiment, 0.8 latency, 0.4 hallucination.” Anyone reading can see whether improvements are distributed across metrics or concentrated on ones that weren't hurting.
One step further: pair AIR with Critical-AIR, the subset of AIR spent on metrics currently below an acceptable threshold or tied to failure modes ranked by aggregate business impact — frequency × severity, not severity alone. A frequent minor leak and a rare catastrophic failure can both be critical; a severe failure on a code path that runs twice a month usually isn't. A team with AIR 2.4 and Critical-AIR 0.2 is improving the wrong things — motion without progress. A healthy loop moves both numbers; improvement theater moves only the first.
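A minimal sketch of attribution plus Critical-AIR, assuming each gate-passing unit is tagged with the metric it moved and each failure mode carries frequency and severity estimates. All names and numbers below are invented for illustration:

```python
from collections import Counter

# Rank failure modes by aggregate impact: frequency x severity, not severity alone.
failure_modes = {
    "hallucinated_invite_links": {"freq_per_conv": 0.0800, "severity": 3},
    "failed_handoff":            {"freq_per_conv": 0.0200, "severity": 5},
    "rare_catastrophic_path":    {"freq_per_conv": 0.0005, "severity": 9},
}
ranked = sorted(failure_modes,
                key=lambda m: failure_modes[m]["freq_per_conv"] * failure_modes[m]["severity"],
                reverse=True)
print(ranked)  # the twice-a-month catastrophe ranks last despite its severity

# Each gate-passing improvement, tagged with the metric moved and whether that
# metric was below threshold / tied to a top-ranked failure mode at the time.
units = [
    {"metric": "sentiment",     "critical": False},
    {"metric": "sentiment",     "critical": False},
    {"metric": "latency",       "critical": False},
    {"metric": "hallucination", "critical": True},
]
by_metric = Counter(u["metric"] for u in units)
print(f"AIR: {len(units)} total — {dict(by_metric)}")
print(f"Critical-AIR: {sum(u['critical'] for u in units)}")
```

Here the headline AIR is 4 but Critical-AIR is 1: three of four units went to metrics that were already fine, which is exactly the motion-without-progress pattern the breakdown is meant to expose.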
Frequency-weighting credit: Shonnah Hughes's framing of “Agent Improvement Velocity” on LinkedIn — “how fast your AI gets better at its highest-frequency decisions” — sharpened the frequency dimension above.
Provisional bands
Illustrative, not benchmarked. The numbers come from a back-of-envelope estimate: one improvement cycle per two-week sprint lands 2 AIR per month for a focused agent. Elite teams run two or three parallel improvement streams. DORA recalibrated its bands as survey data accumulated. AIR should do the same.
| Band | AIR / agent / month |
|---|---|
| Elite | 3+ |
| High | 1–2 |
| Medium | ~0.5 |
| Low | < 0.25 |
What AIR deliberately doesn't measure
- Step-change model swaps that pass all four gates but aren't tied to a specific improvement cycle: they collapse into the count without capturing magnitude.
- Dimensions you don't instrument. If it's not a tracked metric, it can't be an AIR unit. This is a feature: AIR is only as honest as the instrumentation it sits on.
- Slow-burn regressions that surface after the verification window closes. Whatever window the team publishes catches most production effects; agent drift over a quarter is a different problem, handled by a different metric.
- Impact. AIR counts every gate-passing improvement as one unit regardless of how often the affected code path runs or how severe the underlying failure was. A 0.1% lift on a dominant flow and a 0.1% lift on a rarely-run edge case look identical in the count. Critical-AIR partially addresses this by ranking failure modes on aggregate impact (frequency × severity); teams that care about this should rank explicitly, not implicitly.
A good AIR number means the improvement loop is working. It does not mean the agent is good on dimensions no one measured, or that the loop is aimed where the business actually hurts.
What AIR lets you say
- “Our support agent runs at 2.4 AIR. Sales qualification runs at 0.6. Why, and what should sales borrow?”
- “Eighty percent of our AIR came from proactive improvements, not firefighting.”
- “Headcount doubled this year. AIR didn't. Something in the process isn't scaling.”
None of these sentences are available today without a named, gated, comparable number.
Closing
AIR is a proposal. The spec above is open. Measure your own, publish your floor and your tolerances, count only what clears all four gates. If the number is moving, the loop is working. If it isn't, you know where to look.
Comments, corrections, or adoptions: oren@converra.ai.