Metric proposal

Your agent's improvement rate is a liability (or a moat)

Agent Improvement Rate (AIR) measures which one — the count of verified production improvements per agent per unit time, gated so the number means something when a team reports it.

Most teams can't tell improvement from noise

Most teams running agents in production ship occasional prompt changes. Some help. Some don't. Most never get measured against real traffic after deploy, so nobody really knows which. The agent's metrics drift a little over months — usually up, occasionally down, sometimes flat — and the team isn't sure how much of the change came from their work versus customers shifting versus the underlying model updating beneath them.

That isn't a quality problem. The agent is probably fine at any given moment. It's an improvement-loop problem. The team is shipping changes they can't verify, so their rate of real improvement is indistinguishable from random drift.

Most agent teams are in some version of this condition. The ones competing against them are in the same one.

The loop runs through engineering

The typical path for improving a production agent goes through product triage, engineering scoping, a sprint, a prompt change, manual testing, a deploy, and eventually someone checking whether it helped. Each meaningful change is weeks of calendar time. Most weeks, agent improvement loses the prioritization battle to feature work.

Every step of the loop is serialized through an engineer whose time is already committed. Diagnosis, fix generation, scenario design, deploy, post-deploy measurement — all of it human time. And the changes that do ship often aren't measured against production afterwards, so nobody knows which ones worked.

Nobody reports their AIR today. That's not because the number would be embarrassing. It's because computing it requires data the current stack doesn't join automatically, and the loop itself is too slow for the number to change much over short horizons.

Existing metrics don't catch this

Existing metrics describe where an agent is, not how fast it's moving. Win rate and task success are snapshots. Regression rate is defensive. DORA deployment frequency counts deploys, not quality change.

LangSmith, Braintrust, and Galileo come closest: lift-over-baseline summed across experiments is a rate of sorts. Call that Eval Improvement Rate. It counts lift that may never reach production, and roughly half of it doesn't survive the trip.

The rate that matters is the one measured on real traffic after deploy. Nobody names it.

Agent quality is the moat. Velocity keeps it.

Customers buy better agents. Quality is what you compete on. The question is whether your quality advantage is defensible, or whether a competitor with the same model access and a similar starting point can catch up.

At manual-loop speed, they can. The specifics of any version you ship leak through traces, jailbreaks, churned employees, or straightforward reverse engineering. A competitor willing to grind can roughly match what you released this quarter.

At high enough improvement velocity, they can't. Each verified improvement raises the baseline the next is measured against, so the team with the faster loop pulls further ahead every cycle. By the time a competitor recreates your current prompt, you're on a version six improvements later. The gap is structural, not effort-based.

Traffic amplifies this. A high-traffic agent reaches statistical significance in hours; a low-traffic one might need days of real traffic to verify the same change. So scale itself accelerates the loop: the agent with more customers today compounds improvements faster than the one with fewer. An early quality-plus-traffic lead compounds into a quality lead that new entrants can't close by working harder.
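The traffic-to-verification-time relationship above can be sketched with a standard two-proportion power calculation. This is an illustrative sketch, not part of the AIR spec; the function names, the 50/50 traffic split, and the example traffic volumes are assumptions.

```python
from statistics import NormalDist


def samples_per_arm(p_base: float, p_new: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Conversations needed per arm to detect a p_base -> p_new shift
    with a two-sided two-proportion z-test at the given alpha and power."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p_base + p_new) / 2
    num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (p_base * (1 - p_base) + p_new * (1 - p_new)) ** 0.5) ** 2
    return int(num / (p_new - p_base) ** 2) + 1


def verification_window_hours(daily_traffic: int,
                              p_base: float, p_new: float) -> float:
    """Hours of 50/50-split production traffic to reach significance."""
    need = samples_per_arm(p_base, p_new)
    per_arm_per_hour = daily_traffic / 2 / 24
    return need / per_arm_per_hour


# Detecting an 80% -> 81% success-rate lift: the high-traffic agent
# closes its verification window in hours, the low-traffic one in weeks.
high = verification_window_hours(200_000, 0.80, 0.81)
low = verification_window_hours(2_000, 0.80, 0.81)
```

Same change, same statistics; the only variable is traffic, which is the compounding mechanism the paragraph describes.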

That's why the rate deserves to be named and measured. Quality is what the customer sees. Velocity is what determines whether anyone can take it from you.

Definition

AIR = how many real improvements your agent ships to production per month.

Formally: count of verified production improvements per agent per unit time, where each improvement passes the four gates below.

  1. Not noise — statistically real

    Evidence at medium or higher, measured on head-to-head pairs between the new version and baseline. Absolute scores don't count.

  2. Not a trade — non-regressive

    Passes the golden-scenario regression suite. No other tracked metric drops by more than a team-published tolerance (default: 0.5 standard deviations on head-to-head pairs). You don't get to trade quality axes and call it improvement.

  3. Not fake — production-verified

    The target metric measured on real production traffic moves in the intended direction at p < 0.05. The verification window is whatever reaches that threshold at the team's published power target — hours for high-traffic agents, days or weeks for low-traffic ones. Publish it. See production verification.

  4. Not trivial — above floor

    The verified lift exceeds a published minimum (default: 1% absolute or 5% relative, whichever is higher for the metric). Publishing the floor matters more than the specific number; teams that set a 0.1% floor are telling you to ignore their AIR.

Any tracked metric qualifies: success rate, sentiment, cost per conversation, latency, tool-call precision, hallucination rate, containment. One improvement that passes all four gates counts as one AIR unit. “Our support agent shipped 2.4 AIR last month” means 2.4 gate-passing improvements landed in production over 30 days.
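The four gates compose into a single pass/fail check per shipped change. A minimal sketch of that check; the dataclass fields, field names, and default thresholds mirror the gate defaults above but are illustrative, not a normative implementation.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One shipped change, with its eval and production evidence."""
    evidence_level: str                 # "low" | "medium" | "high", from head-to-head pairs
    regression_suite_passed: bool       # golden-scenario suite result
    worst_other_metric_drop_sd: float   # largest drop on any other tracked metric, in SD
    production_p_value: float           # intended-direction move on real traffic
    lift_absolute: float                # e.g. 0.03 = 3 points absolute
    lift_relative: float                # e.g. 0.06 = 6% relative


def counts_as_air(c: Candidate,
                  tolerance_sd: float = 0.5,
                  floor_abs: float = 0.01,
                  floor_rel: float = 0.05) -> bool:
    not_noise = c.evidence_level in ("medium", "high")              # gate 1
    non_regressive = (c.regression_suite_passed
                      and c.worst_other_metric_drop_sd <= tolerance_sd)  # gate 2
    production_verified = c.production_p_value < 0.05               # gate 3
    # Gate 4: "whichever is higher" means the stricter floor applies,
    # so the lift must clear both the absolute and the relative minimum.
    above_floor = (c.lift_absolute >= floor_abs
                   and c.lift_relative >= floor_rel)
    return not_noise and non_regressive and production_verified and above_floor


# A 3-point absolute (6% relative) lift, p = 0.01, clean suite: counts.
passes = counts_as_air(Candidate("high", True, 0.1, 0.01, 0.03, 0.06))
```

Fail any one gate and the change ships but doesn't count, which is the whole point of the gating.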

AIR counts both reactive improvements (fixing a regression) and proactive ones (raising the ceiling on something that wasn't broken). A support agent at 4.3 sentiment pushed to 4.6, regression suite clean, verified in production, is 1 AIR on the sentiment axis — even though nothing was on fire.

Gate 3 is the load-bearing one

All four gates are discipline. Gate 3 is where AIR separates from Eval Improvement Rate.

Half of eval lift doesn't survive contact with production distribution. Scenario coverage is partial, persona drift is real, and production mixes keep shifting. Teams that report only eval lift are telling you what they hoped would happen.

AIR forces the verification onto real traffic. That's what makes the number expensive to produce and hard to game.

Not all AIR is equal

The four gates make sure an improvement is real. They don't make sure it matters.

A team can run at 3 AIR per month with every unit spent on metrics that were already fine — sentiment lifts on an agent already at 95% task success, latency shaves on an agent within acceptable latency — while the actual business pain (hallucinated invite links, failed handoffs, cost blowup on a specific intent) goes untouched. The AIR number looks elite. The business doesn't get better.

So AIR should always be reported with attribution, broken down by the metric moved: “Q2 AIR: 2.4 total — 1.2 sentiment, 0.8 latency, 0.4 hallucination.” Anyone reading can see whether improvements are distributed across metrics or concentrated on ones that weren't hurting.

One step further: pair AIR with Critical-AIR, the subset of AIR spent on metrics currently below an acceptable threshold or tied to failure modes ranked by aggregate business impact — frequency × severity, not severity alone. A frequent minor leak and a rare catastrophic failure can both be critical; a severe failure on a code path that runs twice a month usually isn't. A team with AIR 2.4 and Critical-AIR 0.2 is improving the wrong things — motion without progress. A healthy loop moves both numbers; improvement theater moves only the first.
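Attribution and Critical-AIR fall out of the same improvement records. A sketch under the assumption that each gate-passing improvement is tagged with the metric it moved and whether that metric was critical (below threshold, or tied to a top-ranked failure mode) at ship time; the record shape and window length are made up for illustration.

```python
from collections import Counter


def air_report(improvements: list[dict], window_months: float) -> dict:
    """improvements: one {"metric": str, "critical": bool} record per
    gate-passing improvement landed in the window. Rates are per month."""
    per_metric = Counter(i["metric"] for i in improvements)
    return {
        "air": round(len(improvements) / window_months, 2),
        "critical_air": round(
            sum(i["critical"] for i in improvements) / window_months, 2),
        "by_metric": {m: round(n / window_months, 2)
                      for m, n in per_metric.items()},
    }


shipped = [
    {"metric": "sentiment", "critical": False},
    {"metric": "sentiment", "critical": False},
    {"metric": "sentiment", "critical": False},
    {"metric": "latency", "critical": False},
    {"metric": "latency", "critical": False},
    {"metric": "hallucination", "critical": True},
]
report = air_report(shipped, window_months=2.5)
```

The breakdown makes improvement theater visible at a glance: a large `air` with a near-zero `critical_air` is exactly the motion-without-progress pattern described above.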

Frequency-weighting credit: Shonnah Hughes's framing of “Agent Improvement Velocity” on LinkedIn — “how fast your AI gets better at its highest-frequency decisions” — sharpened the frequency dimension above.

Provisional bands

Illustrative, not benchmarked. The numbers come from a back-of-envelope: one improvement cycle per sprint lands 2 AIR per month for a focused agent. Elite teams run two or three parallel improvement streams. DORA recalibrated its bands as survey data accumulated. AIR should do the same.

Band      AIR / agent / month
Elite     3+
High      1–2
Medium    ~0.5
Low       < 0.25

What AIR deliberately doesn't measure

Step-change model swaps that pass all four gates but aren't tied to a specific improvement cycle: they collapse into the count without capturing magnitude.

Dimensions you don't instrument. If it's not a tracked metric, it can't be an AIR unit. This is a feature: AIR is only as honest as the instrumentation it sits on.

Slow-burn regressions that surface after the verification window closes. Whatever window the team publishes catches most production effects; agent drift over a quarter is a different problem, handled by a different metric.

Impact. AIR counts every gate-passing improvement as one unit regardless of how often the affected code path runs or how severe the underlying failure was. A 0.1% lift on a dominant flow and a 0.1% lift on a rarely-run edge case look identical in the count. Critical-AIR partially addresses this by ranking failure modes on aggregate impact (frequency × severity); teams that care about this should rank explicitly, not implicitly.
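Ranking failure modes on aggregate impact is the explicit version of that judgment call. A sketch; the failure modes, frequencies, and severity scores are invented for illustration.

```python
def rank_by_impact(failure_modes: list[dict]) -> list[str]:
    """Order failure modes by frequency x severity, highest impact first.
    frequency: occurrences per 1k conversations; severity: 1 (minor) to 5 (catastrophic)."""
    return [f["name"] for f in sorted(
        failure_modes,
        key=lambda f: f["frequency"] * f["severity"],
        reverse=True)]


modes = [
    {"name": "hallucinated invite link", "frequency": 12.0, "severity": 4},
    {"name": "minor tone slip", "frequency": 80.0, "severity": 1},
    {"name": "rare data-wipe path", "frequency": 0.02, "severity": 5},
]
ranking = rank_by_impact(modes)
```

Note the ordering: the frequent minor failure outranks the rare catastrophic one, which is the frequency × severity point rather than a severity-only ranking.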

A good AIR number means the improvement loop is working. It does not mean the agent is good on dimensions no one measured, or that the loop is aimed where the business actually hurts.

What AIR lets you say

“Our support agent runs at 2.4 AIR. Sales qualification runs at 0.6. Why, and what should sales borrow?”
“Eighty percent of our AIR came from proactive improvements, not firefighting.”
“Headcount doubled this year. AIR didn't. Something in the process isn't scaling.”

None of these sentences are available today without a named, gated, comparable number.

Closing

AIR is a proposal. The spec above is open. Measure your own, publish your floor and your tolerances, count only what clears all four gates. If the number is moving, the loop is working. If it isn't, you know where to look.

Comments, corrections, or adoptions: oren@converra.ai.

FAQ

What is Agent Improvement Rate?

Agent Improvement Rate (AIR) is the count of verified production improvements per agent per unit time. An improvement only counts if it passes four gates: statistically real (head-to-head evidence at medium or higher), non-regressive (regression suite passes, no other tracked metric degrades), production-verified on real traffic within a published window, and above a published minimum lift threshold.

How is AIR different from win rate or task success?

Win rate and task success are point-in-time scores. AIR is a derivative — how fast the agent is getting better. A 92% success rate tells you where you are; AIR tells you whether the team running the agent is compounding improvements or standing still.

Isn't this just Eval Improvement Rate with extra steps?

Close. Eval platforms already sum lift-over-baseline across experiments, which is a rate. The difference is gate 3: production verification. Half of eval lift doesn't survive contact with real production distribution. AIR only counts the improvements that actually moved the metric on real traffic after deploy.

Why does “meaningful” need a threshold?

Without one, AIR rewards noise. A 0.2% bump on a volatile metric counts the same as a 4% lift that held for two weeks in production. Thresholds also block the obvious gaming path: splitting one real improvement into twenty micro-PRs to inflate the rate.

Does AIR only count fixes to broken behavior?

No. AIR counts both reactive improvements (closing a regression, fixing a production failure mode) and proactive ones (raising the ceiling on behavior that was already working). An agent with no open failures that lifts tool-precision from 94% to 96% is still improving.

What's a good AIR number?

Nobody knows yet. The bands on this page (3+ elite, 1–2 high, ~0.5 medium, <0.25 low) are a back-of-envelope: one improvement cycle per sprint lands around 2 AIR/month per focused agent, and elite teams run parallel streams. They're placeholders until real industry data exists. DORA recalibrated its bands as survey data accumulated; AIR should follow the same pattern.

Can I have high AIR and still be losing?

Yes. The four gates verify improvements are real, not that they matter. An agent with 3 AIR per month entirely on metrics that were already fine is motion without progress — improvement theater. That's why AIR should be reported broken down by metric, and ideally paired with Critical-AIR: the subset of AIR spent on metrics currently below an acceptable threshold or tied to top-ranked failure modes. A healthy loop moves both; a theater loop moves only the first.

Why per-agent and not per-team?

Agents have wildly different baselines, surface areas, and traffic. A billing agent at 97% has almost no headroom; a new triage agent at 70% has enormous headroom. Team-level averaging hides which agents are moving and which are stuck, and lets one fast-moving agent mask five stagnant ones.

Is AIR a Converra-proprietary metric?

No. The spec is open. Any team or vendor can compute and report AIR against it. Converra happens to build the loop that closes the four gates end-to-end, which is why we're publishing the term, but the point is a shared unit of progress for the field.