
Re-Auditing

Track agent changes over time.

Why re-audit

The score on day one is a snapshot. The score on day thirty, after you've shipped four prompt changes and a model upgrade, is the question that matters: did the changes actually help?

Re-audit answers that. Each re-audit is compared against the previous run; the report tells you what changed.

How to re-audit

On any saved report, click Re-audit.

The new run uses the same URL and the same eval set as the original. Findings, strengths, and the score are computed fresh against the agent's current behavior.

What you see on the new report

The new report renders at a fresh /eval/r/[token] URL but includes a diff banner at the top:

  • Score delta — 73 → 81 (+8)
  • Verdict shift — Fix first → Deploy, when applicable
  • Findings resolved — scenarios that failed before and pass now
  • Findings introduced — scenarios that passed before and fail now (the regression-watch case)
  • Strengths preserved / lost — to catch silent regressions in passing categories
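
The banner fields above amount to a score subtraction plus set arithmetic over the two runs. The sketch below shows one way to compute them; the report shape, field names, and verdict strings are illustrative assumptions, not the product's actual data model:

```python
def diff_reports(old, new):
    """Compute the diff-banner fields from two audit reports.

    Each report is assumed to carry a numeric score, a verdict string,
    a scenario-name -> passed mapping, and a list of strength names.
    """
    old_failed = {s for s, passed in old["scenarios"].items() if not passed}
    new_failed = {s for s, passed in new["scenarios"].items() if not passed}
    return {
        "score_delta": new["score"] - old["score"],
        # Verdict shift only shows when the verdict actually changed.
        "verdict_shift": (old["verdict"], new["verdict"])
                         if old["verdict"] != new["verdict"] else None,
        "resolved": sorted(old_failed - new_failed),    # failed before, pass now
        "introduced": sorted(new_failed - old_failed),  # passed before, fail now
        "strengths_preserved": sorted(set(old["strengths"]) & set(new["strengths"])),
        "strengths_lost": sorted(set(old["strengths"]) - set(new["strengths"])),
    }
```

Note that "introduced" is just "resolved" with the set difference reversed: a scenario that flips from pass to fail is the regression-watch case.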

The diff is the headline. Below it, the report renders the same way as a fresh audit.

Use the diff banner to talk to your team

The diff is the artifact you share when someone asks "did the prompt change help?"

  • Score went up: 73 → 81 is the proof
  • A regression appeared: the diff names the specific scenario that flipped
  • Nothing moved: the change was a wash; argue from the data instead of vibes

Eval set version pinning

Re-audits use the eval set version from the original run. Even if the eval set has changed since then, the re-audit stays comparable to the original — the rubric isn't shifting underneath you.

If you want to re-score an older report against a newer eval set, that's a separate audit, not a re-audit.
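
The pinning rule can be sketched as follows. The parameter and field names here are hypothetical, chosen only to make the re-audit / fresh-audit distinction concrete:

```python
def reaudit_params(original):
    """A re-audit reuses the original run's URL and eval set *version*,
    holding the rubric constant so the two scores are comparable.
    """
    return {
        "url": original["url"],
        "eval_set": original["eval_set"],
        "eval_set_version": original["eval_set_version"],  # pinned, even if a newer version exists
        "compare_to": original["token"],  # the previous report to diff against
    }

def fresh_audit_params(url, latest_eval_set):
    """Re-scoring against a newer eval set is a separate fresh audit:
    it picks up the latest version and has no previous run to diff against.
    """
    return {
        "url": url,
        "eval_set": latest_eval_set["name"],
        "eval_set_version": latest_eval_set["version"],
        "compare_to": None,
    }
```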

When the URL no longer works

Sometimes the agent's URL goes down, moves, or changes auth. The re-audit fails with a clear error and isn't charged.
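
A minimal sketch of the kind of pre-flight check that fails fast before any billable work starts. This is an assumption about how such a check could look, using only the standard library; the real service's checks and error messages will differ:

```python
import urllib.error
import urllib.request

def check_target(url, timeout=10):
    """Return None if the URL looks reachable, else a human-readable error.

    Catches the three failure modes named above: the URL is down,
    has moved in a way that breaks, or now rejects the request (auth).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status >= 400:
                return f"target returned HTTP {resp.status}"
    except urllib.error.HTTPError as e:
        return f"target returned HTTP {e.code}"   # e.g. 401 after an auth change
    except (urllib.error.URLError, TimeoutError) as e:
        return f"target unreachable: {e}"         # DNS failure, refused, timeout
    return None  # reachable; safe to start the re-audit
```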

To audit a new URL, submit it as a fresh audit instead of re-auditing the old one.