Re-Auditing
Track agent changes over time.
Why re-audit
The score on day one is a snapshot. The score on day thirty, after you've shipped four prompt changes and a model upgrade, is the question that matters: did the changes actually help?
Re-audit answers that. Each re-audit is compared against the previous run; the report tells you what changed.
How to re-audit
On any saved report, click Re-audit.
The new run uses the same URL and the same eval set as the original. Findings, strengths, and the score are computed fresh against the agent's current behavior.
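For intuition, here is a minimal sketch of what a re-audit reuses, assuming a run record shaped roughly like this. `AuditRun`, `createReAudit`, and every field name are illustrative, not the product's real schema:

```ts
// Hypothetical sketch only; none of these names come from the actual API.
interface AuditRun {
  token: string;          // the [token] segment in the /eval/r/[token] URL
  agentUrl: string;       // the agent endpoint that was audited
  evalSetId: string;      // which eval set was run
  evalSetVersion: number; // the eval set version pinned to this run
}

// A re-audit copies everything from the original except the report token,
// then evaluates the agent's current behavior from scratch.
function createReAudit(original: AuditRun): Omit<AuditRun, "token"> {
  return {
    agentUrl: original.agentUrl,             // same URL
    evalSetId: original.evalSetId,           // same eval set
    evalSetVersion: original.evalSetVersion, // same version (see pinning below)
  };
}
```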
What you see on the new report
The new report renders at a fresh /eval/r/[token] URL but includes a diff banner at the top:
- Score delta — 73 → 81 (+8)
- Verdict shift — Fix first → Deploy, when applicable
- Findings resolved — scenarios that failed before and pass now
- Findings introduced — scenarios that passed before and fail now (the regression-watch case)
- Strengths preserved / lost — to catch silent regressions in passing categories
The diff is the headline. Below it, the report renders the same way as a fresh audit.
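If it helps to see the banner's fields as data, here is one plausible derivation of each line from two runs. The types and names (`RunResult`, `diffRuns`, and so on) are assumptions for illustration, not the report's actual schema:

```ts
// Illustrative shapes only, invented for this sketch.
type Verdict = "Deploy" | "Fix first";

interface RunResult {
  score: number;
  verdict: Verdict;
  failedScenarios: Set<string>;  // findings: scenarios that failed
  passedCategories: Set<string>; // strengths: categories that passed
}

function diffRuns(prev: RunResult, next: RunResult) {
  return {
    // Score delta: 73 → 81 yields +8
    scoreDelta: next.score - prev.score,
    // Verdict shift, when applicable: e.g. "Fix first → Deploy"
    verdictShift:
      prev.verdict === next.verdict ? null : `${prev.verdict} → ${next.verdict}`,
    // Findings resolved: failed before, pass now
    resolved: [...prev.failedScenarios].filter((s) => !next.failedScenarios.has(s)),
    // Findings introduced: passed before, fail now (the regression-watch case)
    introduced: [...next.failedScenarios].filter((s) => !prev.failedScenarios.has(s)),
    // Strengths lost: silent regressions in previously passing categories
    strengthsLost: [...prev.passedCategories].filter((c) => !next.passedCategories.has(c)),
  };
}
```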
Use the diff banner to talk to your team
The diff is the artifact you share when someone asks "did the prompt change help?"
- Score went up: 73 → 81 is the proof
- A regression appeared: the diff names the specific scenario that flipped
- Nothing moved: the change was a wash; argue from the data instead of vibes
Eval set version pinning
Re-audits use the same eval set version as the original run. If the eval set has changed since then, the re-audit remains comparable — the rubric isn't shifting underneath you.
If you want to re-score an older report against a newer eval set, that's a separate audit, not a re-audit.
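As a sketch of the rule, with invented names: version resolution is the only place a re-audit and a fresh audit differ in how they pick the rubric.

```ts
// Invented names: this just encodes the pinning rule described above.
interface OriginalRun {
  evalSetId: string;
  evalSetVersion: number;
}

function resolveEvalSetVersion(
  original: OriginalRun,
  latestVersion: number,
  isReAudit: boolean,
): number {
  // Re-audit: stay pinned to the original run's version so the comparison holds.
  // Fresh audit: score against the eval set's latest version.
  return isReAudit ? original.evalSetVersion : latestVersion;
}
```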
When the URL no longer works
Sometimes the URL goes down, moves, or changes auth. The re-audit will fail with a clear error and won't be charged.
To audit a new URL, submit it as a fresh audit instead of re-auditing the old one.
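One way the no-charge-on-failure behavior could be wired, with `runAudit`, `chargeForRun`, and `reportError` all hypothetical stand-ins rather than real API calls:

```ts
// Hypothetical stand-ins for the real runner and billing calls.
declare function runAudit(agentUrl: string): Promise<{ score: number }>;
declare function chargeForRun(run: { score: number }): Promise<void>;
declare function reportError(message: string): void;

async function reAudit(agentUrl: string): Promise<void> {
  try {
    const result = await runAudit(agentUrl); // fails fast if the URL is down,
    await chargeForRun(result);              // moved, or rejecting auth
  } catch (err) {
    // Surface a clear error; the failed run is never charged.
    reportError(`Re-audit of ${agentUrl} failed: ${(err as Error).message}`);
  }
}
```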
