# Scoring
How the audit grades an agent.
## The single score
Every report has one headline number, 0–100. It's a weighted aggregate of the agent's performance across the scenarios in the eval set.
The score is anchored — a 73 today on Eval Set v3 is comparable to a 73 next month on the same eval set. The eval set used is shown at the top of every report so you can tell whether you're looking at apples-to-apples.
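As a rough sketch of that arithmetic (the scenario names, weights, and function below are invented for illustration; each eval set defines its own scenarios and weighting), a weighted aggregate might look like:

```python
# Illustrative only: scenario names and weights are made up for this sketch.
scenario_scores = {"refund_request": 0.75, "angry_customer": 0.5, "out_of_scope": 1.0}
scenario_weights = {"refund_request": 2.0, "angry_customer": 1.0, "out_of_scope": 1.0}

def headline_score(scores: dict[str, float], weights: dict[str, float]) -> int:
    """Weight each scenario's 0-1 score, then scale to the 0-100 headline number."""
    total = sum(weights.values())
    return round(100 * sum(scores[s] * weights[s] for s in scores) / total)

print(headline_score(scenario_scores, scenario_weights))  # 75
```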
## Verdict bands
| Score | Verdict | What it means |
|---|---|---|
| 85–100 | Deploy | The agent handles the scenario battery cleanly. Ship it. |
| 60–84 | Fix first | Real failures present. The agent is usable but has issues a reviewer would catch. |
| 0–59 | Don't ship | Critical or systemic failures. Going to production would be a regression. |
The verdict is a recommendation, not a gate — you can deploy a 50 if you understand what you're getting.
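Expressed as code, the bands are a plain threshold lookup. A minimal sketch (the function is hypothetical, not part of any shipped API):

```python
def verdict(score: int) -> str:
    """Map a 0-100 headline score to its verdict band."""
    if score >= 85:
        return "Deploy"
    if score >= 60:
        return "Fix first"
    return "Don't ship"

assert verdict(73) == "Fix first"
assert verdict(50) == "Don't ship"  # still deployable if you accept what you're getting
```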
## What's measured
The default rubric covers four categories:
- Task completion — does the agent achieve the user's goal?
- Accuracy — does it stay grounded, avoid hallucinations, hand off when it should?
- Tone — is it appropriate for the domain?
- Safety — does it refuse out-of-scope or harmful requests?
Each category contributes to the overall score; weights vary by eval set.
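To make the weighting concrete, here is a hypothetical set of category weights and the roll-up into the overall score. The numbers are invented; real weights vary by eval set:

```python
# Hypothetical weights: each eval set defines its own.
rubric_weights = {"task_completion": 0.4, "accuracy": 0.3, "tone": 0.1, "safety": 0.2}

def overall(category_scores: dict[str, float]) -> int:
    """Combine per-category 0-1 scores into the 0-100 headline number."""
    return round(100 * sum(category_scores[c] * w for c, w in rubric_weights.items()))

print(overall({"task_completion": 0.9, "accuracy": 0.8, "tone": 1.0, "safety": 1.0}))  # 90
```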
## Custom metrics
In addition to the default rubric, the audit can run agent-specific metrics. A medical intake bot is scored on PHI handling. A returns assistant is scored on policy adherence. A scheduling agent is scored on conflict resolution.
Custom metrics layer on top of the default — they don't replace the four-category rubric. Reports show them as a separate section so you can see where the score came from.
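A hypothetical report payload makes the layering visible. The field names here are illustrative, not a documented schema:

```python
# Custom metrics appear alongside the default rubric, not inside it.
report = {
    "eval_set": "Eval Set v3",
    "score": 73,
    "rubric": {"task_completion": 78, "accuracy": 70, "tone": 85, "safety": 62},
    # Agent-specific section, e.g. for a medical intake bot:
    "custom_metrics": {"phi_handling": 91},
}
```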
## Findings
Every failure in the report is recorded as a finding. Each finding carries:
- Severity — critical / high / medium / low
- Category — which rubric item it failed
- Transcript snippet — the actual conversation turn that triggered the finding
- Suggested fix — what the agent should have done instead
Findings are sorted by severity. Critical findings are the ones that block a Deploy verdict.
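Those four fields map naturally onto a record type. A sketch (the names are illustrative; the report format is not a published schema):

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4

@dataclass
class Finding:
    severity: Severity
    category: str            # rubric item the agent failed, e.g. "accuracy"
    transcript_snippet: str  # the conversation turn that triggered the finding
    suggested_fix: str       # what the agent should have done instead

findings = [
    Finding(Severity.MEDIUM, "tone", "Snippet...", "Suggested fix..."),
    Finding(Severity.CRITICAL, "safety", "Snippet...", "Suggested fix..."),
]
findings.sort(key=lambda f: f.severity.value)  # critical findings first
```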
## Strengths
The report also lists what the agent did well. Strengths mirror the structure of findings, inverted: a passed scenario, a category, and a quote that demonstrates good behavior.
Strengths exist so a re-audit can detect regressions: a strength that flips to a finding is a meaningful signal.
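A re-audit can surface such flips mechanically. A sketch, assuming each strength and finding carries a scenario identifier (an assumption for illustration, not a documented field):

```python
def regressions(old_strengths: set[str], new_findings: set[str]) -> set[str]:
    """Scenarios that were strengths in the last audit but are findings now."""
    return old_strengths & new_findings

# 'angry_customer' passed last time and fails now: a meaningful signal.
print(regressions({"refund_request", "angry_customer"}, {"angry_customer"}))
```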
## Eval set versioning
Eval sets are versioned. When we ship a new version, existing reports keep their original eval set ID — your historical scores don't shift retroactively.
The version is shown on every report ("Eval Set v3") so you know what you're comparing against.
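If you track scores over time, a guard like the following keeps comparisons within one eval set version (a sketch; the field names are assumptions):

```python
def comparable(report_a: dict, report_b: dict) -> bool:
    """Headline scores are only anchored within the same eval set."""
    return report_a["eval_set"] == report_b["eval_set"]

old = {"eval_set": "Eval Set v3", "score": 73}
new = {"eval_set": "Eval Set v3", "score": 78}
assert comparable(old, new)  # apples to apples
```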
