# Scoring
How the audit grades an agent.
## The single score
Every report has one headline number, 0–100. It's a weighted aggregate of the agent's performance across the scenarios in the eval set.
The score is anchored — a 73 today on Eval Set v3 is comparable to a 73 next month on the same eval set. The eval set used is shown at the top of every report so you can tell whether you're looking at apples-to-apples.
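As a rough sketch of that arithmetic (the scenario names, weights, and function below are invented for illustration; each eval set defines its own scenarios and weighting), a weighted aggregate might look like:

```python
# Illustrative only: scenario names and weights are made up for this sketch.
scenario_scores = {"refund_request": 0.75, "angry_customer": 0.5, "out_of_scope": 1.0}
scenario_weights = {"refund_request": 2.0, "angry_customer": 1.0, "out_of_scope": 1.0}

def headline_score(scores: dict[str, float], weights: dict[str, float]) -> int:
    """Weight each scenario's 0-1 score, then scale to the 0-100 headline number."""
    total = sum(weights.values())
    return round(100 * sum(scores[s] * weights[s] for s in scores) / total)

print(headline_score(scenario_scores, scenario_weights))  # 75
```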
## Verdict bands
| Score | Verdict | What it means |
|---|---|---|
| 85–100 | Deploy | The agent handles the scenario battery cleanly. Ship it. |
| 60–84 | Fix first | Real failures present. The agent is usable but has issues a reviewer would catch. |
| 0–59 | Don't ship | Critical or systemic failures. Going to production would be a regression. |
The verdict is a recommendation, not a gate — you can deploy a 50 if you understand what you're getting.
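Expressed as code, the bands are a plain threshold lookup. A minimal sketch (the function is hypothetical, not part of any shipped API):

```python
def verdict(score: int) -> str:
    """Map a 0-100 headline score to its verdict band."""
    if score >= 85:
        return "Deploy"
    if score >= 60:
        return "Fix first"
    return "Don't ship"

assert verdict(73) == "Fix first"
assert verdict(50) == "Don't ship"  # still deployable if you accept what you're getting
```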
## What's measured
The default rubric covers four categories:
- Task completion — does the agent achieve the user's goal?
- Accuracy — does it stay grounded, avoid hallucinations, hand off when it should?
- Tone — is it appropriate for the domain?
- Safety — does it refuse out-of-scope or harmful requests?
Each category contributes to the overall score; weights vary by eval set.
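To make the weighting concrete, here is a hypothetical set of category weights and the roll-up into the overall score. The numbers are invented; real weights vary by eval set:

```python
# Hypothetical weights: each eval set defines its own.
rubric_weights = {"task_completion": 0.4, "accuracy": 0.3, "tone": 0.1, "safety": 0.2}

def overall(category_scores: dict[str, float]) -> int:
    """Combine per-category 0-1 scores into the 0-100 headline number."""
    return round(100 * sum(category_scores[c] * w for c, w in rubric_weights.items()))

print(overall({"task_completion": 0.9, "accuracy": 0.8, "tone": 1.0, "safety": 1.0}))  # 90
```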
## Custom metrics
In addition to the default rubric, the audit can run agent-specific metrics. A medical intake bot is scored on PHI handling. A returns assistant is scored on policy adherence. A scheduling agent is scored on conflict resolution.
Custom metrics layer on top of the default — they don't replace the four-category rubric. Reports show them as a separate section so you can see where the score came from.
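A hypothetical report payload makes the layering visible. The field names here are illustrative, not a documented schema:

```python
# Custom metrics appear alongside the default rubric, not inside it.
report = {
    "eval_set": "Eval Set v3",
    "score": 73,
    "rubric": {"task_completion": 78, "accuracy": 70, "tone": 85, "safety": 62},
    # Agent-specific section, e.g. for a medical intake bot:
    "custom_metrics": {"phi_handling": 91},
}
```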
## Findings
Every failure in the report is recorded as a finding. Each finding carries:
- Severity — critical / high / medium / low
- Category — which rubric item it failed
- Transcript snippet — the actual conversation turn that triggered the finding
- Suggested fix — what the agent should have done instead
Findings are sorted by severity. Critical findings are the ones that block a Deploy verdict.
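Those four fields map naturally onto a record type. A sketch (the names are illustrative; the report format is not a published schema):

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4

@dataclass
class Finding:
    severity: Severity
    category: str            # rubric item the agent failed, e.g. "accuracy"
    transcript_snippet: str  # the conversation turn that triggered the finding
    suggested_fix: str       # what the agent should have done instead

findings = [
    Finding(Severity.MEDIUM, "tone", "Snippet...", "Suggested fix..."),
    Finding(Severity.CRITICAL, "safety", "Snippet...", "Suggested fix..."),
]
findings.sort(key=lambda f: f.severity.value)  # critical findings first
```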
## Strengths
The report also lists what the agent did well. Strengths mirror the structure of findings, inverted: a passed scenario, a category, and a quote that demonstrates good behavior.
Strengths exist so a re-audit can detect regressions: a strength that flips to a finding is a meaningful signal.
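A re-audit can surface such flips mechanically. A sketch, assuming each strength and finding carries a scenario identifier (an assumption for illustration, not a documented field):

```python
def regressions(old_strengths: set[str], new_findings: set[str]) -> set[str]:
    """Scenarios that were strengths in the last audit but are findings now."""
    return old_strengths & new_findings

# 'angry_customer' passed last time and fails now: a meaningful signal.
print(regressions({"refund_request", "angry_customer"}, {"angry_customer"}))
```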
## Eval set versioning
Eval sets are versioned. When we ship a new version, existing reports keep their original eval set ID — your historical scores don't shift retroactively.
The version is shown on every report ("Eval Set v3") so you know what you're comparing against.
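If you track scores over time, a guard like the following keeps comparisons within one eval set version (a sketch; the field names are assumptions):

```python
def comparable(report_a: dict, report_b: dict) -> bool:
    """Headline scores are only anchored within the same eval set."""
    return report_a["eval_set"] == report_b["eval_set"]

old = {"eval_set": "Eval Set v3", "score": 73}
new = {"eval_set": "Eval Set v3", "score": 78}
assert comparable(old, new)  # apples to apples
```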
