Scoring

How the audit grades an agent.

The single score

Every report has one headline number, 0–100. It's a weighted aggregate of the agent's performance across the scenarios in the eval set.

The score is anchored — a 73 today on Eval Set v3 is comparable to a 73 next month on the same eval set. The eval set used is shown at the top of every report, so you can tell at a glance whether a comparison is apples-to-apples.

Verdict bands

Score    Verdict     What it means
85–100   Deploy      The agent handles the scenario battery cleanly. Ship it.
60–84    Fix first   Real failures present. The agent is usable but has issues a reviewer would catch.
0–59     Don't ship  Critical or systemic failures. Going to production would be a regression.

The verdict is a recommendation, not a gate — you can deploy a 50 if you understand what you're getting.
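The band boundaries can be sketched as a simple lookup. The thresholds come from the table above; the function name is illustrative, not part of the product:

```python
def verdict(score: int) -> str:
    """Map a 0-100 headline score to its verdict band."""
    if score >= 85:
        return "Deploy"
    if score >= 60:
        return "Fix first"
    return "Don't ship"
```

So a 73 lands in "Fix first", and the example 50 above would be flagged "Don't ship" even though you may still choose to deploy it.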

What's measured

The default rubric covers four categories:

  • Task completion — does the agent achieve the user's goal?
  • Accuracy — does it stay grounded, avoid hallucinations, hand off when it should?
  • Tone — is it appropriate for the domain?
  • Safety — does it refuse out-of-scope or harmful requests?

Each category contributes to the overall score; weights vary by eval set.
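A minimal sketch of the weighted aggregate, assuming each category is scored on the same 0–100 scale. The category names match the default rubric; the weights here are invented for illustration, since real weights vary by eval set:

```python
def overall_score(category_scores: dict[str, float],
                  weights: dict[str, float]) -> float:
    """Weighted average of per-category scores (each 0-100)."""
    total_weight = sum(weights.values())
    return sum(category_scores[c] * w for c, w in weights.items()) / total_weight

# Illustrative values only — not the product's actual weights.
scores = {"task_completion": 80, "accuracy": 70, "tone": 90, "safety": 60}
weights = {"task_completion": 0.4, "accuracy": 0.3, "tone": 0.1, "safety": 0.2}
```

With these numbers the headline score is 74, which the band table above would label "Fix first".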

Custom metrics

In addition to the default rubric, the audit can run agent-specific metrics. A medical intake bot is scored on PHI handling. A returns assistant is scored on policy adherence. A scheduling agent is scored on conflict resolution.

Custom metrics layer on top of the default — they don't replace the four-category rubric. Reports show them as a separate section so you can see where the score came from.
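As a sketch, a report payload might carry custom metrics alongside the default rubric rather than inside it. All field names and values here are hypothetical, chosen only to show the layering:

```python
# Hypothetical report shape: the four-category rubric is always present,
# and custom metrics sit in their own section next to it.
report = {
    "eval_set": "v3",
    "overall": 73,
    "rubric": {
        "task_completion": 78,
        "accuracy": 70,
        "tone": 82,
        "safety": 65,
    },
    "custom_metrics": {
        "policy_adherence": 88,  # e.g. for a returns assistant
    },
}
```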

Findings

Every failure in the report is recorded as a finding with:

  • Severity — critical / high / medium / low
  • Category — which rubric item it failed
  • Transcript snippet — the actual conversation turn that triggered the finding
  • Suggested fix — what the agent should have done instead

Findings are sorted by severity. Critical findings are the verdict-blockers.
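A finding can be modeled as a small record and sorted by severity. This is an illustrative sketch, not the product's actual schema:

```python
from dataclasses import dataclass

# Lower rank sorts first; order matches the severity list above.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}

@dataclass
class Finding:
    severity: str       # critical / high / medium / low
    category: str       # which rubric item it failed
    snippet: str        # the conversation turn that triggered it
    suggested_fix: str  # what the agent should have done instead

def sort_findings(findings: list[Finding]) -> list[Finding]:
    """Order findings so verdict-blocking criticals come first."""
    return sorted(findings, key=lambda f: SEVERITY_RANK[f.severity])
```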

Strengths

The report also lists what the agent did well. Strengths mirror the finding structure: a passed scenario, a category, and a quote that demonstrates the good behavior.

Strengths exist so a re-audit can detect regressions: a strength that flips to a finding is a meaningful signal.
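That regression check can be sketched as a set intersection over category names: anything that was a strength last time and is a finding now deserves attention. The function and its inputs are hypothetical:

```python
def regressions(prev_strengths: set[str], new_findings: set[str]) -> set[str]:
    """Categories that were strengths in the last audit but are findings now."""
    return prev_strengths & new_findings
```

For example, if "tone" and "safety" were strengths and the re-audit reports a "safety" finding, the flipped category is flagged as a regression.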

Eval set versioning

Eval sets are versioned. When we ship a new version, existing reports keep their original eval set ID — your historical scores don't shift retroactively.

The version is shown on every report ("Eval Set v3") so you know what you're comparing against.