Capability

Step-level diagnosis

Converra traces failures to the exact agent and exact turn in multi-agent conversations, with root cause classification and per-step scoring. Then it fixes them automatically.

Conversation #4091 — SDR Agent

Score: 12
Aggressive volume-only disqualification threshold in prompt
Step 1 · User

"We use smart badges for our events but need a less expensive alternative. Must share contact details between exhibitors and delegates. Sustainability is important."

Step 2 · Agent

Asks about event volume: how many events are planned in the next 12 months. A good qualifying question.

Step 3 · User

"2 conferences, about 200 attendees each."

Step 4 · Agent · Root cause: Prompt Issue

"Based on your current event volume, we may not be the best fit." Redirected to a community page. Prospect dismissed.

Disqualified on attendee count alone — ignored product-fit signals (smart badges, contact sharing, sustainability).
Intent 25 · Relevance 15 · Context 20 · Tool Use 30

Three levels of diagnosis granularity

Conversation-level

"This conversation failed"

You know something went wrong but not where or why. You read the full transcript and guess.

Most observability tools, basic eval frameworks

Turn-level

"The agent's response at step 3 was bad"

Better — you know which response was wrong. But you still don't know if it was a prompt issue, model issue, or context issue.

Some eval platforms with per-turn scoring

Step-level (Converra)

"Step 3 failed because the agent ignored buying signals already stated — this is a prompt issue in the intent-matching instructions"

Root cause identified, fix can be generated automatically

Converra

Root cause classification

Every diagnosed step is classified by root cause type — so the fix targets the actual problem, not a symptom.

~85%

Prompt Issue

Instructions, goals, routing logic, or guardrails in the system prompt. Fixable automatically.

~3%

Model Mismatch

The model isn't suited for the task. Too slow, too expensive, or not capable enough for the required reasoning.

~2%

Config Error

Guardrail thresholds, temperature settings, or tool configurations that suppress the right behavior.

~10%

Code / Orchestration

Handoff logic, state management, or API integration issues. Diagnosed to the exact point — your team fixes in hours, not weeks.

Based on root cause analysis across 103 real production conversations.
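The four categories above can be expressed as a simple classification record. This is an illustrative sketch only: the enum values, field names, and `StepDiagnosis` type are hypothetical, not Converra's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class RootCause(Enum):
    """Hypothetical taxonomy mirroring the four categories above."""
    PROMPT_ISSUE = "prompt_issue"          # ~85% of observed failures
    MODEL_MISMATCH = "model_mismatch"      # ~3%
    CONFIG_ERROR = "config_error"          # ~2%
    ORCHESTRATION = "code_orchestration"   # ~10%

@dataclass
class StepDiagnosis:
    """One diagnosed failure: the exact agent, step, and classified cause."""
    conversation_id: int
    agent: str
    step: int
    root_cause: RootCause
    summary: str

# The SDR example from earlier, expressed as a diagnosis record:
diag = StepDiagnosis(
    conversation_id=4091,
    agent="SDR Agent",
    step=4,
    root_cause=RootCause.PROMPT_ISSUE,
    summary="Aggressive volume-only disqualification threshold in prompt",
)
```

Because the cause is a closed enum rather than free text, each category can route to a different remediation path (automatic prompt fix vs. a ticket for the engineering team).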

Per-step scoring

Each agent response is scored on four dimensions. Low scores on a specific metric at a specific step tell you exactly what to fix.

Intent Recognition: Did the agent understand what the user was trying to accomplish?
Relevance: Was the response appropriate for this specific point in the conversation?
Context Utilization: Did the agent use information from earlier turns?
Tool Use: Did the agent call the right tools at the right time?
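A minimal sketch of how per-step scores point at the capability to fix, assuming a 0-100 scale per dimension. The dimension keys and the `weakest_dimension` helper are illustrative, not Converra's API.

```python
# Dimension names follow the four metrics above.
DIMENSIONS = ("intent_recognition", "relevance", "context_utilization", "tool_use")

def weakest_dimension(step_scores: dict) -> tuple:
    """Return the lowest-scoring dimension for one step, with its score."""
    dim = min(DIMENSIONS, key=lambda d: step_scores[d])
    return dim, step_scores[dim]

# Step 4 of the SDR example: disqualified on attendee count alone.
step4 = {
    "intent_recognition": 25,
    "relevance": 15,
    "context_utilization": 20,
    "tool_use": 30,
}
print(weakest_dimension(step4))  # ('relevance', 15)
```

Here the lowest score (relevance, 15) flags that the response was inappropriate for this point in the conversation, even though the agent nominally understood the request.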

From diagnosis to fix — automatically

Step-level diagnosis isn't just for reports. It feeds directly into the improvement loop.

1

Diagnose

Exact step + root cause

2

Generate

Targeted fix for that failure

3

Simulate

36+ conversations, head-to-head

4

Verify

Before/after from production
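The four stages above can be outlined as a single control-flow loop. Every function here is a trivial stand-in, not Converra's implementation; only the shape of the loop is the point.

```python
def diagnose(conversation):
    """Stage 1 (stub): find the failing step and its root cause."""
    return {"step": 4, "root_cause": "prompt_issue"}

def generate_fix(prompt, diagnosis):
    """Stage 2 (stub): produce a targeted candidate prompt."""
    return prompt + "\nDo not disqualify on event volume alone."

def simulate(candidate, baseline, runs=36):
    """Stage 3 (stub): run candidate vs. baseline head-to-head."""
    return {"candidate_wins": True, "runs": runs}

def improvement_loop(conversation, prompt):
    diagnosis = diagnose(conversation)
    candidate = generate_fix(prompt, diagnosis)
    results = simulate(candidate, prompt)
    # Stage 4 (verify) would compare before/after production data;
    # here we simply promote the candidate when simulation favours it.
    return candidate if results["candidate_wins"] else prompt

new_prompt = improvement_loop(conversation={}, prompt="You are an SDR agent.")
```

The key property is that the fix in stage 2 is scoped by the diagnosis from stage 1, so the change targets one classified failure rather than the whole prompt.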

Frequently asked questions

What is step-level diagnosis for AI agents?

Step-level diagnosis identifies the exact turn in a multi-turn conversation where an AI agent's behavior caused a failure — and classifies the root cause (prompt issue, model mismatch, config error, or orchestration bug). This is more granular than conversation-level scoring ('this conversation failed') or turn-level scoring ('step 3 was bad'). Step-level diagnosis tells you why step 3 was bad and what type of fix will address it.

Why isn't conversation-level diagnosis enough?

A 5-turn conversation that fails at step 2 and a 5-turn conversation that fails at step 5 need completely different fixes. Conversation-level scoring tells you the conversation failed but not where. Without step-level diagnosis, engineers read full transcripts to find the problem — a process that doesn't scale as conversation volume grows.
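One way to make "where" concrete: scan per-step scores and return the first step whose minimum dimension falls below a threshold. A sketch under assumed conventions (0-100 scores keyed by step number); the helper and threshold are hypothetical.

```python
def first_failing_step(scores_by_step: dict, threshold: int = 50):
    """Return the earliest step with any dimension below the threshold."""
    for step in sorted(scores_by_step):
        if min(scores_by_step[step].values()) < threshold:
            return step
    return None  # no step failed

scores = {
    2: {"intent_recognition": 80, "relevance": 75},
    3: {"intent_recognition": 70, "relevance": 82},
    4: {"intent_recognition": 25, "relevance": 15},  # the failure
}
print(first_failing_step(scores))  # 4
```

A conversation-level score collapses this table to a single number; keeping the per-step breakdown is what lets the failure at step 4 be distinguished from one at step 2.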

How does step-level diagnosis work in multi-agent systems?

In multi-agent systems, Converra traces the conversation across agent boundaries. When a handoff between agents causes a failure, the diagnosis identifies both the handing-off agent and the receiving agent, the specific turn where the handoff broke, and whether the root cause is in the routing logic or the downstream agent's instructions.
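A hedged sketch of what a cross-agent diagnosis might carry: both agents involved, the turn where the handoff broke, and where the cause lives. All field names and values are illustrative, not Converra's schema.

```python
from dataclasses import dataclass

@dataclass
class HandoffDiagnosis:
    """A failure diagnosed at an agent boundary."""
    handing_off_agent: str
    receiving_agent: str
    turn: int              # the specific turn where the handoff broke
    cause_location: str    # "routing_logic" or "downstream_instructions"

# A hypothetical routing failure between two agents:
hd = HandoffDiagnosis(
    handing_off_agent="Router",
    receiving_agent="SDR Agent",
    turn=6,
    cause_location="routing_logic",
)
```

Naming both sides of the handoff matters because the same broken turn can require a fix in either place: the router's dispatch rules or the receiving agent's instructions.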

What metrics are scored at each step?

Each agent response is scored on intent recognition (did it understand the user's goal), relevance (was the response appropriate), context utilization (did it use prior conversation history), and tool use (did it call the right tools). These per-step scores pinpoint exactly which capability failed.

Can I use step-level diagnosis without the full Converra loop?

Yes. Diagnosis is available on its own — you get the exact step, failure mode, and root cause classification. But the real value comes from the full loop: Converra takes that diagnosis, generates a targeted fix, tests it in simulation, and verifies the result from production data.

See diagnosis in action

Connect your agent and see exactly where conversations break — then watch the fix generate, test, and deploy automatically.

Start for free