Evals and manual analysis tell you what failed. Turn-level attribution tells you why and where - so you fix the right layer the first time.
A low eval score tells you something failed, but not why - it doesn't distinguish whether the prompt, model, or tool caused it.
Production failures come from combinations and edge cases that static test sets don't cover.
The failure might have happened at turn 3 of 8 - by the final output, the root cause is buried.
Reading traces works for 10 conversations. At 1,000/day, you need automated attribution.
Based on patterns from production agent conversations, every issue traces back to one of five layers: prompt, context, tool, model, or code. Fixing the wrong layer wastes engineering time and leaves the real problem in place.
Prompt: Ambiguous instructions, missing guardrails, conflicting directives, or overly aggressive thresholds in the system prompt cause the agent to behave incorrectly.
Context: The agent loses track of earlier conversation context, retrieves wrong documents, or fails to carry forward critical information from previous turns.
Tool: The agent calls the right tool with wrong parameters, calls the wrong tool entirely, or fails to call a tool when it should.
Model: The underlying model can't handle the task at the required quality level - reasoning failures, instruction-following gaps, or output format inconsistencies.
Code: The issue isn't in the agent's AI behavior at all - it's in the surrounding application code. API errors, incorrect data formatting, broken integrations, or misconfigured orchestration.
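The five layers above form a small taxonomy that an attribution result can point into. A minimal sketch in Python - all names here (`FailureLayer`, `Attribution`) are illustrative, not Converra's actual API:

```python
from dataclasses import dataclass
from enum import Enum

class FailureLayer(Enum):
    """Hypothetical taxonomy of the five failure layers."""
    PROMPT = "prompt"    # ambiguous instructions, missing guardrails
    CONTEXT = "context"  # lost or wrong context across turns
    TOOL = "tool"        # wrong tool, or right tool with wrong parameters
    MODEL = "model"      # reasoning or instruction-following gaps
    CODE = "code"        # bugs in the surrounding application code

@dataclass
class Attribution:
    """One classified failure: where it surfaced, which layer, and why."""
    turn_index: int        # the turn where the conversation diverged
    layer: FailureLayer    # the layer that caused it
    evidence: str          # specific evidence for the classification

# Example finding: a failure surfacing at turn 3, traced to the prompt layer.
finding = Attribution(
    turn_index=3,
    layer=FailureLayer.PROMPT,
    evidence="disqualification threshold in the system prompt fired on a valid case",
)
```

The point of the structure is that a fix targets `finding.layer`, not the turn where the symptom appeared.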
Capture: Every turn is recorded - system prompts, user inputs, agent responses, tool calls, retrieved documents. Nothing is summarized or dropped.
Evaluation: Each agent turn is evaluated against intent alignment, context usage, relevance, and tool appropriateness. This identifies the exact turn where the conversation diverged from the expected path.
Attribution: The failing turn is analyzed against the prompt instructions active at that point, the context available, tools called, and model behavior. This produces a classification - prompt, context, tool, model, or code - with specific evidence.
Fix: A fix specific to the identified layer is generated, tested in simulation against the failure scenario and similar cases, regression-checked, and deployed with production verification.
How does turn-level attribution work?
Each conversation turn is scored independently against the agent's goals and behavioral expectations. When a turn scores below threshold, the system classifies why - analyzing the prompt instructions active at that turn, the context available, tools called, and model behavior. This produces a root cause classification (prompt, context, tool, model, or code) with specific evidence.
How is this different from evals?
Evals grade outputs against expected results. Turn-level attribution diagnoses why an output went wrong by tracing through each conversation step. Evals tell you your agent scored 60% on a test set. Attribution tells you 55% of failures come from an overly aggressive disqualification threshold in paragraph 3 of your system prompt.
Does it work for multi-agent systems?
Yes. In multi-agent architectures, attribution traces which agent in the chain caused the failure - not just that the end-to-end result was wrong. When Agent A passes bad context to Agent B, the system attributes the failure to Agent A's context handling, not Agent B's response to bad input.
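The upstream-attribution idea can be shown in a few lines: walk the agent chain in order and blame the first agent whose output already violated expectations, rather than the downstream agent where the error surfaced. A hypothetical sketch (the record shape and `output_ok` flag are assumptions, not a real interface):

```python
def attribute_in_chain(chain: list[dict]) -> str:
    """chain: ordered records like {"agent": name, "output_ok": bool}.
    Returns the first agent whose output was already bad - the root cause."""
    for step in chain:
        if not step["output_ok"]:
            return step["agent"]
    return "none"  # no step failed

chain = [
    {"agent": "A", "output_ok": False},  # A passed bad context downstream
    {"agent": "B", "output_ok": False},  # B failed, but only given bad input
]
root = attribute_in_chain(chain)  # blames A, not B
```

In practice "output_ok" is itself a judged quantity, but the ordering principle is the same: the earliest divergence owns the failure.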
What happens after the root cause is identified?
Converra generates a targeted fix for the identified layer - a prompt variant for prompt issues, a context restructuring for context loss, a tool description update for tool misuse. The fix is tested in simulation against the failure scenario and similar scenarios, regression-checked, and deployed with production verification.
How accurate is the classification?
Classification accuracy improves with data. The system uses the full conversation trace, not just the failing turn - it considers what information was available, what instructions were active, and what the agent did at each step. Ambiguous cases are flagged for review rather than silently misclassified.
Connect your agent and see exactly where issues originate - then watch Converra fix them automatically.
Start for free