A timeline of features, improvements, and integrations across the Converra platform.
Conversation Insights now grounds every finding in evidence. Each issue is checked against what the agent actually ran — the exact prompt and runtime configuration, the specific transcript turns it's based on, and deterministic URL checks — and the judge marks a claim unverified instead of asserting it when the source isn't in the captured trace. The result is a step-change in accuracy over the past two weeks: fabricated and misattributed findings are largely gone, and every surfaced issue links to the precise step and rule behind it.
The conversation view now leads with a plain-language summary and unifies every issue into one set of finding cards, grouped by the output field they affect. Open any finding to see its evidence — the transcript turns, the prompt rule, and the runtime context it rests on — so the judgment shows its work inline. Cleaner density, local-time dates, and the analyzed date on every insight.
Conversation Insights now generates completely and reliably. Behavioral and absence findings — “skipped qualification”, “exceeded the length cap”, “ignored prior context” — that don't map to a single verbatim quote now ground against a window of transcript turns instead of being dropped, and the summary and root cause keep every finding they can support rather than collapsing to a generic fallback when one reference doesn't resolve. Over-length or partially malformed model output is repaired and recorded instead of discarded, and batch generation is provider-neutral — so an insight is no longer lost over a few characters or a single unresolved reference.
You can now flag any finding as a false positive or as by-design behavior straight from the conversation view. The verdict is recorded against that finding and carries into how the agent's future conversations are judged — so a correction sticks instead of resurfacing.
Agents and prompts now show human-readable names across the app instead of hash-qualified identifiers. Logical role labels — orchestrator, responder, and named agents — replace raw content hashes in conversation views, scorecards, and finding context, and known brand/prompt names are surfaced where we can resolve them, so you can tell at a glance which agent a finding is about.
Tool-calling analysis now works for LangChain agents imported via LangSmith. A LangChain trace serializes tool calls differently from raw OpenAI or Anthropic, so a tool-using agent previously imported with zero tool calls — silently hiding its tool usage. Converra now reads LangChain's message format, capturing both the tool calls and the bound tool definitions, so tool-call grading lights up automatically for LangSmith customers.
PR Impact no longer credits a fix when its effect can't be isolated. A deployment whose change landed alongside other changes — so the before/after is confounded — is flagged and held back from the impact rollup instead of being reported as a clean win. PR Impact is now also available over the MCP API, so you can pull a deployment's measured effect programmatically.
The conversation success score now grades how well the agent executed — did it answer accurately and advance or qualify the conversation — rather than whether the user happened to convert. An agent isn't penalized for an outcome it doesn't control, but a passive “answers the question and stops” agent still scores low, because answering isn't advancing. Scores were recalibrated under a stamped rubric version.
The agent page now opens with a State of the Agent verdict — a plain-language read on how the agent is performing, what's working, what isn't, and the top fixes to do next, sitting right above the scorecard tiles. It surfaces the same agent-level analysis Converra already generates, promoted out of a collapsed panel so the verdict is the first thing you see.
Output Quality now rolls up from conversation-level artifact checks into a per-agent scorecard on the agent page — output score, issue and concern counts, affected conversations, weakest dimensions, and a direct conversation receipt for each surfaced example.
A new Tool Calling page scores how well each agent actually uses its tools — tool success rate, argument validity, missed tools, and per-tool health across description, parameters, and response — then surfaces the exact conversations where a tool call went wrong. Import an agent's tool definitions (paste an OpenAI or Anthropic tools array) to unlock argument-validity and schema-quality grading; or skip the paste entirely — the Converra AI SDK middleware captures them automatically, and the Node and Python SDKs now forward them too.
The Converra AI SDK middleware (v0.2.1) now captures the tools array you pass to the model — including AI SDK v5/v6 `params.tools` — and forwards it as tool definitions, with matching support in the Python SDK (v0.3.4). Tool-calling grading — argument validity and description / parameter quality — lights up automatically for instrumented agents, with no manual import.
Every issue flagged for a conversation — insight, tool-call, and output-quality findings — now lives in one Conversation Findings panel instead of scattered sections. Each finding anchors to the exact transcript turns it's based on, and cited URLs are checked against the captured runtime context, so the judgment shows its receipts inline. Cleaner row layout, no claim without evidence.
The PR Impact view now shows how prevalent each targeted failure pattern was across a deployment's conversations, with a drilldown into the specific patterns — so you can see not just that a fix shipped, but how much of the fleet it actually touched.
Agent and prompt list views now read each agent's current production score from a maintained value instead of recomputing every agent's 30-day rolling score on every page load. The score refreshes the moment a conversation is scored, with a nightly pass for window aging — so large fleets that previously took several seconds to load now render quickly, with no change to how scores are calculated.
A new Cost page — and a lens on the Agents view — tracks spend, token usage, and cost efficiency per agent, including cost per conversation and per resolution, ranked by measured spend. A model-switch simulator projects what moving an agent to a cheaper model would save, so you can weigh the trade-off before changing anything. Powered by token usage captured from LangSmith, Langfuse, or the API.
Fleet and agent views now offer a Last 90 days option in the time filter, alongside 24h / 7d / 30d / all — useful for slower-moving trends and seasonality that a 30-day window misses.
The agent analysis view (`/agents/:id`) now has a sticky "On this page" nav that groups its sections — Overview, Trace Analysis, Content Analysis, Tools, Optimization & Versions — with scrollspy highlighting and hash deep-links. The page was previously one long scroll with no way to jump between analysis types.
The `get_conversation` MCP tool now accepts an external session id — your LangSmith, Langfuse, or Converra SDK `sessionId` — as an alternative to a Converra `conversationId`, resolving it through the canonical by-session route. Coding agents can pull a conversation straight from the id they already hold.
Conversations now carry end-to-end trace lineage from your runtime. A new `GET /api/v1/traces/[traceId]/lineage` endpoint resolves the full chain behind a conversation, and the SDK surfaces it — so a Converra conversation links back to the exact trace that produced it.
Tool and function calls in a conversation now render in an always-visible Tool Execution panel — tools available vs. called, uncalled tools, and a per-call input→result drilldown. Tool execution is normalized into a single redacted read model, projected to the v1 API and MCP, and the SDK exposes nested tool traces. Tool-using conversations ingested via SDK/API no longer show empty tool data.
Evals now detect the outputs an agent produces — structured output or substantive tool-call results — and score them against the agent's own instructions across structure and methodology. Generic by design, so it works for outlines, plans, form specs, configs, or queries. Shown as Output Quality in the conversation view, and on `ConversationInsights.artifacts` via the v1 insights API and the `get_conversation_insights` MCP tool.
The conversations list can now be filtered by end-user and by any metadata you attach at ingest, so you can slice to a single customer or cohort without exporting.
Claude Opus 4.8 is now available across the model catalog, runtime registry, and per-model guidance — 1M context, 128k output, and meaningfully more reliable at catching code flaws. Opus 4.7 is retained for stored-document compatibility; no defaults were changed.
Transcript diagnosis is now anchored to the insights LLM's parsed root cause instead of running blind, and it diagnoses clean user/assistant turns rather than a system-prompt-padded slice — so the diagnosis matches the rest of the insight. Step diagnosis is also overwritten on regeneration instead of preserving stale results.
Insights can now generate an optional plain-language summary aimed at your end users, configured per customer via the v1 settings API or the `update_settings` MCP tool. Off by default; when enabled it's surfaced everywhere the technical insight is — auth + v1 routes, MCP, the `insights.generated` webhook, the SSE stream, and the conversation view.
New `/fleet/pr-impact` page — also embedded inline in Fleet — tracks each merged PR's outcome on production: conversations counted, score delta, verdict, and a per-PR drilldown into the affected agents.
New `GET /api/v1/conversations/by-session/[sessionId]` resolves a Converra conversation directly from your own session id — no MCP roundtrip required. Appending a message to an unknown session id auto-creates the conversation.
Every LLM-judged finding in the Conversation Findings panel now anchors its claims to the specific transcript turns it references. The judgment shows its receipts inline instead of asking you to trust it.
Insights now validate every URL they cite against the captured runtime context. Fabricated link claims are flagged at generation time instead of slipping into reports.
Node SDK 0.5.0, Python SDK 0.3.1, and the Vercel AI SDK middleware 0.1.2 all point at the canonical `https://converra.ai/api/v1` base URL. The Node SDK now unwraps the response envelope by default — `conv.id` reads directly from single-resource gets instead of `response.data.id`. Middleware flushes events immediately on capture instead of batching to 10.
`get_agent` and `list_agents` now return real performance scores, conversation counts, and last-analyzed timestamps from PromptMetrics — not fake-zero placeholders. Missing scores render as null, matching how `get_fleet_overview` reports them.
`get_fleet_overview` now answers "is the fleet getting better or worse, and where?" — agents and patterns are tagged with movement direction and confidence, so coding agents can triage what changed without a separate query.
Optimization detail now leads with the deployment verdict and surfaces a before/after diff of the prompt change, so the outcome and what shipped are visible without scrolling.
Node SDK 0.4.1 closes traffic-assignment gaps, preserves SDK-side test assignments across rollouts, and hardens the production A/B test rollout path.
Generated variants must conform to the prompt contract before entering simulation. Rejections are recovered as warnings and fed back into the next generation pass instead of failing the optimization.
Optimization selection, evidence, and insights now flow only through head-to-head pairs between baseline and each variant. Sub-1pp lifts are preserved, diagnosis-first variant generation is enforced, and intent constraints are aligned across every trigger path.
Status-led triage rows lead with an impact-magnitude delta chip, action patterns replace the attention card, and rows drill into pattern evidence. New triage model splits needs_fix into untriaged and fix_ready, with auto-demotion on regression. Start fix opens inline from the row.
Regression gating now uses score-delta with variance-derived slack instead of pass-rate flips, eliminating false-positive demotions. Added aggregate demotion safety net, llmParametersHash in semantic-cache provenance with re-qualification on mismatch, and a configurable maxRegressions limit.
Test a recommended fix on a controlled slice of live SDK traffic before full rollout. Configure traffic, compare baseline vs. candidate fix on scored production conversations, then promote or stop based on lift and confidence.
Audit reports now test how an agent handles AI-mediated traffic: identity acknowledgement, structured intent, multi-step delegation, and bot-block transparency.
Re-audits now highlight issues that fired in the previous run but no longer reproduce, with validated/likely-fixed confidence labels and a dedicated Fixed tab.
Weekly emails now summarize what changed across the fleet: verified wins, regressions, unfixed patterns, recent PRs, fleet score, failure rate, and conversations analyzed.
Public audit submissions now use bot checks, daily capacity limits, and email verification before execution, keeping free audits reliable without requiring a full account upfront.
Paste any AI chat agent's URL, get a graded audit at a shareable token URL in under a minute. No account required. Save the report to a Converra account to track changes over time, or share it publicly to settle 'is this agent any good?' debates.
Learn moreNew pricing structure: Free, Pay-as-you-go ($9 per audit), Pro ($299/mo with 15 audits + overage), Enterprise. Annual/monthly toggle. PAYG runs through a single Stripe-hosted checkout — no contract, no commitment.
Learn moreAudit reports include an agent-specific custom metrics section — a medical intake bot is graded against different criteria than a returns assistant. The default rubric still applies; custom metrics layer on top.
Re-audit a saved report and see what actually changed: score deltas per dimension, scenarios that flipped pass/fail, and a verdict shift banner at the top. Tracks agent improvement over time without manual comparison.
Save any audit report to a Converra account with a single magic-link click — no password setup. Anonymous reports you ran before signing up automatically attach to the new account by browser attribution.
Reports now show the eval set used for scoring. A 73 on Eval Set v3 is comparable across re-audits — the rubric isn't silently shifting underneath you.
Three new fields on optimization results explain why a winner was picked: confidence score, plain-language selection reasoning, and per-scenario regression results — so you can see how each variant performed against each test case. Plus a new validate_variant MCP tool for running ad-hoc validations.
Optimization process status now flows from a single derived source — eliminating cases where the UI showed 'running' after the backend had finished, and where auto-restart fired on terminal failures. The status displayed in dashboards and PRs is always the canonical one.
If you run the same agent (Discovery, Support, etc.) across multiple end-customers, Converra now treats the whole role as a family. PRs surface a 'Role family' section listing every customer the change touches, and API + MCP responses include agentType, customer, and instanceCount so any consumer immediately sees the 5-agents-across-30-customers shape instead of a flat list.
Jump straight to the conversation that's hurting your agent the most, see the diagnosis, and ship the fix in one click. No digging through dashboards to figure out where to start.
Get a Slack ping the moment Converra flags a bad conversation — with a link straight to the diagnosis. Stop finding out about quality issues from your customers.
New append endpoint lets you stream turns into an existing conversation instead of waiting for it to end and posting the full transcript. Lower memory, real-time insights, and better fit for long-running voice and chat sessions.
Learn moreShare a single link to bring teammates into your Converra account. No more one-by-one email invites or waiting for approvals.
Pending-deployment and template aggregate pages now lead with critical issues instead of burying them under noise. The things that actually break your agent show up at the top.
Live score window widened from 7 to 30 days, and the minimum sample size dropped from 15 to 5. Agents with modest traffic now keep a stable, meaningful score instead of going dark between bursts.
One method to log conversations. Pass your messages and Converra handles the rest — create a new conversation or append turns to an existing one. Each message can optionally carry model, tool calls, token usage, and latency so you get trace-level detail without a tracing pipeline. Available in both Node.js and Python SDKs.
Learn moreWhen a prompt orchestrates multiple sub-agents (greeter, researcher, closer), failures are now attributed to the specific agent that caused them instead of defaulting to the primary. Cleaner diagnosis, targeted fixes.
Six new MCP tools that answer 'how is Converra helping my agents?' Fleet overview with real fleet score and failure rate. Daily score timeline with deployment markers. Cumulative impact summary. Verification evidence — the actual conversations behind every claim. Fixability breakdown showing what Converra can fix vs what needs engineering.
Every verification claim now has receipts. When Converra says '82% reduction,' you can drill into the actual pre-fix and post-fix conversations that were counted. Evidence is captured at verification time and retrievable via MCP or API.
Step failure aggregation now shows what percentage of failures Converra can fix autonomously vs what needs engineering work. Matches the Failure Triage card on the Fleet page.
When a fix resolves one failure, conversations survive longer and may hit new issues downstream. Converra now detects these cascade effects and surfaces them on the fleet card — so you know the fix worked even when the overall score doesn't move.
Deployed fixes are now verified against production conversations. Fleet cards show before/after failure rates with 'Fixed — X% reduction' badges. Verification is graduated: monitoring → likely fixed → verified — so you see early signals before waiting for full statistical confidence.
When an optimization doesn't find a winner, Converra now classifies why it failed, learns from the outcome, and automatically restarts with an adapted strategy. The same improvement loop we run on your agents now runs on ours.
Agent instructions now have full version history with lineage tracking. When one agent in a sibling group gets optimized, the others show staleness indicators so you know which agents are falling behind.
Preview the full pull request before creating it — the diff, metrics comparison, diagnosed issues, regression results, and conversation replay. Review everything in one place before committing to the PR.
Optimization PRs are now structured for AI code reviewers. Full context, metrics, evidence, and before/after comparisons are embedded inline so Claude Code, Copilot, and other AI reviewers can evaluate the change without needing access to Converra.
Every optimization now shows side-by-side comparisons of how the agent responded before and after the change. See the actual improvement in context, not just a score delta.
Instead of a separate email for every diagnosed conversation, you now get one agent-level alert that summarizes all issues across conversations. Less noise, same signal.
The copy fix button now includes testing methodology, regression results, real production conversation quotes, and review guidance — everything a reviewer needs without opening Converra.
Define evaluation rules at the organization level that apply across all your agents. Custom rules are plumbed directly into evaluation prompts so every optimization and insight respects your business-specific quality standards.
Each agent in a multi-agent conversation now gets its own scoped insights. Secondary agents surface their own failure patterns and performance metrics instead of being rolled into the primary agent's analysis.
Agent and fleet insights now show behavior-specific failure patterns with real conversation counts — not generic buckets. Each pattern links directly to the affected conversations. Cost and upside cards estimate the business impact of fixing the top issues.
Agent issues are now deep business insights, not metric labels. Each issue includes a headline, evidence from diagnosed conversations, a recommended fix, the team that owns it, and whether Converra can auto-fix it via prompt optimization.
Converra now creates pull requests in your GitHub repos when optimizations find improvements. Connect your GitHub, and optimization winners are automatically sent as PRs with metrics, evidence, and a one-click merge path. Supports auto-PR on completion, manual PR creation, merge-back sync, and Python agent file detection.
Learn moreWhen a benchmark finds a better model, Converra automatically opens a GitHub PR to switch the model config in your repo — complete with a comparison table showing quality score, cost, and latency across difficulty levels.
Model benchmarks now live on their own page with a dedicated nav entry. Browse all benchmark runs, view inline conversation scores, and launch new comparisons without leaving context.
See every agent's health at a glance. A single dashboard with optimization progress over time, a scoreboard ranking agents by performance, failure distribution, improvement potential, and pending deploys — everything you need to decide what to work on next.
The guided tour now starts with the Fleet page and walks through connecting your first agent. A faster path from signup to value.
Import converra/auto and every LLM call in your app is captured automatically — no wrapper functions, no code changes. Works with OpenAI, Anthropic, and Vercel AI SDK.
Langfuse and OpenTelemetry integrations now match LangSmith feature-for-feature — async sync triggers, pre-flight validation, usage limit checks, and import metrics.
Wrap your OpenAI, Anthropic, or Vercel AI SDK client with one line. Every conversation is captured automatically. Multi-agent tracing links orchestrator and sub-agent calls into a single execution graph. A/B variant swapping tests optimized prompts against real traffic.
Learn morepip install converra — Python SDK with sync/async/streaming support for OpenAI and Anthropic. LangChain callback handler included.
New API endpoints for SDK integration — prompt matching by content hash, active variant lookup for A/B testing, bulk SDK configuration endpoint. Testing mode setting (proxy/simulation) added to dashboard.
Send traces directly to Converra via the SDK (converra.traces.create) — no LangSmith, Langfuse, or OTel pipeline required. The fastest path from your agent to Converra.
Learn moreGive the optimizer direct feedback. Thumbs up/down from the UI or programmatic feedback via MCP tools — both feed into the optimization agent's planning so it learns from your judgment, not just metrics.
Benchmark comparisons now show actual per-conversation scores inline. See exactly how each model performed, not just a summary.
Redesigned optimization results with an activity card and deploy banner. Clearer post-optimization experience so you can review and deploy faster.
Get notified when optimizations complete or conversation syncs finish. Real-time alerts in the app so you never miss a result.
Step-level failure diagnosis now runs on every conversation — not just low-scoring multi-agent traces. Every agent gets actionable root cause analysis regardless of score or architecture.
Waitlist removed. Sign up with email or Google and start connecting agents immediately.
Start using Converra at no cost. The free tier includes conversation imports, insights, and a limited number of optimizations so you can evaluate before committing.
Focus optimization on what matters most. Choose from 24 built-in focus areas or define custom goals — simulations, evaluations, and variant generation all align to your intent.
Re-optimize agents that are already in monitoring state. Unresolved issues from prior runs carry forward automatically so the optimizer picks up where it left off.
Failures across your agents are now grouped by root cause category in the Systems view. Quickly spot whether issues stem from hallucinations, instruction gaps, tool errors, or context limits.
See recurring failure patterns for individual prompts. Identify which failure types affect each agent so you can prioritize the highest-impact fixes.
Step-level failure diagnosis now shows the actual conversation messages exchanged during the failing step, giving you full context without leaving the diagnosis view.
Import production conversations from any OpenTelemetry-compatible tracing pipeline. Connect Axiom or other OTel backends to automatically sync your agent's traces.
Learn moreProduction user feedback is now surfaced in conversation insights and factored into evaluation scores. See what real users thought alongside AI analysis.
Optimization automatically triggers when step diagnosis detects fixable failures. Winners can auto-deploy with settings-gated controls — no manual intervention needed.
Winners are automatically tested against a golden set of scenarios before deployment. Catch regressions before they reach production.
Learn moreDefine business-specific metrics beyond the built-in evaluation suite. Measure what matters most for your agent's domain.
Compare model performance side-by-side. Run the same scenarios across different LLMs to find the best fit for your agent.
Redesigned conversation insights with above-the-fold metrics, prompt links, and consolidated qualitative sections.
Multi-agent simulations now inject synthetic orchestrator context for higher fidelity. Simulated conversations reflect how your agents actually interact in production.
Learn moreOptimizations now target the specific agent step responsible for diagnosed failures, instead of optimizing blindly.
See recurring failure types across all your agents at a glance. Spot systemic issues before they become customer-facing problems.
Converra launches. Autonomous agent optimization with simulation testing, real-time performance tracking, and continuous prompt improvement.
Pinpoints which step in a multi-agent conversation caused a failure. See the execution flow, identify the responsible agent, and get actionable fix recommendations.
Learn moreThe optimizer now detects and resolves contradictions, redundancy, and formatting issues in your agent's instructions — not just metric-driven changes.
Extract variables from agent instructions and deploy optimized variants across sibling agents that share the same structure.
Converra is now accessible as an MCP server — manage agents, run simulations, and trigger optimizations from any MCP-compatible client.
Learn moreBreak down agent performance by segment. See which parts of your agent's instructions contribute most to success or failure.
Mark sections of your agent's instructions as protected so the optimizer preserves them during variant generation.
Import production conversations from Langfuse with continuous sync. Supports self-hosted instances and multi-agent trace detection.
Learn moreVariant selection now uses persona-level head-to-head comparisons as the single source of truth, eliminating false positives from aggregated scores.
Import production conversations directly from LangSmith. Connect your existing tracing pipeline to Converra without code changes.
Learn moreAutomatically detect multi-agent architectures from ingested conversations. See which agents participate and how they hand off.
Archive and delete conversations in bulk. Filter, select, and clean up your conversation data at scale.
Connect your agent and start seeing improvements in minutes.
Start for free