2024-12-17

Releasing Explorer & Testing: Visualize and Understand AI agents

We are open sourcing Explorer, an observability tool to analyze agent traces in an intuitive and visual way, and Testing, a new library for debuggable unit testing of AI agents.

Debugging Agents Through Traces

Many different types of AI agents have emerged over the past year: coding agents that tackle software development tasks, web agents that interact with websites, and others designed for productivity and cybersecurity challenges. Developers building these systems are familiar with agent traces: logs that document the step-by-step actions and decisions an AI agent makes during execution. When an issue occurs, these traces are critical for understanding behavior and identifying faults in autonomous AI systems.
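Concretely, a trace is usually just an ordered list of messages and tool calls. The snippet below is an illustrative sketch of what such a log might look like, loosely following the common OpenAI-style chat schema; the exact fields depend on the agent framework you use.

```python
# Illustrative agent trace: an ordered list of messages and tool calls.
# The schema loosely follows the common OpenAI-style chat format; your
# agent framework may record additional or differently named fields.
trace = [
    {"role": "user", "content": "Fix the failing unit test in utils.py."},
    {
        "role": "assistant",
        "content": "I'll start by reading the file.",
        "tool_calls": [
            {
                "type": "function",
                "function": {"name": "read_file", "arguments": '{"path": "utils.py"}'},
            }
        ],
    },
    {"role": "tool", "content": "def slugify(text):\n    return text.lower()"},
    {"role": "assistant", "content": "The test expects dashes instead of spaces; I'll patch slugify accordingly."},
]
```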

Nonetheless, to date, analyzing trace data with traditional means, such as raw JSON files, remains a tedious and complex task. Navigating through thousands of lines of data obscures the overall picture and makes it extremely difficult to pinpoint the root cause of a bug. To address this and help the agent builder community move faster, we are releasing two new tools today: Explorer and Testing.

Explorer: Visually Gain Insight Into Agentic Behavior

Explorer is an advanced observability tool designed to simplify the analysis of agent traces. It allows users to quickly browse and understand large volumes of trace data at a glance, making it easier to identify anomalies and critical decision points in an intuitive and visual manner.

Debugging Your Own Agent’s Traces

Trace Viewer: See all messages in a trace.

Explorer is designed to support developers throughout the process of debugging AI agents. The tool provides a visual and interactive interface for analyzing traces. By highlighting key steps in an agent’s decision-making process, Explorer lets you identify bottlenecks, anomalies, and bugs with ease.

Using Explorer, you can:

  • Visualize traces step-by-step: Gain a clear, chronological view of your agent’s actions.
  • Annotate critical moments: Add notes or explanations to key decision points for better collaboration and record-keeping with your team.
  • Filter and search efficiently: Focus on specific sections of the trace without getting lost in excessive detail.
  • Share traces: Include traces in GitHub issues and pull requests for collaborative debugging and documentation.

If an agent fails to complete a task, Explorer allows you to pinpoint exactly where things went wrong and analyze why. Whether it’s a logic flaw in the agent’s design, a tool-use issue, or an unexpected external condition, Explorer gives you the clarity you need to address the problem effectively.
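As a practical note, capturing a failing run often starts with serializing the raw message log. The sketch below assumes your agent produces a message list like the one shown earlier and simply writes it to disk as JSON; how you then import such a file into Explorer, or attach it to a GitHub issue, depends on your workflow and is covered in the Explorer documentation.

```python
import json
from pathlib import Path

# Persist a failing run so it can be inspected later, shared with your
# team, or attached to an issue. `trace` is the message list from the
# earlier example; adjust the path to fit your project layout.
Path("traces").mkdir(exist_ok=True)
with open("traces/failing_run.json", "w") as f:
    json.dump(trace, f, indent=2)
```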

Testing and Observability

Unit Tests: See assertion results from all your unit tests.

Alongside debugging, Explorer integrates seamlessly with our new agent testing library. Together, Testing and Explorer allow developers to test their agent’s capabilities while keeping track of new and old failure modes as their agent system evolves. This enables a new form of test-driven agent development, preventing capability regressions and giving agent builders confidence in the products they ship.

With Invariant Testing, developers can create controlled environments to simulate edge cases, stress scenarios, and performance benchmarks. When paired with Explorer, you can visualize the outcomes of these tests in real time, making it easier to understand your agent’s capability level and failure modes.

Together, Explorer and Testing create a powerful toolkit for building robust, reliable AI agents. To learn how to get started with Invariant Testing, visit the documentation available here. Testing is available today via the invariant-ai PyPI package.
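To give a flavor of trace-level unit testing, here is a minimal, hedged sketch written as a plain pytest test over a recorded trace. The helper functions are illustrative and do not reflect Invariant Testing’s actual API; refer to the documentation linked above for the real interface.

```python
import json

def load_trace(path):
    """Load a recorded agent trace (a list of messages) from JSON."""
    with open(path) as f:
        return json.load(f)

def tool_calls(trace, name):
    """Collect all calls to a given tool from an OpenAI-style trace."""
    return [
        call
        for message in trace
        for call in message.get("tool_calls", [])
        if call["function"]["name"] == name
    ]

def test_agent_reads_file_before_patching():
    # Path matches the serialization sketch above; adapt as needed.
    trace = load_trace("traces/failing_run.json")
    reads = tool_calls(trace, "read_file")
    assert reads, "expected the agent to inspect the file before editing it"
```

Running assertions like this over every recorded trace turns behavioral expectations into regression tests, which is the test-driven workflow described above.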

Public Registry of Agent Benchmarks

Benchmarks Registry: Explore traces from the most popular agent benchmarks.

Building on Explorer, we have also launched a public registry of agent benchmarks. We selected some of the most popular agent benchmarks, together with agents whose traces are publicly available, to demonstrate how Explorer handles different types of agents effectively.

Traditionally, agent datasets from benchmarks like SWE-bench, Cybench, or WebArena are stored in raw formats, which makes them cumbersome to browse and interpret. You have to manually parse information, often losing valuable time and context as you dig for insights. The raw JSON files are functional, but they are not human-readable: the key insights are buried in a sea of unstructured data and extra metadata, making it particularly difficult to track down bugs.

To browse these benchmarks, visit the Invariant Benchmarks Registry, where we have selected a handful of agents from SWE-bench, Cybench, WebArena, and others. We built Explorer with an easy-to-use interface optimized for trace debugging and collaboration. Logged-in users can also annotate traces and easily share them with their team for collaborative agent debugging.

Open Source

As natives of the AI ecosystem, we are committed to open source and are therefore releasing both Invariant Explorer and Invariant Testing as open-source packages. Learn how to get started using the following resources:

We invite you to try out Explorer and Testing today. Whether you're a researcher, a developer, or simply curious about agents, Explorer is designed to make your life easier and help you gain a deeper understanding of agent behavior. If you have any feedback or thoughts, please write to us directly at [email protected]; we’re eager to hear from you and welcome any open source contributions.

Authors: Lavinia Petrache, Luca Beurer-Kellner