2024-07-10

What we've learned from analyzing hundreds of AI web agent traces

We discover, analyze and fix web agent failures in hundreds of traces, achieving up to 16% improvements on WebArena.

There is a significant amount of hype surrounding AI agents, but close inspection of the agent ecosystem shows that most existing agents are stuck in the demo phase and far from deployment. In this blog post, we dive deep into the traces produced by a leading web agent and investigate its most frequent failure modes. It turns out that many agent failures are easy to fix and avoidable: we address these issues and achieve up to 16% improvements on the widely used WebArena web agent benchmark.

What are AI agents?

Defining what constitutes an AI agent is challenging, but, for the sake of this blog post, they can generally be considered compound AI systems that combine reasoning and tool usage to accomplish user-specified tasks, such as online shopping or navigating a map. Currently, we are witnessing a surge in diverse applications of these agents: code agents assist in writing and debugging software, customer support agents manage and respond to customer inquiries, and web agents automate a variety of tasks within web browsers.

In this blog post, we specifically look into web agents, whose goal is to automate typical tasks in the browser. This includes tasks like online shopping, booking a flight or ordering food. To succeed, the agent needs to: (1) understand the content of a given web page, which can be represented in many different ways (e.g. HTML code, a UI accessibility tree or simply a screenshot); (2) reason about possible next steps; and (3) perform the desired action in the environment it operates in (clicking, typing, etc.). Each of these aspects can be very challenging, as the agent autonomously explores web pages to find buttons and text inputs, or even backtracks whenever it cannot find the desired information.
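
To make this loop concrete, here is a minimal sketch of the observe-reason-act cycle in Python. The browser and call_llm helpers are hypothetical placeholders for illustration, not the actual implementation of any of the agents discussed below.

# Minimal sketch of the observe-reason-act loop described above.
# `browser` and `call_llm` are hypothetical placeholders rather than
# the actual WebArena or SteP implementation.
def run_agent(task: str, browser, call_llm, max_steps: int = 30):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        # (1) Observe: get a textual representation of the current page,
        # e.g. a UI accessibility tree, HTML, or a screenshot description.
        history.append({"role": "tool", "content": browser.observe()})
        # (2) Reason: ask the LLM for the next action given the history.
        action = call_llm(history)  # e.g. {'name': 'click', 'arguments': {...}}
        history.append(action)
        # (3) Act: execute the chosen action, or stop and return the answer.
        if action["name"] == "stop":
            return action["arguments"].get("answer")
        browser.execute(action)
    return None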

In this post, we investigate the top performing web agent on the WebArena benchmark. We are used to hearing that LLM benchmarks are getting saturated - for instance, the recent Claude-3.5-Sonnet model scores almost 90% on the prominent MMLU benchmark. However, when we turn to web agent benchmarks, the results paint a very different picture, which suggests that agent tasks are significantly more difficult than question-answering and other text generation tasks. For instance, the strongest agent on the WebArena leaderboard relies on the SteP algorithm and still fails on a large majority of the tested agent tasks.

While the full WebArena benchmark consists of 5 different websites, in this blog post we focus on two of them: OpenStreetMap and ShoppingAdmin. Tasks in OpenStreetMap typically ask the agent to find directions between locations (e.g. "closest ice cream shop to Carnegie Mellon University"), while tasks in ShoppingAdmin require administering the content of a shopping website (e.g. "increase the price of product X by 10%").

Why do web agents fail?

Improving the performance of an agent requires understanding its failure modes. An important starting point is to log all agent executions in the form of traces so that we gain visibility into the agent's behavior. If you are not logging your traces, I would strongly encourage you to start. For the sake of this blog post, each trace is logged as a simple JSON file, even though, of course, more sophisticated ways of logging are available. Once we have stored all the traces, we can start analyzing them to find the most common failure modes our agent runs into during execution.
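
As a reference point, a minimal version of this kind of JSON trace logging could look as follows. The file layout and helper names are our own illustration, not the format of any particular framework.

import json
from pathlib import Path

def log_trace(trace: list[dict], trace_dir: str, task_id: str) -> None:
    # `trace` is the list of user/assistant/tool messages and tool calls
    # collected during one agent run (see the example trace below).
    path = Path(trace_dir)
    path.mkdir(parents=True, exist_ok=True)
    (path / f"{task_id}.json").write_text(json.dumps(trace, indent=2))

def load_traces(trace_dir: str) -> dict[str, list[dict]]:
    # Load all logged traces back into memory for offline analysis.
    return {p.stem: json.loads(p.read_text()) for p in Path(trace_dir).glob("*.json")}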

Trace Analysis

Analyzing agent traces is one of the most powerful ways to debug AI agents and helps us understand the most common mistakes our agent makes.

For instance, in our web agent a concrete trace could look like this:

{'role': 'user', 'content': 'Tell me the email address of the contributor who has the most commits to branch main'}
{'role': 'tool', 'content': 'Tab 0 (current): CellularPrivacy / Android-IMSI-Catcher-Detector · GitLab\n\n[1] RootWebArea...'}
{'role': 'assistant', 'content': 'To find the email address of the contributor with the most commits to the main branch, we ...'}
{'name': 'click', 'arguments': {'element_id': '347', 'extra_arg': "link 'Repository'"}}
{'role': 'tool', 'content': 'Tab 0 (current): Files · development · CellularPrivacy / Android-IMSI-Catcher-Detector · G...'}
{'role': 'assistant', 'content': 'To find the email address of the contributor with the most commits to the branch main, we ...'}
{'name': 'click', 'arguments': {'element_id': '3740', 'extra_arg': "link 'Contributors'"}}
{'role': 'tool', 'content': 'Tab 0 (current): Contributors · CellularPrivacy / Android-IMSI-Catcher-Detector · GitLab\n\n...'}
{'role': 'assistant', 'content': 'The objective is to find the email address of the contributor who has the most commits to ...'}
{'name': 'stop', 'arguments': {'answer': '[email protected]'}}

The trace typically begins with a message from the user, asking the agent to perform a given task. In this example, the task is to find the e-mail address of the contributor with the most commits to the main branch of a given repository. From there, the agent starts to use different browser "tools" (e.g. clicking, navigating, typing) while also continuously explaining in messages what it is doing. Each tool use results in a tool output: for our web agent, this is the UI accessibility tree of the web page content as extracted from the browser after performing the respective action. Finally, in the last step, the agent invokes the "stop" tool, which returns the final result to the user.

The trace shown above is an example of a successful run where the agent achieves its goal. To learn how our agent fails, however, we are much more interested in what the bad traces look like. To answer this question, we analyzed hundreds of web agent traces and discovered several common failure modes, discussed in the following sections.
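
A minimal sketch of how such a scan over logged traces could look is shown below. The helper is our own illustration and assumes the trace format shown above, where tool calls carry a name and arguments field.

import json
from pathlib import Path

def max_consecutive_repeats(trace: list[dict]) -> int:
    # Length of the longest run of identical consecutive tool calls.
    calls = [(m["name"], json.dumps(m.get("arguments", {}), sort_keys=True))
             for m in trace if "name" in m]
    longest, current = 0, 0
    for prev, cur in zip(calls, calls[1:]):
        current = current + 1 if prev == cur else 0
        longest = max(longest, current)
    return longest + 1 if calls else 0

# Flag traces where the same action is repeated at least 15 times in a row.
stuck = {}
for path in Path("traces").glob("*.json"):
    trace = json.loads(path.read_text())
    if max_consecutive_repeats(trace) >= 15:
        stuck[path.name] = trace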

Looping

The most common complaint about agents is that they easily get stuck in loops, endlessly repeating the same actions, such as clicking the same button or typing into the same textbox. Our first step in investigating this behavior is to find all traces where the agent gets stuck repeating the same action many times (at least 15 times). Let's look at the sequence of tool calls in one such trace (omitting the tool outputs for now):

{'role': 'user', 'content': 'Find the bar around Carnegie Music Hall'}
{'name': 'type', 'arguments': {'text': 'Carnegie Music Hall\n', 'element_id': '35', 'extra_arg': "textbox 'Search' focused: True required: False"}}
{'name': 'type', 'arguments': {'text': 'bar near Carnegie Music Hall\n', 'element_id': '35', 'extra_arg': "textbox 'Search' focused: True required: False"}}
{'name': 'type', 'arguments': {'text': 'bar near Carnegie Music Hall\n', 'element_id': '35', 'extra_arg': "textbox 'Search' focused: True required: False"}}
{'name': 'type', 'arguments': {'text': 'bar near Carnegie Music Hall\n', 'element_id': '35', 'extra_arg': "textbox 'Search' focused: True required: False"}}
{'name': 'type', 'arguments': {'text': 'bar near Carnegie Music Hall\n', 'element_id': '35', 'extra_arg': "textbox 'Search' focused: True required: False"}}

[truncated: the above line repeated 15 times]

{'name': 'type', 'arguments': {'text': 'bars near Carnegie Music Hall\n', 'element_id': '35', 'extra_arg': "textbox 'Search' focused: True required: False"}}
{'name': 'type', 'arguments': {'text': 'bars near Carnegie Music Hall\n', 'element_id': '35', 'extra_arg': "textbox 'Search' focused: True required: False"}}

Here we can see that the same call is repeated 20 times and the agent is clearly stuck. To understand what is happening, let's look at the difference in page content between two successive tool calls:

{'name': 'type', 'arguments': {'text': 'bar near Carnegie Music Hall\n', 'element_id': '35', 'extra_arg': "textbox 'Search' focused: True required: False"}}
> diff:  [] StaticText 'Carnegie Music Hbar near Cbar near Carnegie Music Hallarnegie Music Hallall'
{'name': 'type', 'arguments': {'text': 'bar near Carnegie Music Hall\n', 'element_id': '35', 'extra_arg': "textbox 'Search' focused: True required: False"}}
> diff:  [] StaticText 'Carnegie Music Hbar near Cbar near Cbar near Carnegie Music Hallarnegie Music Hallarnegie Music Hallall'

The above diff shows what is going on: the agent first types 'Carnegie Music Hall' into the textbox, which returns no results. The agent then recognizes that it needs to revise its search query and types 'bar near Carnegie Music Hall' into the search box. However, instead of replacing the previous query, the new characters are appended to the existing text. If we look into the Playwright code that executes the agent actions, the problem turns out to be an interaction issue between the agent and the environment: the agent has no way to overwrite the existing text in the textbox.
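
For completeness, the kind of observation diff shown above can be computed with Python's standard difflib. The snippet below is our own minimal sketch; the original analysis may use a different diffing strategy.

import difflib

def observation_diff(prev_obs: str, new_obs: str) -> list[str]:
    # Lines present in the new accessibility tree but not in the previous one.
    return [line[2:]
            for line in difflib.ndiff(prev_obs.splitlines(), new_obs.splitlines())
            if line.startswith("+ ")]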

Fixing Repeated Typing: To fix this repeated typing bug, we change the type action to overwrite the text in the textbox instead of appending to it (by applying Ctrl+A Ctrl+Backspace before typing). Of course, if the agent indeed needs to append text, it now has to overwrite the textbox with the concatenation of the existing and the new text. However, this remains possible under the revised action implementation, meaning the agent can now handle both cases.
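
A sketch of the revised type action is shown below, assuming a Playwright page object and a CSS selector already resolved from the accessibility-tree element id. The integration into the actual agent harness is simplified here.

# Sketch of the revised `type` action (Playwright sync API). Resolving the
# element_id to a selector is assumed to happen elsewhere in the harness.
def type_overwriting(page, selector: str, text: str) -> None:
    locator = page.locator(selector)
    locator.click()  # make sure the textbox is focused
    # Clear any existing content before typing (Ctrl+A, Ctrl+Backspace).
    page.keyboard.press("Control+A")
    page.keyboard.press("Control+Backspace")
    page.keyboard.type(text)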

Hallucinations

It is well known that AI agents can "hallucinate" and invent information that has no factual basis. During trace analysis, we noticed that the agent can enter made-up e-mail addresses and personal information it was never given. For example, in the trace below the agent is tasked with writing an e-mail to a shop owner, but hallucinates both a name and a non-existent e-mail address for the customer.

{'role': 'user', 'content': 'Draft an email to the shop owner via their contact us function for a coupon as I am a stud...'}
{'name': 'click', 'arguments': {'element_id': '365', 'extra_arg': "link 'Contact Us'"}}
{'name': 'type', 'arguments': {'text': 'Your Name', 'element_id': '3001', 'extra_arg': "textbox 'Name*' required: True"}}
{'name': 'type', 'arguments': {'text': 'Alex Johnson', 'element_id': '3001', 'extra_arg': "textbox 'Name*' focused: True required: True"}}
{'name': 'type', 'arguments': {'text': '[email protected]', 'element_id': '3007', 'extra_arg': "textbox 'Email*' required: True"}}
{'name': 'type', 'arguments': {'text': 'Dear Shop Owner, I am currently a student and a frequent shopper at your store. I am reaching out to inquire if you offer any discounts or coupons for students. Your products have been of great help in my studies and daily life, and any form of discount would greatly assist me in continuing to purchase from your store. Thank you for considering my request. Looking forward to your positive response. Best regards, Alex Johnson', 'element_id': '3019', 'extra_arg': "textbox 'What’s on your mind? *' required: True"}}
{'name': 'click', 'arguments': {'element_id': '2872', 'extra_arg': "button 'Submit'"}}
{'name': 'stop', 'arguments': {'answer': 'Form submitted successfully'}}

Fixing Hallucination Issues: Hallucination is one of the most difficult problems with LLMs, and a lot of research on it is still ongoing. In this case, we can try to improve the agent prompt, instructing the agent to avoid passing memorized information to tools and to instead rely on information it has extracted from the environment. Of course, this is not a complete solution, as hallucination can be very subtle and hard to detect, but it is a good starting point.
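
For illustration, such an instruction could be appended to the agent's system prompt along the following lines. The wording is a sketch rather than the exact prompt we used, and a prompt change alone does not eliminate hallucinations.

# Illustrative prompt addition; the base prompt below is a placeholder.
BASE_SYSTEM_PROMPT = "You are a web agent that completes tasks in a browser."
ANTI_HALLUCINATION_RULE = (
    "When filling in forms or composing messages, only use names, e-mail addresses "
    "and other personal details that appear in the task description or in the "
    "observed page content. If a required value is not available, say so in your "
    "final answer instead of inventing one."
)
system_prompt = BASE_SYSTEM_PROMPT + "\n\n" + ANTI_HALLUCINATION_RULE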

Environment Errors

Looking further into the traces with repeated actions, we noticed that the word combobox appears often (a combobox is a UI element that represents a dropdown menu). We therefore specifically looked at traces where (1) some action repeats at least 5 times and (2) the word combobox appears in the agent messages.

Here is one of the traces we observe:

{'role': 'user', 'content': 'Show me the "Canon photo printer" listings by search relevance, from most to least.'}
{'name': 'type', 'arguments': {'text': 'Canon photo printer\n', 'element_id': '558', 'extra_arg': "combobox '\\ue615 Search' autocomplete: both hasPopup: listbox required: False expanded: False"}}
{'name': 'click', 'arguments': {'element_id': '2881', 'extra_arg': "combobox 'Sort By' hasPopup: menu expanded: False"}}
{'name': 'click', 'arguments': {'element_id': '2881', 'extra_arg': "combobox 'Sort By' focused: True hasPopup: menu expanded: False"}}
{'name': 'click', 'arguments': {'element_id': '2881', 'extra_arg': "combobox 'Sort By' focused: True hasPopup: menu expanded: False"}}
{'name': 'click', 'arguments': {'element_id': '2881', 'extra_arg': "combobox 'Sort By' focused: True hasPopup: menu expanded: False"}}
...

Clearly, there is something wrong with the agent's ability to select an option from a dropdown menu. It turns out that the WebArena environment interface does not allow the agent to select from a dropdown menu, due to issues with UI accessibility trees in Playwright (as also discovered by other users of the benchmark).

Fixing The Dropdown Issue: To solve this problem, we expose a new action select_option[id=label] to the agent. When the agent uses this action, we do not rely on the accessibility tree for option selection, but instead use a heuristic that locates the HTML option element directly by checking for the given label. In practice this works well and would only cause problems if two different dropdown menus offered the same label as an option.
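
A minimal sketch of this heuristic using the Playwright sync API is shown below; the element-id plumbing of the actual action implementation is omitted.

# Sketch of the select_option[id=label] action. We bypass the accessibility
# tree and match the <option> element directly by its visible label.
def select_option_by_label(page, label: str) -> bool:
    for option in page.query_selector_all("option"):
        if option.inner_text().strip() == label:
            # Select the matching label on the enclosing <select> element.
            select = option.evaluate_handle("node => node.closest('select')").as_element()
            if select is not None:
                select.select_option(label=label)
                return True
    return False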

Ignoring Parts of the Instructions

Agents are often somewhat "lazy" and return a response based only on the first observation they receive, ignoring parts of their instructions in the process. Below is a trace recorded for an agent queried for the best-selling products in Jan 2023 in an online shop environment. The front page has a "best-selling products" section, but it corresponds to the current time period in 2024, and querying for Jan 2023 would yield a different answer. However, our agent simply returns a result based on the front-page data, ignoring the time period constraint.

{'role': 'user', 'content': 'What are the top-3 best-selling product in Jan 2023'}
{'name': 'stop', 'arguments': {'answer': 'The top-3 best-selling products in Jan 2023 are: 1. Sprite Stasis Ball 65 cm, 2. Sprite Yoga Strap 6 foot, 3. Quest Lumaflex™ Band.'}}

Fixing Instruction-Following: This is again difficult to fix, as the issue lies not with the agent scaffolding itself but with the instruction-following capabilities of the underlying model. However, we can improve our agent by instructing the model to never return a result based only on the front page when the question involves a specific time period.
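
Beyond prompting, such "lazy" answers are also easy to surface during trace analysis. The check below is our own illustration: it simply flags runs where the agent answers without taking any other action.

def answered_without_navigation(trace: list[dict]) -> bool:
    # True if the agent's only tool call in the trace is `stop`,
    # i.e. it answered directly from the first observation.
    tool_calls = [m for m in trace if "name" in m]
    return len(tool_calls) == 1 and tool_calls[0]["name"] == "stop"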

Other Errors

We also find many other types of failure; however, these are often caused by the way the WebArena benchmark is constructed. Firstly, we find cases where the benchmark judges an agent action to be correct if and only if the output contains an exact match of a certain test string. For instance, in a task where the agent is asked to compose a product review, the trace is only considered correct if the output contains the string "I am a loyal customer", which is overly restrictive as an agent might phrase this statement differently while still saying the same thing. Secondly, some tasks are difficult even for humans to solve because of environment sensitivity. For instance, OpenStreetMap may not recognize "A next to B" queries but may recognize "A near B" queries, which requires domain knowledge about the OpenStreetMap-specific search function. While these issues indicate a need for a more robust benchmark, they are not directly fixable by the agent itself, and we therefore do not focus on them in this blog post.

Improved Results

We implemented fixes for all the issues diagnosed above and re-ran our improved version of the SteP agent. We also replaced the base agent model with the latest Claude-3.5-Sonnet model. These improvements increase the success rate on the ShoppingAdmin subset by 7 percentage points (from 24% to 31%) and on OpenStreetMap by 16 percentage points (from 30% to 46%), compared to the latest SteP agent results at the time of writing. While performance on these benchmarks is still far from satisfactory, the results indicate that simple improvements can lead to a relatively significant increase in performance.

Dataset                  SteP    SteP-Improved
WebArena-OpenStreetMap   30%     46%
WebArena-ShoppingAdmin   24%     31%

It is interesting to compare the performance increase obtained from these simple fixes to the increase obtained from algorithmic advances. For instance, the SteP algorithm increases the success rate from roughly 20% to 30% compared to a simple baseline agent. This suggests that the most powerful way to build agents is to combine algorithmic advances with deep investigation of the underlying agent problems, fixing those problems directly.

A Way Forward to Better Agents

Now that we have identified various sources of agent errors, the question is how to fix them and move agents from the current demo stage to production, realizing their full potential.

We believe that several things will be key in this transition:

More Capable Base Models

The most obvious idea is to simply upgrade the underlying LLM to a more capable model. In our case, we found the recently released Claude 3.5 Sonnet to perform better than GPT-4o, which was used in the original experiments. This can be considered an instance of the bitter lesson, as improvements in prompting and algorithms can often be overshadowed by simply using bigger and more capable base models.

Environment-Agent Interface

Nonetheless, even the best model cannot work around environment issues that prevent the agent from performing certain actions, such as selecting from dropdown menus. At a high level, these types of errors stem from the interface between the environment and the agent. For instance, fixing the dropdown issue required fixing a bug in the environment, while the typing problem required adapting the agent's action itself. This is similar to the findings reported by SWE-agent, where one of the key advantages turned out to be adapting the action space to provide LLM-friendly commands. Many of the environments that agents currently operate in were built with humans in mind (e.g. the web, standard APIs), not agents. Re-thinking and adapting these environments for agents will be crucial moving forward and will likely lead to significant improvements in agent performance.

Algorithmic Advances

Given a base LLM, reasoning algorithms determine how the agent interacts with the environment. The core aspect is determining and planning the sequence of actions that achieves the goal. This is done either on the fly through approaches like ReAct, or by planning first and executing later. Additionally, agents can use approaches like reflection to revise their own plans when they fail or turn out to be suboptimal.

Error Detection and Diagnostics

The final important ingredient in performance improvement is error detection. Without it, it is often difficult to figure out what is even going wrong with an agent and what we can improve, especially once agents operate at scale. This is also an extremely difficult problem: the examples in this blog post already show how diverse the set of errors is, and production-scale systems will exacerbate this further as the amount and diversity of agent errors increases. Nonetheless, agent errors like looping or hallucination can often be reliably detected and used for debugging. To help with this, Invariant Labs is building analysis tools to automatically catch as many bugs as possible and detect agent vulnerabilities and failures early on, so that they can be fixed before they cause any harm.

Conclusion

AI agents are a fascinating emerging technology on the verge of being deployed widely across many domains. In this blog post, we examined some of the common errors that these agents still face and discussed several possible solutions that ultimately lead to significant improvements in agent performance. Of course, this is not yet a complete picture of what it takes to build better agents, and there are many more ways to improve them. If you are working with AI agents and facing similar problems, do not hesitate to get in touch with us at [email protected] - we would love to learn more about what you are building!

Authors: Mislav Balunović