Sierra’s new benchmark reveals how well AI agents perform at real work




Sierra, the customer experience AI startup created by OpenAI board member Bret Taylor and Google AR/VR veteran Clay Bavor, has developed a new benchmark to evaluate the performance of conversational AI agents. Called TAU-bench, it tests agents on completing complex tasks while holding multiple exchanges with LLM-simulated users to gather the required information. Early results indicate that AI agents built with simple LLM constructs such as function calling or ReAct struggle even with “relatively simple tasks,” reinforcing the view that companies need more sophisticated agent architectures.

Developers interested in examining TAU-bench’s code can download it from Sierra’s GitHub repository.

TAU-bench: What you need to know

“At Sierra, our experience in enabling real-world user-facing conversational agents has made one thing extremely clear: a robust measurement of agent performance and reliability is critical to their successful deployment. Before companies deploy an AI agent, they need to measure how well it is working in as realistic a scenario as possible,” Karthik Narasimhan, Sierra’s head of research, writes.

He claims that existing benchmarks, such as WebArena, SWE-bench and AgentBench, fall short in several key areas. Though they can reveal an agent’s high-level capabilities, they evaluate only a single round of human-agent interaction, like the following:




User: “What’s the weather like in New York today?”
AI: “Today in New York, it’s sunny with a high of 75°F (24°C) and a low of 60°F (16°C).”

This is limiting because, in real-life scenarios, an agent typically has to gather the information it needs over multiple dynamic exchanges (a minimal sketch of such a loop follows the example below):

User: “I want to book a flight.”
AI: “Certainly! Where would you like to fly from and to?”
User: “From Chicago to Miami.”
AI: “Got it. When would you like to travel?”
User: “Next Friday.”
AI: “Okay. Do you have a preference for departure time?”
… (conversation continues)
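To make that kind of multi-turn, tool-using interaction concrete, here is a rough sketch of the loop such an agent runs: alternate between a simulated user, an LLM call, and programmatic tool calls until the task is done. The `call_llm`, `search_flights` and `get_user_message` pieces are hypothetical placeholders for illustration, not part of Sierra’s released code or any particular SDK.

```python
# Minimal sketch of a multi-turn, tool-using agent loop of the kind TAU-bench
# exercises. All named functions below are hypothetical stand-ins.

def call_llm(messages, tools):
    """Hypothetical LLM call returning either a user-facing reply or a tool request."""
    raise NotImplementedError

def search_flights(origin: str, destination: str, date: str) -> list[dict]:
    """Hypothetical booking API exposed to the agent as a tool."""
    raise NotImplementedError

TOOLS = {"search_flights": search_flights}

def run_agent(get_user_message, max_turns: int = 20) -> list[dict]:
    messages = [{"role": "system",
                 "content": "You are a booking assistant. Follow airline policy."}]
    for _ in range(max_turns):
        # The simulated user supplies the next message based on the dialog so far.
        messages.append({"role": "user", "content": get_user_message(messages)})
        reply = call_llm(messages, tools=list(TOOLS))
        if reply.get("tool_call"):
            # The model asked to call a programmatic API; execute it and feed back the result.
            name, args = reply["tool_call"]["name"], reply["tool_call"]["args"]
            messages.append({"role": "tool", "content": str(TOOLS[name](**args))})
        else:
            messages.append({"role": "assistant", "content": reply["content"]})
            if reply.get("done"):  # the agent believes the task is complete
                break
    return messages
```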

Narasimhan argues that these benchmarks also focus on first-order statistics such as average performance, without providing any measurement of reliability or adaptability.

To address these issues, Sierra identified three requirements for TAU-bench. First, most real-world settings require agents to interact seamlessly with both humans and programmatic APIs over long stretches of time to gather information and solve complex problems. Second, agents must accurately follow complex policies or rules specific to the task. Finally, agents must be consistent and reliable at scale, so companies can trust how they will behave.

TAU-bench assigns agents several tasks to complete. Each task combines realistic databases and tool APIs, a domain-specific policy document dictating the required agent behavior, and an LLM-based user simulator, guided by scenario instructions, that generates realistic conversations with the agent. Each assignment evaluates the agent’s ability to follow rules, reason, retain information over long and complex contexts, and communicate in realistic conversation.
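As a rough picture of the pieces that description implies, a TAU-bench-style task could be represented by a structure like the one below. The field names are illustrative assumptions, not the benchmark’s actual schema; the real definitions are in Sierra’s GitHub repository.

```python
# Illustrative sketch of how a TAU-bench-style task might be described.
# Field names are assumptions for illustration only.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    domain: str                 # e.g. an airline or retail domain
    policy_doc: str             # domain-specific rules the agent must follow
    database: dict              # realistic backing data (users, orders, flights, ...)
    tools: dict[str, Callable]  # programmatic APIs the agent may call
    user_instruction: str       # scenario given to the LLM-simulated user
    goal_state: dict            # expected database state after a correct run
```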

Example of an airline reservation agent in Sierra’s TAU-bench. Image credit: Sierra

Key features of TAU-bench

Narasimhan outlines four main features of Sierra’s new benchmark:

  • Realistic dialog and tool use: Using generative language modeling, TAU-bench produces complex user scenarios in natural language rather than relying on complex hand-written rules.
  • Open-ended and diverse tasks: TAU-bench features rich, detailed structures, interfaces and sets of rules, allowing for the creation of tasks without simple, predefined solutions. This challenges the AI agents to handle diverse situations that they could encounter in the real world.
  • Faithful objective evaluation: The benchmark doesn’t judge the quality of the conversation itself. Instead, it evaluates the outcome, the final state after the task has been completed (see the sketch after this list). This gives an objective measure of whether the AI agent successfully achieved the goal of the task, eliminating the need for human judges or additional evaluators.
  • Modular framework: Because TAU-bench is built like a set of building blocks, it’s easy to add new elements such as domains, database entries, rules, APIs, tasks and evaluation metrics.
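A minimal sketch of that outcome-based scoring, under the same illustrative `Task` structure as above: compare the final database state to an annotated goal state rather than rating the dialog. The comparison below is a simplification, not Sierra’s actual evaluation code.

```python
# Outcome-based scoring sketch: the episode passes only if the final database
# matches the expected goal state for every table touched by the task.

def evaluate(final_db: dict, task: "Task") -> bool:
    for table, expected_rows in task.goal_state.items():
        if final_db.get(table) != expected_rows:
            return False
    return True
```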

How do models fare under this metric?

Sierra tested TAU-bench on 12 popular LLMs from OpenAI, Anthropic (Claude 3.5 Sonnet was not included), Google and Mistral. It found that all of them had difficulty solving tasks: even the best-performing agent, built on OpenAI’s GPT-4o, achieved a success rate below 50 percent averaged across two domains.

A chart outlining how 12 popular LLMs performed under TAU-bench. Image credit: Sierra

In addition, all the tested agents performed “extremely poorly” on reliability and were “unable to consistently solve the exact same task when the episode is re-run.”
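One way to make that notion of reliability concrete is to re-run each task several times and credit the agent only if every run succeeds. The sketch below assumes a hypothetical `run_episode` driver that plays one full simulated conversation and returns the outcome-based score; it is not Sierra’s actual scoring code.

```python
# Sketch of a re-run consistency check: an agent "passes" a task only if it
# succeeds on all k independent runs of the same episode.

def run_episode(agent, task) -> bool:
    """Hypothetical driver: run one simulated conversation, evaluate the end state."""
    raise NotImplementedError

def consistent_pass(agent, task, k: int = 8) -> bool:
    return all(run_episode(agent, task) for _ in range(k))

def reliability(agent, tasks, k: int = 8) -> float:
    """Fraction of tasks the agent solves on every one of k re-runs."""
    return sum(consistent_pass(agent, t, k) for t in tasks) / len(tasks)
```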

All this leads Narasimhan to conclude that more advanced LLMs are needed to improve reasoning and planning, along with more complex scenarios to test them against. He also calls for new methods that make annotation easier through automated tools, and for more fine-grained evaluation metrics that probe other aspects of an agent’s behavior, such as its tone and style.


