AI Research Agent Capabilities Evaluated

Source: unite.ai

Published on June 3, 2025

Updated on June 3, 2025

As large language models (LLMs) continue to advance, their potential as research assistants is becoming increasingly clear. These models are now capable of handling "deep research" tasks, which require complex reasoning, evaluating conflicting information, and synthesizing data from various sources. A new report by FutureSearch, titled Deep Research Bench (DRB): Evaluating Web Research Agents, provides the most comprehensive evaluation of these capabilities to date, highlighting both their strengths and limitations.

The Deep Research Bench is a meticulously designed benchmark that assesses AI agents' performance on multi-step, web-based research tasks. It comprises 89 distinct tasks across 8 categories, each built to mimic the real-world challenges faced by analysts, policymakers, and researchers. Every task comes with a human-verified answer and is evaluated against a frozen dataset of web pages, ensuring consistency across model evaluations.
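
To make the setup concrete, the sketch below shows one hypothetical way a single benchmark task record could be represented in code. The field names and the toy exact-match scorer are illustrative assumptions, not the report's actual schema or grading method.

```python
# Hypothetical representation of one Deep Research Bench-style task.
# Field names are illustrative, not the benchmark's real schema.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResearchTask:
    task_id: str
    category: str               # one of the 8 task categories, e.g. "Gather Evidence"
    prompt: str                 # the research question posed to the agent
    verified_answer: str        # human-verified reference answer used for grading
    frozen_pages: List[str] = field(default_factory=list)  # URLs from the frozen web snapshot


def exact_match_score(predicted: str, task: ResearchTask) -> float:
    """Toy scorer: 1.0 for a normalized exact match, 0.0 otherwise.
    The real benchmark grades answers in a more nuanced way."""
    return 1.0 if predicted.strip().lower() == task.verified_answer.strip().lower() else 0.0
```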

The ReAct Architecture

At the core of the Deep Research Bench is the ReAct architecture, which stands for "Reason + Act." This method mimics the approach of a human researcher by thinking through a task, performing actions like web searches, observing results, and deciding whether to iterate or conclude. While earlier models explicitly followed this loop, newer "thinking" models often streamline the process, integrating reasoning more fluidly into their actions.
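
The loop below is a minimal sketch of that Reason + Act cycle, assuming a hypothetical propose_step wrapper around the language model and a dictionary of tool callables; it illustrates the pattern rather than the benchmark's actual agent implementation.

```python
# Minimal ReAct-style control loop (sketch). propose_step and the tool
# callables are hypothetical stand-ins for a real LLM wrapper and search tools.
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Step:
    thought: str        # the agent's reasoning for this step
    action: str         # e.g. "search", "fetch_page", or "finish"
    action_input: str   # query, URL, or the final answer when action == "finish"


def react_loop(
    propose_step: Callable[[str], Step],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 10,
) -> Optional[str]:
    """Alternate reasoning and acting until the agent commits to an answer."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        # Reason: the model decides what to do next given everything so far.
        step = propose_step(transcript)
        transcript += f"\nThought: {step.thought}"

        if step.action == "finish":
            # Conclude: the agent is confident enough to stop and answer.
            return step.action_input

        # Act and observe: run the chosen tool and append the result.
        observation = tools[step.action](step.action_input)
        transcript += f"\nAction: {step.action}[{step.action_input}]"
        transcript += f"\nObservation: {observation}"

    return None  # Ran out of steps without committing to an answer.
```

A "thinking" model folds much of the explicit Thought text into its internal reasoning, but the act-and-observe structure of the loop stays the same.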

To maintain consistency in evaluations, the DRB introduces RetroSearch—a static version of the web. Instead of relying on the constantly changing live internet, agents use a curated archive of over 189,000 web pages, frozen in time. This ensures a fair and replicable testing environment, especially for high-complexity tasks like "Gather Evidence."
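
A frozen archive slots into that loop as just another search tool. The sketch below assumes a simple JSON archive and naive keyword scoring purely for illustration; the article does not describe RetroSearch's actual storage or retrieval mechanics.

```python
# Hypothetical "frozen web" search in the spirit of RetroSearch: queries are
# answered from a fixed local archive, so every run sees identical pages.
import json
from typing import Dict, List


def load_archive(path: str) -> List[Dict[str, str]]:
    """Each record holds a URL, title, and page text captured at freeze time."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def retro_search(archive: List[Dict[str, str]], query: str, top_k: int = 5) -> List[Dict[str, str]]:
    """Rank archived pages by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    scored = [
        (sum(term in page["text"].lower() for term in terms), page)
        for page in archive
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [page for score, page in scored[:top_k] if score > 0]
```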

Performance Highlights

Among the models evaluated, OpenAI’s o3 emerged as the top performer, scoring 0.51 out of a possible 1.0 on the Deep Research Bench. That score is stronger than it may sound: given the ambiguity in task definitions and scoring, even a flawless agent would likely top out around 0.8. Other notable performers include Anthropic’s Claude 3.7 Sonnet, Google’s Gemini 2.5 Pro, and the open-weight DeepSeek-R1, which narrowed the performance gap between open and closed models.

Newer, "thinking-enabled" models consistently outperformed their predecessors, with closed-source models maintaining a notable edge over open-weight alternatives. However, even the best models today still fall short of well-informed human researchers, particularly in tasks requiring strategic planning and nuanced reasoning.

Failure Patterns

The report also highlights common failure patterns among AI agents. One significant issue is "forgetfulness," where models lose track of key details as the context window stretches. Other failures include repetitive tool use, poor query crafting, and premature conclusions. These issues underscore the need for further improvements in AI agents' ability to maintain context and reason effectively over extended tasks.

Toolless Agents

Interestingly, the DRB also evaluated "toolless" agents—models operating without access to external tools like web search or document retrieval. These agents rely solely on their internal training data and memory. Surprisingly, they performed almost as well as full research agents on certain tasks, such as assessing the plausibility of claims.

However, on more demanding tasks like "Derive Number" or "Gather Evidence," toolless models struggled, lacking the ability to access up-to-date information. This highlights the importance of tool-augmented agents for deep research, which requires reasoning with verifiable, current data.

Conclusion

The Deep Research Bench report provides valuable insights into the current state of AI research agents. While these models show impressive capabilities, they still lag behind skilled human researchers in complex tasks. As LLMs continue to evolve, tools like the DRB will be essential for assessing not just what these systems know, but how well they perform in real-world research scenarios.