FutureSearch Benchmarks

Agent Leaderboard

Model	Score	Cost ($)	Est. Runtime (s)

Scores are first averaged across all instances within each task category (radar chart to the right). The per-category averages are then averaged across all task categories to give the overall score (table to the left). Runtime is estimated as (number of ReAct steps + 1) × average time per step for a given model. We report this estimate rather than wall-clock time, because wall-clock time is affected by our internal queue depths, provider TPM limits, deployments, and other operational factors.
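As a minimal sketch of the runtime estimate described above: the function name and the example numbers below are hypothetical and chosen only for illustration; only the formula itself comes from the text.

```python
def estimated_runtime_seconds(react_steps: int, avg_seconds_per_step: float) -> float:
    """Estimate runtime as (number of ReAct steps + 1) x average time per step."""
    if react_steps < 0 or avg_seconds_per_step < 0:
        raise ValueError("inputs must be non-negative")
    return (react_steps + 1) * avg_seconds_per_step


# Hypothetical example: 12 ReAct steps at an average of 8.5 s per step.
example_runtime = estimated_runtime_seconds(12, 8.5)
print(example_runtime)  # 110.5
```

Because the estimate depends only on the step count and a per-model average, two runs with the same number of steps get the same estimated runtime regardless of queueing or rate limits on the day they ran.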

Best Accuracy per Dollar on DRB

Best Accuracy per Second on DRB

Commercial AI products

The following heatmap shows the performance of commercial AI products on Deep Research Bench.

Methods

Paper

We published a paper based on an earlier version of DRB (we refer to the paper version as "DRB-250506"). The paper gives a more detailed overview of the benchmark and its methodology, and evaluates several web research tools, including OpenAI Deep Research, Perplexity, and Claude Research.

Note: Results from the paper version and from the continuously updated version presented on this page (referred to simply as "DRB") are not directly comparable, for the following reasons:

Task Instances: The paper version DRB-250506 had 89 task instances. We have since updated a few task instances and now have 0 instances.

Agent architecture: We improved the architecture powering the agents, making them more robust and reliable.

RetroSearch

DRB uses RetroSearch, a system designed to serve agents a frozen, previously scraped version of the internet instead of the live pages.

RetroSearch aims to emulate Google search (specifically, the Serper search API) as closely as possible, so as to minimize differences between live and "retro" agent runs. A single RetroSearch query proceeds through the following steps:

Run a live Serper search for the query
Look up pages obtained from live search in the RetroSearch database and other archive sources
If the page is not found in the RetroSearch database, remove it from the results
Write new snippets from a sample of page content using a simple LLM
Return the results in the original format of the Google results

This approach gives agents a search experience that is consistent with real search, but backed exclusively by pages for which we have a frozen version. The following diagram from the paper illustrates the process:

Figure: System architecture of Deep Research Bench using RetroSearch. The diagram shows the flow from task definition through the scraping pipeline that populates the RetroSearch database before the benchmark is run, and then how agents use RetroSearch via an API at task evaluation time.
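The five RetroSearch steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: the function retro_search, its three injected callables, and the result fields "title", "link", and "snippet" are all hypothetical stand-ins for the live Serper call, the frozen-page lookup, and the LLM snippet writer.

```python
from typing import Callable, Dict, List, Optional


def retro_search(
    query: str,
    live_search: Callable[[str], List[Dict[str, str]]],
    lookup_frozen_page: Callable[[str], Optional[str]],
    write_snippet: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Return live search results restricted to pages that exist in the frozen store."""
    retro_results = []
    for result in live_search(query):                    # step 1: live search
        page_text = lookup_frozen_page(result["link"])   # step 2: look up frozen copy
        if page_text is None:                            # step 3: drop pages not found
            continue
        rewritten = dict(result)                         # step 5: keep the original fields
        rewritten["snippet"] = write_snippet(query, page_text)  # step 4: new snippet
        retro_results.append(rewritten)
    return retro_results


# Hypothetical stubs so the sketch runs without network access or an LLM.
FROZEN_PAGES = {"https://example.com/a": "Frozen text of page A."}


def fake_live_search(query: str) -> List[Dict[str, str]]:
    return [
        {"title": "Page A", "link": "https://example.com/a", "snippet": "live snippet A"},
        {"title": "Page B", "link": "https://example.com/b", "snippet": "live snippet B"},
    ]


def fake_write_snippet(query: str, page_text: str) -> str:
    # Stand-in for the LLM: just take the start of the frozen page text.
    return page_text[:40]


results = retro_search("example query", fake_live_search, FROZEN_PAGES.get, fake_write_snippet)
print(results)
```

In this stubbed run, Page B is dropped because it has no frozen copy, and Page A keeps its title and link but receives a snippet derived from the frozen text rather than from the live page.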

Task instances

We have 0 task instances across 8 task categories, each of which is designed to test a different research capability. The following table from the paper gives an overview of the task categories:

Task Type	Description and Example
Find Number	Find a reliable, known number on the internet. Example: The total number of FDA Class II Product Recalls of medical devices.
Find Dataset	Find links to datasets relevant to a given query. Example: How many IPOs with an offer price of at least $5.00 were there in each year between 1980 and 2024?
Find Original Source	Find the original source of a given claim. Example: From <LINK>, more than 8 out of 1000 users clicked on a phishing link monthly in 2024, up 190% vs 2023.
Validate Claim	Estimate the probability that a given claim on the internet is true. Example: The median energy usage of ChatGPT queries is at least 10× greater than Google searches.
Derive Number	Derive a number not known on the internet, but derivable from known information. Example: How many IM and GM account closures did chess.com report for 2024?
Gather Evidence	Identify key pieces of evidence relevant to a given query. Example: What is the difficulty of the problems in the FrontierMath benchmark?
Populate Reference Class	Compile a list of instances that fit a given description. Example: List functioning satellites impacted by accidental high-speed collisions in space.
Compile Dataset	Compile a dataset based on a description of the desired data and required columns. Example: Software developer jobs in the US from 2019-2023 with columns: year, number, source, url, percent change.

Scoring

For every task instance, agents receive a score between 0 and 1. Scoring differs slightly between task categories (see the table below). Scores are first averaged across all instances within each task category, and the per-category averages are then averaged across all task categories to get the overall score.
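The two-stage averaging can be sketched as below. The function name and the example scores are hypothetical; the point is only that every task category carries equal weight in the overall score, regardless of how many instances it contains.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(instance_scores):
    """Average per task category first, then average the category means.

    instance_scores: iterable of (task_category, score) pairs, score in [0, 1].
    Returns (per_category_means, overall_score).
    """
    by_category = defaultdict(list)
    for category, score in instance_scores:
        by_category[category].append(score)
    per_category = {category: mean(scores) for category, scores in by_category.items()}
    overall = mean(list(per_category.values()))
    return per_category, overall


# Hypothetical scores: four instances in one category, one in another.
example_scores = [
    ("Find Number", 1.0),
    ("Find Number", 0.0),
    ("Find Number", 1.0),
    ("Find Number", 1.0),
    ("Validate Claim", 0.5),
]
per_category, overall = aggregate_scores(example_scores)
print(per_category)  # {'Find Number': 0.75, 'Validate Claim': 0.5}
print(overall)       # 0.625
```

Note that pooling all five instances directly would give 0.7 instead of 0.625, so the order of averaging matters when categories have different numbers of instances.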

The following table from the paper gives a more detailed overview of the scoring method for each task category:

Task	Scoring Method	Success Criteria
Compile Dataset	Precision, Recall, F1	Comparison of the rows of the dataset returned by the agent with the ground truth
Derive Number	0/1 Binary Score	Number is correct (or for some tasks, number is within a reasonable range)
Find Dataset	Recall	Proportion of URLs in the list of required dataset(s) found
Find Number	0/1 Binary Score	1. Number correct, 2. Number backed up by excerpt from URL, 3. Source at least as reliable as ground truth source
Find Original Source	0/1 Binary Score	URL is in a list of permissible truth URLs
Gather Evidence	Recall	Comparison with a minimal list of evidence items to be found
Populate Reference Class	Precision, Recall, F1 (sometimes only Recall)	Comparison of agent list with ground truth list
Validate Claim	Absolute Difference in Assigned Probability	Comparison of agent assessment with human researcher assessment
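For the set-based metrics in the table, a minimal sketch follows. It assumes that agent outputs and ground truth can be compared as sets of exactly matching items (rows, URLs, or list entries); how the benchmark actually matches items is not described here, so the exact-match rule and all function names are assumptions. The table also does not say how the Validate Claim absolute difference is turned into a 0 to 1 score, so the sketch returns the raw difference only.

```python
def precision_recall_f1(predicted: set, truth: set):
    """Precision, recall, and F1 of a predicted set against a ground-truth set."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def probability_difference(agent_probability: float, human_probability: float) -> float:
    """Absolute difference between the agent's and the human researcher's probability."""
    return abs(agent_probability - human_probability)


# Hypothetical example: the agent returns four items, two of which are in the truth set.
agent_items = {"A", "B", "C", "D"}
truth_items = {"A", "B", "E"}
precision, recall, f1 = precision_recall_f1(agent_items, truth_items)
print(precision, recall, f1)

# Hypothetical example: agent says 0.75, human researcher says 0.5.
print(probability_difference(0.75, 0.5))  # 0.25
```

Categories scored by recall alone (Find Dataset, Gather Evidence) would use only the second value returned by precision_recall_f1.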
