FutureSearch Benchmarks

If you would like to run an LLM or agent on our benchmarks or would like to see a model added to the leaderboard, please contact us at evals@futuresearch.ai.

Deep Research Bench (DRB)

Deep Research Bench (DRB) benchmarks how well LLM agents do research on the web. Each of its diverse, real-world tasks provides 10-100k webpages stored offline for search and reasoning, accompanied by carefully curated answers.

Leaderboard

Scores are first averaged across all instances within each task category (shown in the radar chart). The per-category averages are then aggregated across all task categories to produce the overall score (shown in the leaderboard table).

Methods

Paper

We published a paper on an earlier version of DRB (we refer to the paper version as "DRB-250506"). The paper includes a more detailed overview of the benchmark and its methodology, as well as an evaluation of several web research tools such as OpenAI Deep Research, Perplexity, and Claude Research.

Note: Results between the paper version and the continuously updated version presented on this page (referred to simply as "DRB") are not directly comparable.

Task instances: The paper version DRB-250506 had 89 task instances. We have since updated a few task instances, so the current set of instances differs from the paper version.

Agent architecture: We improved the architecture powering the agents, making them more robust and reliable.

RetroSearch

DRB uses RetroSearch, a system designed to serve agents a frozen, previously scraped version of the internet instead of the live pages.

RetroSearch aims to emulate Google search (specifically, the Serper search API) as closely as possible, so as to minimize differences between live and "retro" agent runs. A single RetroSearch query proceeds through the following steps:

  • Run a live Serper search for the query
  • Look up pages obtained from live search in the RetroSearch database and other archive sources
  • If the page is not found in the RetroSearch database, remove it from the results
  • Write new snippets from a sample of page content using a simple LLM
  • Return the results in the original format of the Google results

This approach gives agents a search experience consistent with real search, but backed exclusively by pages for which we have a frozen copy. The following diagram from the paper illustrates the process:

Diagram showing how RetroSearch provides frozen web snapshots to agents
Illustration of the system architecture of Deep Research Bench using RetroSearch. This shows the flow from task definition through the scraping pipeline that populates the RetroSearch database prior to running the benchmark, and then how agents use RetroSearch via an API at the time of task evaluation.
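
To make the steps above concrete, here is a minimal Python sketch of a retro search layer following the same design. All names here (serper_search, lookup_frozen_page, write_snippet) are hypothetical stand-ins, not the actual RetroSearch API.

```python
# Hypothetical sketch of the RetroSearch query flow described above.
# The helper functions are stand-ins, not the real RetroSearch implementation.

from dataclasses import dataclass


@dataclass
class SearchResult:
    url: str
    title: str
    snippet: str


def serper_search(query: str) -> list[SearchResult]:
    """Stand-in for a live Serper (Google) search call."""
    raise NotImplementedError("replace with a real Serper API call")


def lookup_frozen_page(url: str) -> str | None:
    """Stand-in for looking up a URL in the frozen RetroSearch database
    (or other archive sources); returns page text, or None if absent."""
    raise NotImplementedError("replace with a database/archive lookup")


def write_snippet(page_text: str, query: str) -> str:
    """Stand-in for generating a fresh snippet from stored page content
    with a simple LLM."""
    raise NotImplementedError("replace with an LLM snippet generator")


def retro_search(query: str) -> list[SearchResult]:
    """Emulate a Google-style search backed only by frozen pages."""
    results = []
    for live in serper_search(query):            # 1. run a live Serper search
        page = lookup_frozen_page(live.url)      # 2. look up the frozen copy
        if page is None:                         # 3. drop pages we do not have
            continue
        snippet = write_snippet(page, query)     # 4. write a new snippet
        results.append(SearchResult(live.url, live.title, snippet))
    return results                               # 5. Google-format results
```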

Task instances

Our task instances span 8 task categories, each designed to test a different research capability. The following overview from the paper gives a description and an example for each task type:

Find Number

Find a reliable, known number on the internet.

Example: The total number of FDA Class II Product Recalls of medical devices.

Find Dataset

Find links to datasets relevant to a given query.

Example: How many IPOs with an offer price of at least $5.00 were there in each year between 1980 and 2024?

Find Original Source

Find the original source of a given claim.

Example: From <LINK>, more than 8 out of 1000 users clicked on a phishing link monthly in 2024, up 190% vs 2023.

Validate Claim

Estimate the probability that a given claim on the internet is true.

Example: The median energy usage of ChatGPT queries is at least 10× greater than Google searches.

Derive Number

Derive a number not known on the internet, but derivable from known information.

Example: How many IM and GM account closures did chess.com report for 2024?

Gather Evidence

Identify key pieces of evidence relevant to a given query.

Example: What is the difficulty of the problems in the FrontierMath benchmark?

Populate Reference Class

Compile a list of instances that fit a given description.

Example: List functioning satellites impacted by accidental high-speed collisions in space.

Compile Dataset

Compile a dataset based on a description of the desired data and required columns.

Example: Software developer jobs in the US from 2019-2023 with columns: year, number, source, url, percent change.

Scoring

For every task instance, agents receive a score between 0 and 1. Scoring differs slightly between task categories (see the table below). Scores are first averaged across all instances within each task category, and then averaged again across all task categories to get the overall score.
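
For concreteness, this two-stage aggregation can be sketched as follows; the category names and per-instance scores below are made up for illustration.

```python
# Minimal sketch of the two-stage score aggregation described above:
# average per-instance scores within each category, then average the
# per-category means to get the overall score. Example data is made up.

from statistics import mean

# Hypothetical per-instance scores (0 to 1), grouped by task category.
scores_by_category = {
    "Find Number": [1.0, 0.0, 1.0],
    "Validate Claim": [0.8, 0.6],
    "Compile Dataset": [0.5, 0.7, 0.9, 0.4],
}

category_means = {cat: mean(s) for cat, s in scores_by_category.items()}
overall_score = mean(category_means.values())

print(category_means)   # per-category averages (radar chart)
print(overall_score)    # overall leaderboard score
```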

The following table from the paper gives a more detailed overview of the scoring method for each task category:

Compile Dataset
Scoring method: Precision, Recall, F1
Success criteria: Comparison of the rows of the dataset returned by the agent with the ground truth

Derive Number
Scoring method: 0/1 binary score
Success criteria: Number is correct (or, for some tasks, within a reasonable range)

Find Dataset
Scoring method: Recall
Success criteria: Proportion of URLs in the list of required dataset(s) found

Find Number
Scoring method: 0/1 binary score
Success criteria: (1) number correct, (2) number backed up by excerpt from URL, (3) source at least as reliable as ground truth source

Find Original Source
Scoring method: 0/1 binary score
Success criteria: URL is in a list of permissible ground-truth URLs

Gather Evidence
Scoring method: Recall
Success criteria: Comparison with a minimal list of evidence items to be found

Populate Reference Class
Scoring method: Precision, Recall, F1 (sometimes only Recall)
Success criteria: Comparison of agent list with ground truth list

Validate Claim
Scoring method: Absolute difference in assigned probability
Success criteria: Comparison of agent assessment with human researcher assessment
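
Several categories above compare an agent-produced list against a ground truth list using precision, recall, and F1. The snippet below is a generic sketch of that computation; the simple set intersection is an assumption, since how DRB actually matches agent items to ground truth items (e.g. fuzzy matching of dataset rows) is not specified here.

```python
# Generic precision/recall/F1 over sets of items, as used conceptually for
# list-comparison tasks (Compile Dataset, Populate Reference Class).
# Exact-string matching is an illustrative simplification.

def precision_recall_f1(agent_items: set[str], truth_items: set[str]) -> tuple[float, float, float]:
    matched = agent_items & truth_items
    precision = len(matched) / len(agent_items) if agent_items else 0.0
    recall = len(matched) / len(truth_items) if truth_items else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Example with illustrative items for a Populate Reference Class task.
agent = {"Iridium 33", "Cosmos 2251", "Sentinel-1A"}
truth = {"Iridium 33", "Sentinel-1A", "Yunhai 1-02"}
print(precision_recall_f1(agent, truth))  # roughly (0.667, 0.667, 0.667)
```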

Bench to the Future (BTF)

Bench to the Future (BTF) is a benchmark designed to evaluate the ability of LLM agents to make predictions on messy, real-world forecasting questions. For full details, see the Bench to the Future paper.

Methods

Forecasting is a challenging task that offers a clearly measurable way to study AI systems. It requires a large amount of research on the internet, as well as good judgement to weigh and interpret available evidence.

Bench to the Future (BTF) is a "pastcasting" benchmark with hundreds of questions whose resolution is already known. Each question is accompanied by a large offline corpus of tens of thousands of relevant web pages, enabling realistic "forecasts" on past events to be elicited from LLMs.
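
As a rough illustration of what scoring a pastcast could look like, the sketch below represents a question with a known binary resolution and evaluates a probability forecast with a Brier score. The field names and the choice of Brier score are assumptions made for illustration, not necessarily the data format or scoring rule used in BTF.

```python
# Hypothetical sketch of a "pastcasting" question: a forecasting question
# whose resolution is already known, served with an offline web corpus.
# Field names and the Brier-score evaluation are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class PastcastQuestion:
    text: str           # e.g. "Will X happen by 2023-06-30?"
    corpus_path: str    # offline snapshot of relevant web pages
    resolution: bool    # known outcome, hidden from the agent


def brier_score(forecast_prob: float, resolution: bool) -> float:
    """Squared error between the forecast probability and the outcome."""
    outcome = 1.0 if resolution else 0.0
    return (forecast_prob - outcome) ** 2


question = PastcastQuestion("Will X happen by 2023-06-30?", "corpora/q001/", True)
print(brier_score(0.7, question.resolution))  # 0.09
```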

We invite researchers to contact us at hello@futuresearch.ai to utilize our benchmark or tooling for their own research.