Leaderboard
Model | Score |
---|---|
Shown are scores that are first averaged across all instances within each task category (radar chart to the right). The per-category scores are then averaged across all task categories to get the overall score (table to the left).
If you would like to run an LLM or agent on our benchmarks or would like to see a model added to the leaderboard, please contact us at evals@futuresearch.ai.
Deep Research Bench (DRB) benchmarks how well LLM agents do research on the web. Each of its diverse, real-world tasks provides 10-100k webpages stored offline for search and reasoning, accompanied by carefully curated answers.
We published a paper on an earlier version of DRB (we refer to the paper version as "DRB-250506"). The paper includes a more detailed overview of the benchmark and its methodology, as well as an evaluation of several web research tools such as OpenAI Deep Research, Perplexity, and Claude Research.
Note: Results between the paper version and the continuously updated version presented on this page (referred to simply as "DRB") are not directly comparable.
Task instances: The paper version DRB-250506 had 89 task instances. We have since updated a few task instances, so the current set differs slightly from the paper version.
Agent architecture: We improved the architecture powering the agents, making them more robust and reliable.
DRB uses RetroSearch, a system designed to serve agents a frozen, previously scraped version of the internet instead of the live pages.
RetroSearch aims to emulate Google search (specifically, the Serper search API) as closely as possible, so as to minimize differences between live and "retro" agent runs.
This approach gives agents a search experience consistent with real search, but backed exclusively by pages for which we have a frozen copy; a diagram in the paper illustrates the step-by-step process a single RetroSearch query goes through.
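For intuition, here is a minimal sketch of what serving search results from a frozen corpus can look like. The class and field names, and the naive keyword ranking, are illustrative assumptions, not the actual RetroSearch implementation:

```python
from dataclasses import dataclass

@dataclass
class FrozenPage:
    url: str
    title: str
    snippet: str
    html: str  # page content scraped before the corpus was frozen

class RetroSearch:
    """Sketch: answer Serper-style queries exclusively from a frozen page store."""

    def __init__(self, frozen_pages: dict[str, FrozenPage]):
        # Offline corpus: URL -> previously scraped page.
        self.frozen_pages = frozen_pages

    def search(self, query: str, num_results: int = 10) -> list[dict]:
        # Rank frozen pages for the query. Naive keyword-overlap ranking is a
        # stand-in for whatever ranking the real system uses.
        terms = query.lower().split()
        scored = []
        for page in self.frozen_pages.values():
            text = f"{page.title} {page.snippet}".lower()
            hits = sum(text.count(term) for term in terms)
            if hits > 0:
                scored.append((hits, page))
        scored.sort(key=lambda pair: pair[0], reverse=True)

        # Return results in a Serper-like shape; every link points to a page
        # that exists in the offline corpus.
        return [
            {"title": page.title, "link": page.url, "snippet": page.snippet}
            for _, page in scored[:num_results]
        ]
```

The key property is that every result links only to a page present in the offline corpus, so the agent can never browse outside the frozen snapshot.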
Task instances are spread across 8 task categories, each of which is designed to test a different research capability. The following table from the paper gives an overview of the task categories:
Task Type | Description and Example |
---|---|
Find Number | Find a reliable, known number on the internet. Example: The total number of FDA Class II Product Recalls of medical devices. |
Find Dataset | Find links to datasets relevant to a given query. Example: How many IPOs with an offer price of at least $5.00 were there in each year between 1980 and 2024? |
Find Original Source | Find the original source of a given claim. Example: From <LINK>, more than 8 out of 1000 users clicked on a phishing link monthly in 2024, up 190% vs 2023. |
Validate Claim | Estimate the probability that a given claim on the internet is true. Example: The median energy usage of ChatGPT queries is at least 10× greater than Google searches. |
Derive Number | Derive a number not known on the internet, but derivable from known information. Example: How many IM and GM account closures did chess.com report for 2024? |
Gather Evidence | Identify key pieces of evidence relevant to a given query. Example: What is the difficulty of the problems in the FrontierMath benchmark? |
Populate Reference Class | Compile a list of instances that fit a given description. Example: List functioning satellites impacted by accidental high-speed collisions in space. |
Compile Dataset | Compile a dataset based on a description of the desired data and required columns. Example: Software developer jobs in the US from 2019-2023 with columns: year, number, source, url, percent change. |
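To make the task format concrete, a task instance can be thought of as a prompt plus a frozen corpus plus a curated answer. The schema below is a hypothetical illustration, not DRB's actual data format:

```python
from dataclasses import dataclass

@dataclass
class TaskInstance:
    # Hypothetical schema, for illustration only.
    category: str       # one of the 8 task categories, e.g. "Find Number"
    prompt: str         # the research question given to the agent
    corpus_id: str      # identifies the frozen web corpus (10-100k pages) for this instance
    ground_truth: dict  # carefully curated answer; its shape depends on the category

example = TaskInstance(
    category="Find Number",
    prompt="The total number of FDA Class II Product Recalls of medical devices.",
    corpus_id="example-corpus-id",   # placeholder, not a real corpus identifier
    ground_truth={"value": None},    # placeholder; real instances hold the curated answer
)
```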
For every task instance, agents receive a score between 0 and 1. Scoring is slightly different for each task category (see Details). Scores are first averaged across all instances per task category, and then averaged again across all tasks to get the overall score.
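A minimal sketch of this two-stage aggregation, assuming per-instance scores are available as (category, score) pairs:

```python
from collections import defaultdict

def overall_score(instance_scores: list[tuple[str, float]]) -> float:
    """Two-stage aggregation: mean per category, then mean of the category means.

    `instance_scores` is a list of (task_category, score) pairs with scores in [0, 1].
    """
    by_category: dict[str, list[float]] = defaultdict(list)
    for category, score in instance_scores:
        by_category[category].append(score)

    # First average within each task category...
    category_means = {c: sum(s) / len(s) for c, s in by_category.items()}
    # ...then average the category means to get the overall score.
    return sum(category_means.values()) / len(category_means)
```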
The following table from the paper gives a more detailed overview of the scoring method for each task category:
Task | Scoring Method | Success Criteria |
---|---|---|
Compile Dataset | Precision, Recall, F1 | Comparison of the rows of the dataset returned by the agent with the ground truth |
Derive Number | 0/1 Binary Score | Number is correct (or for some tasks, number is within a reasonable range) |
Find Dataset | Recall | Proportion of URLs in the list of required dataset(s) found |
Find Number | 0/1 Binary Score | 1. Number correct, 2. Number backed up by excerpt from URL, 3. Source at least as reliable as ground truth source |
Find Original Source | 0/1 Binary Score | URL is in a list of permissible truth URLs |
Gather Evidence | Recall | Comparison with a minimal list of evidence items to be found |
Populate Reference Class | Precision, Recall, F1 (sometimes only Recall) | Comparison of agent list with ground truth list |
Validate Claim | Absolute Difference in Assigned Probability | Comparison of agent assessment with human researcher assessment |
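The snippets below illustrate three of the scoring shapes from the table in simplified form. The benchmark's actual matching of rows, items, and sources against ground truth is more involved, and the exact conversion of the Validate Claim probability difference into a 0-1 score is our assumption, so treat these as sketches rather than the actual graders:

```python
def precision_recall_f1(predicted: set[str], truth: set[str]) -> tuple[float, float, float]:
    # Simplified list-comparison scoring (Compile Dataset, Populate Reference Class):
    # exact set matching stands in for the benchmark's richer row/item matching.
    true_pos = len(predicted & truth)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def binary_number_score(answer: float, truth: float, tolerance: float = 0.0) -> float:
    # 0/1 scoring (Find Number, Derive Number); some tasks accept a reasonable
    # range, modeled here by a tolerance.
    return 1.0 if abs(answer - truth) <= tolerance else 0.0

def validate_claim_score(agent_prob: float, human_prob: float) -> float:
    # Validate Claim: absolute difference in assigned probability. Mapping it to
    # a score in [0, 1] via 1 - |difference| is an assumption made here.
    return 1.0 - abs(agent_prob - human_prob)
```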
Bench to the Future (BTF) is a benchmark designed to evaluate the ability of LLM agents to make predictions on messy, real-world forecasting questions. For full details, see the Bench to the Future paper.
Forecasting is a challenging task that offers a clearly measurable way to study AI systems. It requires a large amount of research on the internet, as well as good judgement to weigh and interpret available evidence.
Bench to the Future (BTF) is a "pastcasting" benchmark with hundreds of questions for which the resolution is already known. Each question is accompanied by a large offline corpus of tens of thousands of relevant web pages, enabling realistic "forecasts" on past events to be elicited from LLMs.
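As a rough sketch of the pastcasting setup: an agent that can only search the question's frozen corpus returns a probability, which is then compared against the known resolution. The Brier score below is a standard forecasting metric used purely for illustration, and `forecast_fn` and the question fields are hypothetical; see the BTF paper for the benchmark's actual questions and scoring.

```python
def brier_score(forecast_prob: float, resolved_yes: bool) -> float:
    # Standard Brier score: 0 is best, 1 is worst. Illustrative metric only;
    # see the BTF paper for the benchmark's actual scoring.
    outcome = 1.0 if resolved_yes else 0.0
    return (forecast_prob - outcome) ** 2

def evaluate_pastcasts(questions: list[dict], forecast_fn) -> float:
    # `forecast_fn(question_text, corpus)` is a hypothetical callable that runs
    # an agent restricted to the question's frozen web corpus and returns P(yes).
    # Each question dict is assumed to carry "text", "corpus", and "resolved_yes".
    scores = [
        brier_score(forecast_fn(q["text"], q["corpus"]), q["resolved_yes"])
        for q in questions
    ]
    return sum(scores) / len(scores)
```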
We invite researchers to contact us at hello@futuresearch.ai to utilize our benchmark or tooling for their own research.