FutureSearch benchmarks

Want to run on our benchmarks? Please contact us at evals@futuresearch.ai.

FutureSearch evaluates our research agents on the following benchmarks:

Bench to the Future 3 (BTF-3)

BTF-3 is the third edition of our pastcasting benchmark: 1,007 resolved forecasting questions (759 binary and 248 numeric), researched and forecast against a frozen web corpus. Paper and dataset to follow.

BTF-3 Leaderboard

Evaluated: June 2026

All scores are on the Brier scale; lower is better, and the best score in each column is bolded.

| Rank | Agent | Pooled score (n=1,007) | Binary (Brier, n=759) | Numeric (RPS, n=248) |
|---|---|---|---|---|
| 1 | FutureSearch SOTA* | **0.120** [0.109, 0.131] | **0.115** [0.101, 0.129] | 0.137 [0.123, 0.150] |
| 2 | Claude Fable 5 (high) | 0.126 [0.114, 0.137] | 0.124 [0.109, 0.140] | 0.130 [0.117, 0.144] |
| 3 | Claude Opus 4.8 (xhigh) | 0.127 [0.116, 0.138] | 0.126 [0.112, 0.141] | 0.130 [0.117, 0.143] |
| 4 | GPT-5.5 (agent SDK) | 0.127 [0.118, 0.136] | 0.129 [0.118, 0.140] | **0.122** [0.109, 0.135] |
| 5 | Claude Opus 4.8 (high) | 0.134 [0.123, 0.145] | 0.128 [0.114, 0.143] | 0.152 [0.138, 0.165] |
| 6 | Claude Opus 4.8 (agent SDK) | 0.135 [0.123, 0.147] | 0.132 [0.117, 0.147] | 0.145 [0.131, 0.160] |
| 7 | Claude Opus 4.6 | 0.138 [0.127, 0.150] | 0.134 [0.120, 0.149] | 0.148 [0.133, 0.163] |
| 8 | Claude Opus 4.7 | 0.140 [0.128, 0.152] | 0.134 [0.120, 0.150] | 0.157 [0.140, 0.174] |
| 9 | GPT-5.5 (high) | 0.140 [0.131, 0.150] | 0.140 [0.128, 0.152] | 0.141 [0.127, 0.154] |
| 10 | Gemini 3.5 Flash | 0.155 [0.143, 0.168] | 0.156 [0.141, 0.172] | 0.154 [0.138, 0.170] |

Binary questions are scored by the Brier score (mean squared error of the forecast probability), numeric questions by a normalized ranked probability score (RPS), which generalizes the Brier score to distributional forecasts. The pooled score is the mean across all questions.
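As a rough illustration of the scoring described above, the sketch below computes a Brier score, an RPS over ordered outcome bins, and a pooled mean. The binning of numeric questions and the exact normalization are not specified on this page; dividing by the number of bins minus one is an assumption made here so that the two-bin case reduces exactly to the Brier score. All function names are hypothetical.

```python
def brier_score(prob, outcome):
    """Squared error between a forecast probability and a 0/1 outcome."""
    return (prob - outcome) ** 2


def ranked_probability_score(bin_probs, true_bin):
    """RPS over ordered bins, divided by (K - 1) so it lies in [0, 1].

    bin_probs: forecast probability for each of K ordered bins (sums to 1).
    true_bin:  index of the bin containing the resolved value.
    The (K - 1) normalization is an assumption, not the benchmark's stated rule.
    """
    k = len(bin_probs)
    cum_forecast = 0.0
    total = 0.0
    # Compare cumulative forecast and cumulative outcome at each bin boundary.
    for i in range(k - 1):
        cum_forecast += bin_probs[i]
        cum_outcome = 1.0 if true_bin <= i else 0.0
        total += (cum_forecast - cum_outcome) ** 2
    return total / (k - 1)


def pooled_score(question_scores):
    """Mean score across all questions, binary and numeric alike."""
    return sum(question_scores) / len(question_scores)
```

With two bins, `ranked_probability_score([1 - p, p], 1)` equals `brier_score(p, 1)`, which is the sense in which RPS generalizes the Brier score.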

Brackets are 95% confidence intervals, computed by percentile bootstrap (5,000 resamples of the question set).
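A minimal sketch of a percentile bootstrap over the question set, assuming one score per question and resampling questions with replacement. The function name, the fixed seed, and the way quantile indices are picked are illustrative choices, not details taken from the benchmark.

```python
import random


def percentile_bootstrap_ci(scores, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-question scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample the question set with replacement, keeping its size.
        sample = rng.choices(scores, k=n)
        means.append(sum(sample) / n)
    means.sort()
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles of the resampled means.
    lo_idx = max(0, int(round((alpha / 2) * n_resamples)))
    hi_idx = min(n_resamples - 1, int(round((1 - alpha / 2) * n_resamples)) - 1)
    return means[lo_idx], means[hi_idx]
```

Calling `percentile_bootstrap_ci(per_question_scores)` returns a `(lower, upper)` pair comparable in form to the bracketed intervals in the leaderboard.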

* FutureSearch SOTA synthesizes forecasts from multiple FutureSearch agent runs.
† Run with an earlier version of our forecasting agent.
‡ Self-driving run via the model vendor's agent SDK (Claude Agent SDK / OpenAI Agents SDK) instead of our forecasting agent.
Claude Fable 5 (high) is missing 23 binary questions (n=736) and 3 numeric questions (n=245). Claude Opus 4.7 is missing one numeric question (n=247).

Pairwise comparisons

Paired bootstrap on pooled scores

| Row agent | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 FutureSearch SOTA | | -.005 | -.007* | -.007 | -.014*** | -.015*** | -.018*** | -.020*** | -.020*** | -.035*** |
| 2 Claude Fable 5 (high) | .005 | | -.002 | -.002 | -.008* | -.010* | -.013*** | -.015*** | -.016*** | -.031*** |
| 3 Claude Opus 4.8 (xhigh) | .007* | .002 | | .000 | -.007* | -.008* | -.011** | -.013*** | -.013** | -.028*** |
| 4 GPT-5.5 (agent SDK) | .007 | .002 | .000 | | -.007 | -.008 | -.011* | -.013** | -.013*** | -.028*** |
| 5 Claude Opus 4.8 (high) | .014*** | .008* | .007* | .007 | | -.001 | -.004 | -.006 | -.006 | -.021*** |
| 6 Claude Opus 4.8 (agent SDK) | .015*** | .010* | .008* | .008 | .001 | | -.003 | -.005 | -.005 | -.020*** |
| 7 Claude Opus 4.6 | .018*** | .013*** | .011** | .011* | .004 | .003 | | -.002 | -.002 | -.018*** |
| 8 Claude Opus 4.7 | .020*** | .015*** | .013*** | .013** | .006 | .005 | .002 | | .000 | -.016** |
| 9 GPT-5.5 (high) | .020*** | .016*** | .013** | .013*** | .006 | .005 | .002 | .000 | | -.015** |
| 10 Gemini 3.5 Flash | .035*** | .031*** | .028*** | .028*** | .021*** | .020*** | .018*** | .016** | .015** | |

Each cell is the difference in pooled score (row − column) on the questions both agents forecast; negative means the row agent is more accurate. Asterisks mark statistically significant differences (two-sided paired bootstrap: * p<.05, ** p<.01, *** p<.001); cells without asterisks are not significant.
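The comparison can be sketched as follows, assuming each agent's results are held as a mapping from question ID to score. The two-sided p-value here is twice the smaller tail fraction of resampled mean differences on either side of zero, which is one common construction; the page does not state which construction was used, and the function name is hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=5000, seed=0):
    """Paired bootstrap on the questions both agents forecast.

    scores_a, scores_b: dicts mapping question ID -> score (lower is better).
    Returns (mean difference a - b, two-sided p-value, shared question count).
    """
    shared = sorted(set(scores_a) & set(scores_b))
    diffs = [scores_a[q] - scores_b[q] for q in shared]
    n = len(diffs)
    observed = sum(diffs) / n

    rng = random.Random(seed)
    boot_means = []
    for _ in range(n_resamples):
        # Resample per-question differences so the pairing is preserved.
        sample = rng.choices(diffs, k=n)
        boot_means.append(sum(sample) / n)

    # Assumed p-value construction: twice the smaller tail mass around zero.
    frac_le = sum(m <= 0 for m in boot_means) / n_resamples
    frac_ge = sum(m >= 0 for m in boot_means) / n_resamples
    p_value = min(1.0, 2 * min(frac_le, frac_ge))
    return observed, p_value, n
```

Restricting to shared questions matters because some agents are missing a few questions, so each pair is compared only on the questions both answered.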

Deep Research Bench (DRB)

DRB benchmarks how well LLM agents do research on the web. Each of its diverse, real-world tasks provides 10-100k webpages stored offline for search and reasoning, accompanied by carefully curated answers.

DRB Leaderboard


| Agent | Score | Cost ($) | Runtime (s) |

Scores averaged first per task category (radar chart), then across all tasks (table). Runtime is estimated from ReAct steps, not wall-clock time.

Papers


Bench to the Future 2 (BTF-2)

BTF-2 evaluates agents on 1,417 hard forecasting questions. Agents research and forecast offline against a frozen 15M-document corpus. Rationales and reasoning traces are evaluated for strategic reasoning.

BTF-2 Leaderboard

Last updated: 2026-04-20

| Agent | Brier (accuracy) | Calibration | Refinement |
|---|---|---|---|
| FutureSearch Agent | 0.119 | 0.002 | 0.081 |
| Opus 4.6 Agent | 0.130 | 0.005 | 0.075 |
| Gemini 3.1 Pro Agent | 0.141 | 0.012 | 0.069 |
| GPT-5.4 Agent | 0.152 | 0.010 | 0.056 |
| Grok 4.20 Beta Agent | 0.165 | 0.003 | 0.039 |

Brier scores on 1,417 pastcasting questions (lower is better). The FutureSearch Agent is an ensemble significantly more accurate than any single frontier agent. Radar chart shows CHAMPS KNOW strategic emphasis (Borda scores, 8 of 10 dimensions).
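The page does not define the Calibration and Refinement columns. One standard reading, assumed here, is the Murphy decomposition: Brier = calibration − resolution + uncertainty, with Refinement playing the role of resolution (higher is better). The leaderboard numbers are at least consistent with this reading: Brier − Calibration + Refinement comes to roughly 0.198 to 0.201 for every agent, as expected for a shared question set. The sketch below computes the binned decomposition; the bin count and function name are illustrative.

```python
from collections import defaultdict


def murphy_decomposition(probs, outcomes, n_bins=10):
    """Binned Murphy decomposition of the Brier score for binary forecasts.

    Returns (calibration, resolution, uncertainty). The identity
    Brier = calibration - resolution + uncertainty is exact only when
    all forecasts within a bin are equal; otherwise it is approximate.
    """
    n = len(probs)
    base_rate = sum(outcomes) / n

    # Group forecasts into equal-width probability bins.
    bins = defaultdict(list)
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))

    calibration = 0.0
    resolution = 0.0
    for members in bins.values():
        m = len(members)
        mean_p = sum(p for p, _ in members) / m
        freq = sum(o for _, o in members) / m
        calibration += m * (mean_p - freq) ** 2   # forecast vs. observed frequency
        resolution += m * (freq - base_rate) ** 2  # observed frequency vs. base rate
    uncertainty = base_rate * (1 - base_rate)
    return calibration / n, resolution / n, uncertainty
```

Under this reading, a low calibration term means stated probabilities match observed frequencies, while a high refinement term means the forecasts separate questions that resolve yes from those that resolve no.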

Papers

Datasets

Evaluating forecasts on S&P 500 stock returns


On August 5, 2025, FutureSearch research agents forecast revenue, margin, and shareholder payout for each S&P 500 company through 2035.

To evaluate accuracy, we turned those forecasts into a dollar-neutral, industry-neutral paper portfolio, going long the most-undervalued name and short the bottom-half within each GICS industry.

The chart on the left is how this portfolio has done; NVIDIA on the right is a summary view of the forecasts on one company.

For about three weeks, from August 8 through September 2, 2025, this same Aug 5 batch was the publicly accessible Stockfisher product (top 10 companies free, the rest behind paid tiers), before being refreshed with a newer batch of forecasts.

Read more about our methodology for forecasting stock returns: can AI forecast stocks through fundamental analysis?, calculating intrinsic value from scratch, and a superforecasting approach to stock fundamentals.

Net (Long − Short): +26.6%
Long basket (51): +31.1%
Short basket (231, underlying): +4.5%
Forecast as of: Aug 5, 2025
Sharpe ratio: 3.1
Beta vs S&P: 0.26
Alpha (ann.) vs S&P: +21%

Industry-neutral, dollar-neutral paper portfolio: long the most undervalued name and short the bottom half within each GICS industry (51 long bets, 231 short bets). Both baskets plotted as underlying returns; Net = Long − Short. No commissions, borrow cost, or dividends. Not investment advice.
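A minimal sketch of how such a paper portfolio could be assembled, under several assumptions that are not stated on this page: each company carries a single hypothetical `undervaluation` score derived from the forecasts, the bottom half is rounded down, single-company industries are skipped, shorts are equal-weighted within an industry, and industries are equal-weighted within each basket.

```python
from collections import defaultdict
from statistics import mean


def build_positions(stocks):
    """Pick one long and the bottom-half shorts within each industry.

    stocks: list of dicts with 'ticker', 'industry', and a hypothetical
    'undervaluation' score (higher means more undervalued).
    """
    by_industry = defaultdict(list)
    for stock in stocks:
        by_industry[stock["industry"]].append(stock)

    positions = {}
    for industry, members in by_industry.items():
        if len(members) < 2:
            continue  # assumption: an industry needs both a long and a short
        ranked = sorted(members, key=lambda s: s["undervaluation"], reverse=True)
        n_short = len(ranked) // 2  # assumption: bottom half, rounded down
        positions[industry] = {
            "long": ranked[0]["ticker"],
            "short": [s["ticker"] for s in ranked[-n_short:]],
        }
    return positions


def basket_returns(positions, returns):
    """Return (long, short, net) underlying returns, with net = long - short."""
    long_ret = mean(returns[p["long"]] for p in positions.values())
    short_ret = mean(
        mean(returns[t] for t in p["short"]) for p in positions.values()
    )
    return long_ret, short_ret, long_ret - short_ret
```

Equal dollars long and short within every industry is what makes the sketch both dollar-neutral and industry-neutral; as in the caption, it ignores commissions, borrow cost, and dividends.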

Nvidia (NVDA), Semiconductors & Semiconductor Equipment

Forecasts as of August 6, 2025

Revenue forecast, 2030: $250.0B – $462.5B – $800.0B
Earnings forecast, 2030: $62.5B – $168.8B – $384.0B
Shareholder payout forecast, 2030: 10.0% – 42.5% – 65.0%

RetroSearch

DRB and BTF-2 use RetroSearch, a system designed to serve agents a frozen, previously scraped version of the internet instead of the live pages, allowing reproducible runs even as the internet changes, and enabling forecasting tasks to be run as "pastcasting".

RetroSearch aims to emulate Google search (specifically, the Serper search API) as closely as possible, to minimize differences between live and "retro" agent runs. A single RetroSearch query proceeds as follows:

  • Run a live Serper search for the query
  • Look up pages obtained from live search in the RetroSearch database and other archive sources
  • If the page is not found in the RetroSearch database, remove it from the results
  • Write new snippets from a sample of page content using a simple LLM
  • Return the results in the original format of the Google results

This approach gives agents a search experience consistent with real search, but backed exclusively by pages for which we have a frozen version. The following diagram from the paper illustrates the process:

[Figure: diagram showing how RetroSearch provides frozen web snapshots to agents]
Illustration of the system architecture of Deep Research Bench using RetroSearch. This shows the flow from task definition through the scraping pipeline that populates the RetroSearch database prior to running the benchmark, and then how agents use RetroSearch via an API at the time of task evaluation.