LLM Forecasting Literature Review and Research Plan

Thesis

The literature has moved from "can an LLM answer forecasting questions?" to "can an agentic system, with timestamped retrieval, calibrated probabilities, and market-aware evaluation, add information beyond human crowds and prediction-market prices?" The answer is mixed. LLM systems now approach strong human aggregates in some benchmark settings, but prediction-market studies show that Brier score, calibration, and tradable profit are different targets.

Project recommendation: do not begin with a 30B-scale training run. First build the market data, leakage audit, evaluation, and baseline stack. Then reproduce an 8B OpenForesight-style training recipe. Train larger models only after small systems beat market prices, cheap ensembles, and OpenForecaster-8B on prospective questions after fees, spreads, slippage, and liquidity.

Compute Reality

OpenForesight SFT, Qwen3-8B: 40 H100-hr
OpenForesight final RL: ~1,000 H100-hr
OpenForesight ablations: ~20,000 H100-hr
Outcome-RL 14B setup: 8 H100 x ~3 days

The final run is not the budget. The ablation loop is the budget.
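
As a quick check on these figures, the arithmetic can be written out. The sketch below uses only the approximate numbers quoted above; the variable names are illustrative, and no price per GPU-hour is assumed.

```python
# Approximate H100-hour figures as quoted in this section.
sft_hours = 40            # OpenForesight SFT, Qwen3-8B
final_rl_hours = 1_000    # OpenForesight final RL run
ablation_hours = 20_000   # OpenForesight ablations

# Outcome-RL 14B setup: 8 GPUs running for roughly 3 days.
outcome_rl_hours = 8 * 3 * 24

# How the ablation loop compares with the final run and with the total.
ablation_to_final = ablation_hours / final_rl_hours
ablation_share = ablation_hours / (sft_hours + final_rl_hours + ablation_hours)

print(f"Outcome-RL 14B setup: ~{outcome_rl_hours} H100-hr")
print(f"Ablations vs final RL run: {ablation_to_final:.0f}x")
print(f"Ablation share of OpenForesight compute: {ablation_share:.0%}")
```

On these numbers the ablations are about twenty times the final run, which is why a budget based on the final run alone would be misleading.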

Market Reality

KalshiBench: frontier models are systematically overconfident.
Prediction Arena: live Kalshi agents lost money in the reported cohort.
PolyBench: only 2 of 7 models had positive simulated order-book returns.
TimeSeek: models are most useful early and on uncertain markets.

Failure Modes

Retrieval leakage and bad source timestamps.
Parametric memory of post-cutoff outcomes.
Outcome rewards reinforcing lucky reasoning.
Prompting that increases confidence without accuracy.
P&L overfitting to fees, depth, and market selection.

Best Public Starting Point

OpenForecaster-8B is the most relevant open model. It is Qwen3-8B post-trained on OpenForesight with GRPO and an Accuracy+Brier reward. UniPat EchoZ is useful as a design reference, but I did not find public Hugging Face EchoZ weights.

What The Literature Has Done

1. Static and Dynamic Forecasting Benchmarks

Autocast established the original problem: real forecasting questions, a dated news corpus, and human crowd baselines. The key result was negative for LLMs: the best ML baseline was far below expert human aggregates, though larger models and retrieval helped. Autocast also set the methodological norm that backtests must simulate what was knowable at the forecast date.

Schoenegger and Park tested GPT-4 in a live Metaculus tournament and found it underperformed the median human crowd. Halawi et al. then showed the first serious system improvement: retrieval, reasoning, and aggregation brought an LM system close to the human crowd. Their data scale also matters: 48,754 questions and over 7 million user forecasts, with a human crowd Brier around 0.149 and the system around 0.179 on all questions; simple system+crowd aggregation was even stronger.

ForecastBench moves evaluation from static backtests to dynamic unresolved questions. This is the right direction because any resolved test set becomes contaminated as model cutoffs advance. FutureX, BTF/BTF-2, MIRAI, and Prophet Arena extend this line by adding live updates, frozen corpora, agentic tool environments, and multi-horizon market events.

Benchmark	Main contribution	Implication for us
Autocast	6,707 tournament questions plus dated news corpus.	Timestamped retrieval is mandatory.
Halawi et al.	RAG + reasoning + aggregation approaches human crowd.	System design beats raw prompting.
ForecastBench	Dynamic 1,000-question benchmark from future events.	Final claims need prospective tests.
OpenForecast	43,417 complex events and 473,155 atomic events for open-ended forecasting.	Open-ended events require semantic evaluation, not just binary labels.
OpenForesight	52,183 generated forecasting questions and a trained 8B model.	Most practical open recipe for post-training.

2. Forecasting Agents and Retrieval

The agent literature converges on a simple pattern: search, reason, update, aggregate. ReAct and Toolformer provide the general tool-use template. Search-R1 shows how to train search use directly with RL and, importantly, masks retrieved tokens so the policy loss is computed only on model-generated tokens. That matters for any forecasting model trained with retrieved context.
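
The masking idea can be illustrated with a minimal sketch. This is not Search-R1's code: the function name, the per-token loss list, and the 0/1 mask convention are assumptions made for illustration. The point is only that retrieved tokens contribute nothing to the averaged loss.

```python
def masked_policy_loss(token_losses, is_generated):
    """Average per-token loss over model-generated tokens only.

    token_losses: per-token loss values for one rollout (hypothetical input).
    is_generated: 1 for tokens the policy produced, 0 for retrieved tokens.
    """
    if len(token_losses) != len(is_generated):
        raise ValueError("loss and mask must have the same length")
    kept = [loss for loss, m in zip(token_losses, is_generated) if m]
    if not kept:
        return 0.0  # nothing generated, so nothing to learn from
    return sum(kept) / len(kept)

# Toy rollout: three generated tokens, two retrieved tokens, one generated token.
losses = [0.5, 1.0, 1.5, 9.0, 9.0, 1.0]
mask = [1, 1, 1, 0, 0, 1]
loss = masked_policy_loss(losses, mask)
unmasked = sum(losses) / len(losses)
```

Without the mask, the two retrieved tokens would dominate the average even though the policy never chose them.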

The newest systems go beyond appending search results. Bayesian Linguistic Forecaster maintains a structured belief state with probability, evidence for/against, and open questions, runs multiple independent trials, aggregates in logit space, and calibrates with hierarchical Platt scaling. TimeSeek shows search is helpful on average but not uniformly: web search improved pooled Brier Skill Score for all models overall, but hurt in 6 of 50 model-checkpoint conditions. This argues for a selective search/defer/predict gate, not uniform tool use.
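
Logit-space aggregation and Platt-style recalibration can be sketched in a few lines. This is a simplified illustration, not the Bayesian Linguistic Forecaster implementation: it uses a single global pair of parameters `a` and `b` (hypothetical names) rather than a hierarchical fit, and the trial probabilities are made up.

```python
import math

def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1 to stay finite."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def aggregate_logit(probs):
    """Pool independent trial forecasts by averaging their log-odds."""
    return sigmoid(sum(logit(p) for p in probs) / len(probs))

def platt(p, a=1.0, b=0.0):
    """Platt-style recalibration: sigmoid(a * logit(p) + b).

    a and b are assumed to be fit on resolved questions; a < 1 shrinks
    forecasts toward 0.5, which counteracts overconfidence.
    """
    return sigmoid(a * logit(p) + b)

trials = [0.6, 0.7, 0.9]          # illustrative trial outputs
pooled = aggregate_logit(trials)
shrunk = platt(pooled, a=0.8)
```

Averaging in logit space gives confident trials more pull than a linear mean does, so a separate calibration step is a natural companion.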

3. Training LLMs To Forecast

The training literature has three useful recipes and one caution.

Self-play/DPO: generate pairs of reasoning traces, rank them by realized outcome distance, then DPO fine-tune. Turtel et al. report 7-10% accuracy gains on 14B models. The risk is reinforcing lucky reasoning traces.
Outcome RL: train on Brier-style rewards using GRPO/ReMax variants. The strongest practical lesson is to avoid per-question standard-deviation normalization when it erases useful calibration gradients. Guardrails and strict output parsing are not optional.
Accuracy+Brier rewards: RLCR and OpenForesight show why binary correctness alone is dangerous. Calibration needs to be part of the reward. OpenForesight adds an accuracy term because a Brier-only reward can encourage answering "Unknown" at near-zero confidence.
OpenForesight scale: 250,000 articles, heavy filtering, 52,183 final samples, Qwen3-8B training, offline retrieval, and GRPO. This is the closest reproducible recipe.
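
The point about standard-deviation normalization in the list above can be made concrete. The sketch below is an illustration under assumptions, not the outcome-RL or OpenForesight code: it uses a negative Brier score as the reward for a binary outcome and compares mean-centered group advantages with and without dividing by the group standard deviation.

```python
import statistics

def brier_reward(p, outcome):
    """Negative Brier score for a binary outcome in {0, 1}; higher is better."""
    return -((p - outcome) ** 2)

def advantages(rewards, normalize_std=False, eps=1e-8):
    """Group-relative advantages for rollouts on the same question."""
    mean = statistics.fmean(rewards)
    centered = [r - mean for r in rewards]
    if not normalize_std:
        return centered
    std = statistics.pstdev(rewards)
    return [c / (std + eps) for c in centered]

# Question A: rollouts nearly agree, so the reward differences are tiny.
near = [brier_reward(p, 1) for p in (0.78, 0.80, 0.82)]
# Question B: rollouts disagree strongly, so the differences are large.
far = [brier_reward(p, 1) for p in (0.10, 0.50, 0.90)]

raw_near = max(abs(a) for a in advantages(near))
raw_far = max(abs(a) for a in advantages(far))
norm_near = max(abs(a) for a in advantages(near, normalize_std=True))
norm_far = max(abs(a) for a in advantages(far, normalize_std=True))
```

With mean-centering only, question B drives a much larger update than question A. After dividing by the per-question standard deviation, both questions push with roughly the same magnitude, so the information about how much calibration differed between rollouts is lost.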

The caution is that outcome rewards do not prove the model learned good reasoning. A well-reasoned miss on a tail event still receives a poor outcome reward, and a lucky guess still receives a good one. Any training result must be paired with calibration, reasoning-quality, and leakage audits.

4. Calibration and Proper Scoring

Forecasting is probability elicitation. Gneiting and Raftery formalize why strictly proper scoring rules incentivize honest probabilities. Hanson connects scoring rules to markets: market scoring rules elicit group consensus, and LMSR gives the bridge between probabilistic forecasts and market prices. LLM-specific calibration papers show that models can verbalize confidence, but scaling and reasoning do not automatically produce calibrated probabilities. KalshiBench is the clearest recent warning: five frontier models were overconfident, and only Claude Opus 4.5 had positive Brier Skill Score in that sample.
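
The bridge between forecasts and prices can be shown with Hanson's logarithmic market scoring rule, whose cost function is C(q) = b * log(sum_i exp(q_i / b)) and whose prices are the softmax of q / b. The sketch below is a minimal illustration; the liquidity parameter `b` and the trade size are arbitrary example values.

```python
import math

def lmsr_cost(quantities, b):
    """LMSR cost function, computed with a max shift for numerical stability."""
    m = max(q / b for q in quantities)
    return b * (m + math.log(sum(math.exp(q / b - m) for q in quantities)))

def lmsr_prices(quantities, b):
    """Instantaneous prices: softmax of outstanding shares over b."""
    m = max(q / b for q in quantities)
    weights = [math.exp(q / b - m) for q in quantities]
    total = sum(weights)
    return [w / total for w in weights]

def trade_cost(quantities, deltas, b):
    """Amount a trader pays to move outstanding shares by deltas."""
    after = [q + d for q, d in zip(quantities, deltas)]
    return lmsr_cost(after, b) - lmsr_cost(quantities, b)

b = 100.0
start = [0.0, 0.0]                       # [YES shares, NO shares]
cost = trade_cost(start, [10.0, 0.0], b)  # buy 10 YES
end_prices = lmsr_prices([10.0, 0.0], b)
```

The prices always sum to one, so they can be read as a probability vector, and the cost of a trade lies between the pre-trade and post-trade price times the number of shares bought.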

5. Temporal Leakage

Leakage is not a side issue. It is the credibility issue. Autocast and Halawi use dated retrieval. ForecastBench avoids leakage with unresolved questions. Pitfalls in Evaluating LM Forecasters argues that many claims still overstate real-world performance. Shapley-DCLR and TimeSPEC are the most useful recent frameworks: decompose rationales into atomic claims, determine temporal provenance, and weight leaked claims by decision impact. Our project should implement a lighter version first and a full claim-level audit for paper-grade results.
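
A lighter version of the claim-level idea might look like the sketch below. It is not Shapley-DCLR: the impact weights are assumed to come from some upstream estimate, and the `Claim` fields and the `leakage_score` function are hypothetical names. It only shows the shape of "weight leaked claims by decision impact."

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Claim:
    text: str
    earliest_known: date  # assumed earliest date the claim could be known
    impact: float         # assumed non-negative decision-impact weight

def leakage_score(claims, forecast_date):
    """Share of total decision impact carried by claims dated after the forecast."""
    total = sum(c.impact for c in claims)
    if total == 0:
        return 0.0
    leaked = sum(c.impact for c in claims if c.earliest_known > forecast_date)
    return leaked / total

# Illustrative rationale with three atomic claims (made-up content).
rationale = [
    Claim("Polling average favored candidate A", date(2024, 3, 1), 0.2),
    Claim("Incumbent party held the seat twice before", date(2020, 1, 1), 0.3),
    Claim("Candidate A conceded", date(2024, 6, 10), 0.5),
]
score = leakage_score(rationale, forecast_date=date(2024, 4, 1))
```

A rationale with one leaked but decisive claim scores high even if most of its claims are clean, which is the behavior a count-based check would miss.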

Prediction Markets: What Changes When Events Are Tradable

Prediction-market forecasting is not just another Brier benchmark. A market event has prices, bid/ask spreads, depth, fees, market makers, resolution rules, and trader selection. A model can improve Brier and still lose money if it trades too late, crosses wide spreads, overbets low-liquidity contracts, or follows public prices without adding information.
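
The gap between a good probability and a profitable trade can be illustrated by walking an order book. The sketch below is a toy model under stated assumptions: a $1 binary YES contract, ask levels given as (price, size) pairs sorted by price, and a flat per-share fee. Real venue fee schedules differ, and the numbers are invented.

```python
def average_fill_price(asks, shares):
    """Walk ask levels in price order; return the average price paid,
    or None if the book is too thin to fill the order."""
    remaining = shares
    cost = 0.0
    for price, size in asks:
        take = min(remaining, size)
        cost += take * price
        remaining -= take
        if remaining <= 0:
            return cost / shares
    return None

def expected_profit_per_share(p_model, avg_price, fee_per_share):
    """Expected value of buying one YES share that pays $1 if the event occurs."""
    return p_model - avg_price - fee_per_share

asks = [(0.60, 100), (0.62, 100), (0.66, 200)]  # illustrative book
p_model, fee = 0.63, 0.01                        # illustrative forecast and fee

small = expected_profit_per_share(p_model, average_fill_price(asks, 50), fee)
large = expected_profit_per_share(p_model, average_fill_price(asks, 300), fee)
```

The same forecast has positive expected value for the small order and negative expected value for the large one, because the larger order pays through three price levels.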

Paper	What it did	Lesson
KalshiBench	300 temporally filtered Kalshi questions, five frontier models.	Overconfidence is widespread; calibration is a distinct capability.
Prophet Arena	1,367 resolved events and 72,136 markets with common contexts.	Evaluate Brier, calibration, and economic value together.
Prediction Arena	Autonomous agents trading with real capital on Kalshi and Polymarket.	Live trading exposes model, execution, and venue weaknesses.
PolyBench	38,666 Polymarket markets with CLOB snapshots and news.	Order-book simulation and slippage are core evaluation components.
TimeSeek	150 Kalshi markets at five lifecycle checkpoints, with/without search.	LLMs add the most value early and in high-uncertainty regimes.
PolySwarm / Evidence Markets	Multi-agent trading and limits of evidence aggregation.	Reflexivity, manipulation, and resolution ambiguity are open problems.

Strategic implication: the differentiated project is not simply "train a better model." It is to measure when a model adds information beyond price, when it should defer to the market, and whether that edge survives realistic execution.

Training Strategy

Training is feasible at 8B scale, but the decision should be gated by baselines. The right default is to build an agentic forecaster and evaluator first, then train only where the evaluator shows model-specific errors that post-training can plausibly fix.

Option	Pros	Cons	When to use
No training: agent + ensemble	Fast, cheap, strong baseline; can use BLF-style belief states and calibration.	Depends on API models; not a proprietary asset.	First milestone.
LoRA/SFT on 8B	Cheap and stable; teaches format, base-rate discipline, and calibration language.	Limited true forecasting gain if data are weak.	After evaluation stack exists.
8B GRPO/ReMax RL	Closest to OpenForesight and outcome-RL recipes.	~1,000 H100-hr final run; reward and leakage risks.	After baselines show clear training target.
Market-specific RL	Targets tradable events directly.	P&L reward is noisy and overfit-prone; market impact issues.	Only after probability calibration works.
30B-A3B or larger	Potentially stronger reasoning and retrieval synthesis.	Multi-node complexity and six-figure ablation budget.	Only if 8B beats market/ensemble baselines cleanly.

What We Should Do, Step By Step

Step 0

Define the exact target.
Tradable binary/multi-outcome events on Kalshi and Polymarket, plus open-ended event questions for broader generalization.

Done when: we have a written schema for question, forecast date, resolution criteria, source cutoff, market snapshot, outcome, and execution assumptions.
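
One way to make the schema concrete is a set of frozen dataclasses. The sketch below is only a starting point: every class and field name is an assumption chosen to mirror the items listed above, and the two validation rules are the minimum needed to reject records that could not have been produced at forecast time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)
class MarketSnapshot:
    venue: str            # e.g. "kalshi" or "polymarket"
    market_id: str
    observed_at: datetime
    best_bid: float
    best_ask: float

@dataclass(frozen=True)
class ExecutionAssumptions:
    fee_per_contract: float
    max_order_size: int

@dataclass(frozen=True)
class ForecastRecord:
    question: str
    forecast_date: datetime
    resolution_criteria: str
    source_cutoff: datetime
    snapshot: MarketSnapshot
    execution: ExecutionAssumptions
    outcome: Optional[int] = None  # filled in only after resolution

    def __post_init__(self):
        # Sources and prices must both be knowable at forecast time.
        if self.source_cutoff > self.forecast_date:
            raise ValueError("source cutoff is after the forecast date")
        if self.snapshot.observed_at > self.forecast_date:
            raise ValueError("market snapshot is after the forecast date")
```

Keeping the outcome optional lets the same record type serve prospective questions before they resolve.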

Step 1

Build the market data repository.
Store market prices, bid/ask, CLOB depth, volume, fees, resolution text, close/resolution dates, and snapshots. Do not rely on final market pages.

Done when: every forecast can be replayed from a point-in-time state.
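
Point-in-time replay reduces to an "as of" lookup over stored snapshots. The sketch below is an in-memory illustration using the standard-library `bisect` module; the `SnapshotStore` class and its method names are hypothetical, and a real repository would sit on a database rather than Python lists.

```python
import bisect

class SnapshotStore:
    """Keeps snapshots per market, sorted by timestamp, for as-of lookups."""

    def __init__(self):
        self._times = {}  # market_id -> sorted list of timestamps
        self._rows = {}   # market_id -> snapshots in the same order

    def add(self, market_id, ts, row):
        times = self._times.setdefault(market_id, [])
        rows = self._rows.setdefault(market_id, [])
        i = bisect.bisect_right(times, ts)
        times.insert(i, ts)
        rows.insert(i, row)

    def as_of(self, market_id, ts):
        """Return the latest snapshot at or before ts, or None if there is none."""
        times = self._times.get(market_id, [])
        i = bisect.bisect_right(times, ts)
        return self._rows[market_id][i - 1] if i else None

store = SnapshotStore()
store.add("M1", 20, {"bid": 0.45, "ask": 0.47})
store.add("M1", 10, {"bid": 0.40, "ask": 0.42})  # out-of-order insert
store.add("M1", 30, {"bid": 0.90, "ask": 0.92})
```

A forecast made at time 25 sees the snapshot from time 20, never the later price of 0.90 that a final market page would show.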

Step 2

Build the timestamped retrieval corpus.
Use static snapshots of news, official releases, SEC/EDGAR, FRED/BLS/BEA, sports APIs, and market pages. Hash documents and retain publication/retrieval timestamps.

Done when: a forecast on date T cannot retrieve documents after T.
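
The cutoff rule can be enforced at the retrieval boundary. The sketch below is a minimal illustration with hypothetical names (`Document`, `visible_at`): it hashes document text with SHA-256 and filters on publication time, with an optional stricter mode that also requires the stored snapshot to have been captured by T, since pages can be edited after publication.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Document:
    url: str
    text: str
    published_at: datetime
    retrieved_at: datetime

    @property
    def sha256(self):
        """Content hash used to detect silent edits between crawls."""
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()

def visible_at(docs, forecast_time, require_snapshot_before=False):
    """Documents a forecaster at forecast_time is allowed to retrieve."""
    allowed = []
    for d in docs:
        if d.published_at > forecast_time:
            continue  # published after T: never visible
        if require_snapshot_before and d.retrieved_at > forecast_time:
            continue  # snapshot taken after T: content may have changed
        allowed.append(d)
    return allowed

T = datetime(2025, 3, 1)
docs = [
    Document("https://example.com/a", "early report", datetime(2025, 2, 1), datetime(2025, 2, 2)),
    Document("https://example.com/b", "backfilled page", datetime(2025, 2, 15), datetime(2025, 6, 1)),
    Document("https://example.com/c", "result announced", datetime(2025, 3, 5), datetime(2025, 3, 6)),
]
```

The strict mode trades recall for safety, so it is probably best reserved for final experiments.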

Step 3

Implement evaluation before modeling.
Brier, log score, calibration curves, ECE, market-price Brier Skill Score, P&L with fees/spreads/slippage, drawdown, and coverage-risk curves.

Done when: market price and naive baselines can be reproduced automatically.
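
The probability metrics in this step are short enough to sketch directly. The functions below are a minimal stdlib illustration for binary outcomes; the equal-width binning for ECE and the choice of the market price as the Brier Skill Score reference are assumptions that match the description above. P&L, drawdown, and coverage-risk curves are not covered here.

```python
import math

def brier(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_score(probs, outcomes, eps=1e-12):
    """Mean log-likelihood of the realized outcomes; higher is better."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total += math.log(p if y == 1 else 1 - p)
    return total / len(probs)

def brier_skill_score(model_probs, market_probs, outcomes):
    """1 - Brier(model) / Brier(market); positive means better than price."""
    reference = brier(market_probs, outcomes)
    if reference == 0:
        raise ValueError("reference Brier score is zero")
    return 1 - brier(model_probs, outcomes) / reference

def ece(probs, outcomes, n_bins=10):
    """Expected calibration error with equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = 0.0
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            mean_y = sum(y for _, y in members) / len(members)
            total += len(members) / len(probs) * abs(mean_p - mean_y)
    return total
```

Running these on the market price itself gives the baseline row every model result should be compared against.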

Step 4

Run no-training baselines.
Market price, market price plus calibration transforms, OpenForecaster-8B, frontier single models, frontier ensembles, and BLF-style agentic search with logit aggregation.

Done when: we know where models add value beyond price and where they should defer.

Step 5

Add leakage audit.
Start with source cutoff checks, answer alias filtering, and post-cutoff phrase detection. Add claim-level Shapley-DCLR-style audit for final experiments.

Done when: each result has a leakage score or clean prospective design.
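
The three starting checks can be prototyped as simple flags on retrieved chunks. The sketch below is illustrative only: the phrase patterns are a made-up starter list, the function name is hypothetical, and an alias hit is a signal for review rather than proof of leakage, since an answer string can legitimately appear in pre-cutoff text.

```python
import re
from datetime import date

# Illustrative hindsight phrases; a real list would be tuned on audited data.
HINDSIGHT_PATTERNS = [
    r"\bin hindsight\b",
    r"\bas it turned out\b",
    r"\bwas later (confirmed|announced)\b",
]

def audit_chunks(chunks, forecast_date, answer_aliases):
    """chunks: list of (text, published_date). Returns (index, reason) flags."""
    flags = []
    for i, (text, published) in enumerate(chunks):
        lowered = text.lower()
        if published > forecast_date:
            flags.append((i, "after_cutoff"))
        if any(alias.lower() in lowered for alias in answer_aliases):
            flags.append((i, "answer_alias"))
        if any(re.search(pattern, lowered) for pattern in HINDSIGHT_PATTERNS):
            flags.append((i, "hindsight_phrase"))
    return flags

chunks = [
    ("The committee meets next week to set rates.", date(2025, 1, 10)),
    ("As it turned out, the committee held rates steady.", date(2025, 1, 12)),
    ("Rates were held steady, officials said.", date(2025, 2, 1)),
]
flags = audit_chunks(chunks, date(2025, 1, 15), ["held rates steady"])
```

The second chunk is dated before the cutoff but is still flagged twice, which is the bad-timestamp case that a date filter alone would pass.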

Step 6

Train small.
Reproduce OpenForesight/OpenForecaster-style training on Qwen3-8B with structured probability outputs, random retrieved chunks, Accuracy+Brier reward, no std-normalized advantages, and strict formatting guardrails.

Done when: the model beats base Qwen3-8B, OpenForecaster-8B, and ensemble baselines on fresh held-out questions.

Step 7

Train market-specific calibration, not direct trading first.
Use Brier/log rewards and calibration losses on resolved tradable events. Treat P&L as a holdout selection metric, not the first reward.

Done when: gains survive fees, spreads, liquidity, and out-of-time markets.

Step 8

Decide on large-model training.
Only scale to 30B-A3B if 8B results show robust leakage-clean alpha or if the scientific contribution requires open weights at larger scale.

Done when: expected value of compute exceeds cheap ensemble and agent alternatives.

Research Questions Worth Writing A Paper Around

Calibration to profit: characterize when a calibrated probability forecaster produces positive expected return under CLOB fees, spreads, and depth.
Selective deference: learn a gate that chooses market price, model, search-agent, or abstain by horizon, liquidity, category, and uncertainty.
Leakage-robust backtesting: adapt Shapley-DCLR to prediction-market rationales and open-ended forecasts.
Cross-venue price discovery: measure Kalshi vs Polymarket lead-lag and whether LLMs exploit or merely follow it.
Reflexivity: test whether public AI forecasts alter prices and degrade future training labels.
Aggregation vs training: compare OpenForesight-style training to a cost-matched ensemble of frontier models.

Hosted PDF Library

I downloaded and uploaded the open PDFs used for this review. Most are arXiv papers; the non-arXiv files are public PDFs from ACL, author/institution pages, or NBER. If a paper is not openly mirrorable, it should be linked rather than rehosted.

Area	Paper	PDF
Benchmark	Autocast: Forecasting Future World Events with Neural Networks	PDF
Calibration	Language Models (Mostly) Know What They Know	PDF
Agents	ReAct: Synergizing Reasoning and Acting	PDF
Process rewards	Solving Math Word Problems with Process- and Outcome-Based Feedback	PDF
Tools	Toolformer: Language Models Can Teach Themselves to Use Tools	PDF
Calibration	Just Ask for Calibration	PDF
Process rewards	Let's Verify Step by Step	PDF
Benchmark	Large Language Model Prediction Capabilities	PDF
Process rewards	Math-Shepherd	PDF
RL	DeepSeekMath and GRPO	PDF
Human + AI	AI-Augmented Predictions	PDF
System	Approaching Human-Level Forecasting with Language Models	PDF
Ensemble	Wisdom of the Silicon Crowd	PDF
Leakage	Freshbench / Is Your LLM Outdated?	PDF
Prompting	Can Language Models Use Forecasting Strategies?	PDF
Agents	MIRAI: Evaluating LLM Agents for Event Forecasting	PDF
Dynamic eval	ForecastBench	PDF
Benchmark	Navigating Tomorrow	PDF
RL	DeepSeek-R1	PDF
Training	LLMs Can Teach Themselves to Better Predict the Future	PDF
Search RL	Search-R1	PDF
RL	DAPO	PDF
Training	Outcome-Based RL to Predict the Future	PDF
Leakage	ExAnte	PDF
Evaluation	Pitfalls in Evaluating LM Forecasters	PDF
Prompting	Prompt Engineering LLM Forecasting Capabilities	PDF
Backtesting	Bench to the Future	PDF
Calibration RL	Beyond Binary Rewards / RLCR	PDF
Training	Advancing Event Forecasting through Massive Training of LLMs	PDF
Live eval	FutureX	PDF
Calibration	ConfTuner	PDF
Markets	Prophet Arena	PDF
Markets	KalshiBench	PDF
Training	OpenForesight / OpenForecaster	PDF
Leakage	All Leaks Count / Shapley-DCLR	PDF
Markets	PolySwarm	PDF
Markets	TimeSeek	PDF
Markets	Prediction Arena	PDF
Markets	PolyBench	PDF
Agent	Agentic Forecasting using Sequential Bayesian Updating	PDF
Backtesting	BTF-2	PDF
Leakage	Teaching LLMs When Not to Know	PDF
Markets	Multi-Agent AI Oracles for Prediction Market Resolution	PDF
Calibration	LLMs Are Overconfident	PDF
Markets	Evidence Markets	PDF
Markets	Prediction Market Accuracy in the Long Run	PDF
Scoring	Strictly Proper Scoring Rules, Prediction, and Estimation	PDF
Markets	Logarithmic Market Scoring Rules	PDF
Open-ended	OpenForecast	PDF
Markets	Prediction Markets for Economic Forecasting	PDF