Thesis
The literature has moved from "can an LLM answer forecasting questions?" to "can an agentic system, with timestamped retrieval, calibrated probabilities, and market-aware evaluation, add information beyond human crowds and prediction-market prices?" The answer is mixed. LLM systems now approach strong human aggregates in some benchmark settings, but prediction-market studies show that Brier score, calibration, and tradable profit are different targets.
Project recommendation: do not begin with a 30B-scale training run. First build the market data, leakage audit, evaluation, and baseline stack. Then reproduce an 8B OpenForesight-style training recipe. Train larger models only after small systems beat market prices, cheap ensembles, and OpenForecaster-8B on prospective questions, net of fees, spreads, and slippage and within available liquidity.
Compute Reality
The final run is not the budget. The ablation loop is the budget.
Market Reality
- KalshiBench: frontier models are systematically overconfident.
- Prediction Arena: live Kalshi agents lost money in the reported cohort.
- PolyBench: only 2 of 7 models had positive simulated order-book returns.
- TimeSeek: models are most useful early and on uncertain markets.
Failure Modes
- Retrieval leakage and bad source timestamps.
- Parametric memory of outcomes that resolved after the forecast date but before the model's training cutoff.
- Outcome rewards reinforcing lucky reasoning.
- Prompting that increases confidence without accuracy.
- P&L overfitting to fees, depth, and market selection.
Best Public Starting Point
OpenForecaster-8B is the most relevant open model. It is Qwen3-8B post-trained on OpenForesight with GRPO and an Accuracy+Brier reward. UniPat EchoZ is useful as a design reference, but I did not find public EchoZ weights on Hugging Face.
What The Literature Has Done
1. Static and Dynamic Forecasting Benchmarks
Autocast established the original problem: real forecasting questions, a dated news corpus, and human crowd baselines. The key result was negative for LLMs: the best ML baseline was far below expert human aggregates, though larger models and retrieval helped. Autocast also set the methodological norm that backtests must simulate what was knowable at the forecast date.
Schoenegger and Park tested GPT-4 in a live Metaculus tournament and found it underperformed the median human crowd. Halawi et al. then showed the first serious system improvement: retrieval, reasoning, and aggregation brought an LM system close to the human crowd. Their data scale also matters: 48,754 questions and over 7 million user forecasts. On all questions the human crowd Brier was around 0.149 and the system around 0.179 (lower is better), so the crowd still led; simple system+crowd aggregation was even stronger.
ForecastBench moves evaluation from static backtests to dynamic unresolved questions. This is the right direction because any resolved test set becomes contaminated as model cutoffs advance. FutureX, BTF/BTF-2, MIRAI, and Prophet Arena extend this line by adding live updates, frozen corpora, agentic tool environments, and multi-horizon market events.
| Benchmark | Main contribution | Implication for us |
|---|---|---|
| Autocast | 6,707 tournament questions plus dated news corpus. | Timestamped retrieval is mandatory. |
| Halawi et al. | RAG + reasoning + aggregation approaches human crowd. | System design beats raw prompting. |
| ForecastBench | Dynamic 1,000-question benchmark from future events. | Final claims need prospective tests. |
| OpenForecast | 43,417 complex events and 473,155 atomic events for open-ended forecasting. | Open-ended events require semantic evaluation, not just binary labels. |
| OpenForesight | 52,183 generated forecasting questions and a trained 8B model. | Most practical open recipe for post-training. |
2. Forecasting Agents and Retrieval
The agent literature converges on a simple pattern: search, reason, update, aggregate. ReAct and Toolformer provide the general tool-use template. Search-R1 shows how to train search use directly with RL and, importantly, masks retrieved tokens so the policy loss is computed only on model-generated tokens. That matters for any forecasting model trained with retrieved context.
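The masking idea can be illustrated with a minimal sketch. It assumes per-token log-probabilities, per-token advantages, and a 0/1 mask marking model-generated tokens are already available; the function name and input layout are hypothetical and are not taken from the Search-R1 codebase.

```python
def masked_policy_loss(token_logprobs, advantages, generated_mask):
    """Policy-gradient style loss averaged over model-generated tokens only.

    token_logprobs: log-probabilities of the sampled tokens, one per position
    advantages:     per-token advantages (often one sequence-level value repeated)
    generated_mask: 1 for tokens the model generated, 0 for retrieved text

    Assumption: a plain REINFORCE-style term (-advantage * logprob) stands in
    for whatever clipped objective a real trainer would use.
    """
    if not (len(token_logprobs) == len(advantages) == len(generated_mask)):
        raise ValueError("all inputs must have the same length")
    n_generated = sum(generated_mask)
    if n_generated == 0:
        return 0.0  # nothing the policy wrote, so nothing to optimize
    total = 0.0
    for logp, adv, keep in zip(token_logprobs, advantages, generated_mask):
        if keep:  # retrieved tokens contribute no loss and no gradient
            total += -adv * logp
    return total / n_generated
```

With log-probabilities `[-1.0, -2.0, -0.5, -0.5]`, unit advantages, and mask `[1, 0, 0, 1]`, only the first and last tokens count, so the two retrieved tokens in the middle cannot push the policy toward copying or predicting search results.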
The newest systems go beyond appending search results. Bayesian Linguistic Forecaster maintains a structured belief state with probability, evidence for/against, and open questions, runs multiple independent trials, aggregates in logit space, and calibrates with hierarchical Platt scaling. TimeSeek shows search is helpful on average but not uniformly: web search improved pooled Brier Skill Score for all models overall, but hurt in 6 of 50 model-checkpoint conditions. This argues for a selective search/defer/predict gate, not uniform tool use.
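Logit-space aggregation and Platt scaling are simple enough to sketch directly. The version below is a plain, non-hierarchical illustration: the function names are hypothetical, and the Platt parameters `a` and `b` are assumed to have been fitted elsewhere on resolved questions.

```python
import math

def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1 to keep the result finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def sigmoid(x):
    """Numerically stable inverse of logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def aggregate_in_logit_space(probs):
    """Average independent trial forecasts in log-odds, then map back."""
    if not probs:
        raise ValueError("need at least one probability")
    return sigmoid(sum(logit(p) for p in probs) / len(probs))

def platt_scale(p, a=1.0, b=0.0):
    """Recalibrate as sigmoid(a * logit(p) + b).

    Assumption: a and b are fitted offline; a < 1 shrinks forecasts
    toward 0.5, which is the direction needed to correct overconfidence.
    """
    return sigmoid(a * logit(p) + b)
```

For example, trials of 0.2 and 0.8 aggregate to 0.5, and `platt_scale(0.9, a=0.5)` pulls an overconfident 0.9 down to 0.75. A hierarchical variant would share or pool `a` and `b` across question categories; that part is not shown here.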
3. Training LLMs To Forecast
The training literature has three useful recipes, one reference implementation at scale, and one caution.
- Self-play/DPO: generate pairs of reasoning traces, rank each pair by how close its forecast was to the realized outcome, then DPO fine-tune. Turtel et al. report 7-10% accuracy gains on 14B models. The risk is reinforcing lucky reasoning traces.
- Outcome RL: train on Brier-style rewards using GRPO/ReMax variants. The strongest practical lesson is to avoid per-question standard-deviation normalization when it erases useful calibration gradients. Guardrails and strict output parsing are not optional.
- Accuracy+Brier rewards: RLCR and OpenForesight show why binary correctness alone is dangerous. Calibration needs to be part of the reward. OpenForesight adds an accuracy term because Brier-only can encourage "Unknown" at near-zero confidence.
- OpenForesight scale: 250,000 articles, heavy filtering, 52,183 final samples, Qwen3-8B training, offline retrieval, and GRPO. This is the closest reproducible recipe.
The caution is that outcome rewards do not prove the model learned good reasoning. A well-reasoned miss on a tail event still receives a poor outcome reward, and a lucky guess still receives a good one. Any training result must be paired with calibration, reasoning-quality, and leakage audits.
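The reward and advantage choices in the recipes above can be sketched as follows. The combined reward shown here (a correctness indicator minus a squared confidence error) is one simple way to join accuracy and Brier terms and is an assumption, not necessarily the exact form used by RLCR or OpenForesight; both function names are hypothetical.

```python
def accuracy_brier_reward(is_correct, confidence):
    """Accuracy term plus a (negative) Brier term on the stated confidence.

    Assumption: reward = y - (confidence - y)^2 with y = 1 if the answer is
    correct, else 0. A Brier-only reward would drop the leading y.
    """
    y = 1.0 if is_correct else 0.0
    return y - (confidence - y) ** 2

def mean_centered_advantages(rewards):
    """Group-relative advantages without dividing by the group's std.

    Keeping the raw scale means a group whose samples differ only slightly
    in calibration yields small updates instead of unit-variance ones.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

Under a Brier-only reward, a wrong "Unknown" at confidence 0 and a correct answer at confidence 1 both score 0, which is why the accuracy term matters: here they score 0 and 1 respectively. Note that neither function looks at the reasoning trace, so the caution about lucky guesses applies to this sketch as well.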
4. Calibration and Proper Scoring
Forecasting is probability elicitation. Gneiting and Raftery formalize why strictly proper scoring rules incentivize honest probabilities. Hanson connects scoring rules to markets: market scoring rules elicit group consensus, and LMSR gives the bridge between probabilistic forecasts and market prices. LLM-specific calibration papers show that models can verbalize confidence, but scaling and reasoning do not automatically produce calibrated probabilities. KalshiBench is the clearest recent warning: five frontier models were overconfident, and only Claude Opus 4.5 had positive Brier Skill Score in that sample.
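For reference, the standard definitions behind this paragraph can be written out. The notation is chosen here for illustration: p is the reported probability, q the forecaster's true belief, Y the binary outcome, s the vector of outstanding shares per outcome, and b the LMSR liquidity parameter.

```latex
% Brier score for a binary outcome Y \in \{0, 1\} and reported probability p
S(p, Y) = (p - Y)^2

% Expected score when the forecaster believes P(Y = 1) = q
\mathbb{E}_{Y \sim \mathrm{Bern}(q)}\big[ S(p, Y) \big]
  = q (1 - p)^2 + (1 - q) p^2
  = (p - q)^2 + q (1 - q)

% LMSR cost function and the instantaneous price of outcome i
C(\mathbf{s}) = b \log \sum_{j} e^{s_j / b},
\qquad
\pi_i(\mathbf{s}) = \frac{\partial C}{\partial s_i}
  = \frac{e^{s_i / b}}{\sum_{j} e^{s_j / b}}
```

The expected Brier score is minimized only at p = q, which is what "strictly proper" means in this setting. The LMSR prices are positive and sum to one, so they can be read as a probability distribution, which is the bridge between forecasts and market prices.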
5. Temporal Leakage
Leakage is not a side issue. It is the credibility issue. Autocast and Halawi et al. use dated retrieval. ForecastBench avoids leakage with unresolved questions. Pitfalls in Evaluating LM Forecasters argues that many claims still overstate real-world performance. Shapley-DCLR and TimeSPEC are the most useful recent frameworks: decompose rationales into atomic claims, determine temporal provenance, and weight leaked claims by decision impact. Our project should implement a lighter version first and a full claim-level audit for paper-grade results.
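A lighter first version could be a strict timestamp gate on retrieved documents, sketched below. This is only a source cutoff check, not a claim-level audit in the Shapley-DCLR sense; the `Document` fields and function names are hypothetical, and all timestamps are assumed to be timezone-aware.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

@dataclass(frozen=True)
class Document:
    url: str
    text: str
    published_at: Optional[datetime]  # None when the source timestamp is unknown
    retrieved_at: datetime

def content_hash(doc: Document) -> str:
    """SHA-256 of the document text, so a snapshot can be verified later."""
    return hashlib.sha256(doc.text.encode("utf-8")).hexdigest()

def filter_by_cutoff(docs: List[Document], forecast_time: datetime) -> List[Document]:
    """Keep documents with a known publication time strictly before forecast_time.

    Assumption: documents with a missing timestamp are dropped, since
    unknown provenance is treated as potential leakage.
    """
    kept = []
    for doc in docs:
        if doc.published_at is None:
            continue
        if doc.published_at < forecast_time:
            kept.append(doc)
    return kept
```

Dropping undated documents is deliberately conservative: it costs some recall but avoids relying on bad source timestamps. A page that was published before the cutoff and silently edited afterward would still pass this gate, which is one reason to store static snapshots with hashes rather than re-fetching URLs.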
Prediction Markets: What Changes When Events Are Tradable
Prediction-market forecasting is not just another Brier benchmark. A market event has prices, bid/ask spreads, depth, fees, market makers, resolution rules, and trader selection. A model can improve Brier and still lose money if it trades too late, crosses wide spreads, overbets low-liquidity contracts, or follows public prices without adding information.
| Paper | What it did | Lesson |
|---|---|---|
| KalshiBench | 300 temporally filtered Kalshi questions, five frontier models. | Overconfidence is widespread; calibration is a distinct capability. |
| Prophet Arena | 1,367 resolved events and 72,136 markets with common contexts. | Evaluate Brier, calibration, and economic value together. |
| Prediction Arena | Autonomous agents trading with real capital on Kalshi and Polymarket. | Live trading exposes model, execution, and venue weaknesses. |
| PolyBench | 38,666 Polymarket markets with CLOB snapshots and news. | Order-book simulation and slippage are core evaluation components. |
| TimeSeek | 150 Kalshi markets at five lifecycle checkpoints, with/without search. | LLMs add the most value early and in high-uncertainty regimes. |
| PolySwarm / Evidence Markets | Multi-agent trading and limits of evidence aggregation. | Reflexivity, manipulation, and resolution ambiguity are open problems. |
Strategic implication: the differentiated project is not simply "train a better model." It is to measure when a model adds information beyond price, when it should defer to the market, and whether that edge survives realistic execution.
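The gap between a probability edge and a tradable edge can be made concrete with a small sketch. It assumes binary contracts that pay $1, a taker who crosses the spread, and a flat per-contract fee; real venue fee schedules and order-book depth are more complicated, and the function name is hypothetical.

```python
def best_taker_trade(p_model, bid, ask, fee=0.0):
    """Pick the better of buying YES at the ask or NO at (1 - bid).

    Returns (side, expected_edge) where expected_edge is the expected profit
    per $1 contract under the model's own probability p_model.
    Assumptions: flat fee per contract, unlimited depth at the quoted prices.
    """
    if not 0.0 <= bid <= ask <= 1.0:
        raise ValueError("expected 0 <= bid <= ask <= 1")
    yes_edge = p_model - ask - fee   # pay ask, receive 1 with probability p_model
    no_edge = bid - p_model - fee    # pay 1 - bid, receive 1 with probability 1 - p_model
    if yes_edge > 0 and yes_edge >= no_edge:
        return ("buy_yes", yes_edge)
    if no_edge > 0:
        return ("buy_no", no_edge)
    return ("no_trade", 0.0)
```

With a model probability of 0.60 and a tight 0.54/0.58 market, a one-cent fee leaves about one cent of expected edge on the YES side. With the same belief and a wide 0.50/0.62 market there is no trade at all, even though the model disagrees with the midpoint by four cents. The expected edge is also only as good as `p_model`, so an overconfident model will systematically overstate it.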
Training Strategy
Training is feasible at 8B scale, but the decision should be gated by baselines. The right default is to build an agentic forecaster and evaluator first, then train only where the evaluator shows model-specific errors that post-training can plausibly fix.
| Option | Pros | Cons | When to use |
|---|---|---|---|
| No training: agent + ensemble | Fast, cheap, strong baseline; can use BLF-style belief states and calibration. | Depends on API models; not a proprietary asset. | First milestone. |
| LoRA/SFT on 8B | Cheap and stable; teaches format, base-rate discipline, and calibration language. | Limited true forecasting gain if data are weak. | After evaluation stack exists. |
| 8B GRPO/ReMax RL | Closest to OpenForesight and outcome-RL recipes. | ~1,000 H100-hr final run; reward and leakage risks. | After baselines show clear training target. |
| Market-specific RL | Targets tradable events directly. | P&L reward is noisy and overfit-prone; market impact issues. | Only after probability calibration works. |
| 30B-A3B or larger | Potentially stronger reasoning and retrieval synthesis. | Multi-node complexity and six-figure ablation budget. | Only if 8B beats market/ensemble baselines cleanly. |
What We Should Do, Step By Step
1. Define the target: tradable binary and multi-outcome events on Kalshi and Polymarket, plus open-ended event questions for broader generalization.
2. Build the market data store: record market prices, bid/ask, CLOB depth, volume, fees, resolution text, close/resolution dates, and snapshots. Do not rely on final market pages.
3. Build the evidence corpus: use static snapshots of news, official releases, SEC/EDGAR, FRED/BLS/BEA, sports APIs, and market pages. Hash documents and retain publication/retrieval timestamps.
4. Build the evaluation stack: Brier, log score, calibration curves, ECE, market-price Brier Skill Score, P&L with fees/spreads/slippage, drawdown, and coverage-risk curves.
5. Establish baselines: market price, market price plus calibration transforms, OpenForecaster-8B, frontier single models, frontier ensembles, and BLF-style agentic search with logit aggregation.
6. Audit leakage: start with source cutoff checks, answer alias filtering, and post-cutoff phrase detection. Add a claim-level Shapley-DCLR-style audit for final experiments.
7. Train at 8B: reproduce OpenForesight/OpenForecaster-style training on Qwen3-8B with structured probability outputs, random retrieved chunks, Accuracy+Brier reward, no std-normalized advantages, and strict formatting guardrails.
8. Adapt to markets: use Brier/log rewards and calibration losses on resolved tradable events. Treat P&L as a holdout selection metric, not the first reward.
9. Scale conditionally: move to 30B-A3B only if 8B results show robust leakage-clean alpha or if the scientific contribution requires open weights at larger scale.
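The probability metrics named in the evaluation step above can be sketched in a few functions. This is a minimal illustration for binary outcomes, assuming equal-width ECE bins and the market price as the Brier Skill Score reference; the function names are hypothetical, and P&L, drawdown, and coverage-risk curves are not covered.

```python
import math

def brier_score(probs, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_score(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood (lower is better), clipped for stability."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total += -math.log(p if y == 1 else 1.0 - p)
    return total / len(probs)

def brier_skill_score(model_probs, market_probs, outcomes):
    """1 - BS_model / BS_market; positive means the model beats the price."""
    reference = brier_score(market_probs, outcomes)
    if reference == 0:
        raise ValueError("reference Brier score is zero")
    return 1.0 - brier_score(model_probs, outcomes) / reference

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted mean gap between average forecast and hit rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        hit_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(mean_p - hit_rate)
    return ece
```

Using the market price as the reference makes the skill score answer the question this project cares about: whether the model adds information beyond price. A positive value on a small sample is weak evidence, so the score should be reported with the question count and, ideally, a confidence interval.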
Research Questions Worth Writing A Paper Around
- Calibration to profit: characterize when a calibrated probability forecaster produces positive expected return under CLOB fees, spreads, and depth.
- Selective deference: learn a gate that chooses market price, model, search-agent, or abstain by horizon, liquidity, category, and uncertainty.
- Leakage-robust backtesting: adapt Shapley-DCLR to prediction-market rationales and open-ended forecasts.
- Cross-venue price discovery: measure Kalshi vs Polymarket lead-lag and whether LLMs exploit or merely follow it.
- Reflexivity: test whether public AI forecasts alter prices and degrade future training labels.
- Aggregation vs training: compare OpenForesight-style training to a cost-matched ensemble of frontier models.
Hosted PDF Library
I downloaded and uploaded the open PDFs used for this review. Most are arXiv papers; the non-arXiv files are public PDFs from ACL, author/institution pages, or NBER. If a paper is not openly mirrorable, it should be linked rather than rehosted.
| Area | Paper |
|---|---|
| Benchmark | Autocast: Forecasting Future World Events with Neural Networks |
| Calibration | Language Models (Mostly) Know What They Know |
| Agents | ReAct: Synergizing Reasoning and Acting |
| Process rewards | Solving Math Word Problems with Process- and Outcome-Based Feedback |
| Tools | Toolformer: Language Models Can Teach Themselves to Use Tools |
| Calibration | Just Ask for Calibration |
| Process rewards | Let's Verify Step by Step |
| Benchmark | Large Language Model Prediction Capabilities |
| Process rewards | Math-Shepherd |
| RL | DeepSeekMath and GRPO |
| Human + AI | AI-Augmented Predictions |
| System | Approaching Human-Level Forecasting with Language Models |
| Ensemble | Wisdom of the Silicon Crowd |
| Leakage | Freshbench / Is Your LLM Outdated? |
| Prompting | Can Language Models Use Forecasting Strategies? |
| Agents | MIRAI: Evaluating LLM Agents for Event Forecasting |
| Dynamic eval | ForecastBench |
| Benchmark | Navigating Tomorrow |
| RL | DeepSeek-R1 |
| Training | LLMs Can Teach Themselves to Better Predict the Future |
| Search RL | Search-R1 |
| RL | DAPO |
| Training | Outcome-Based RL to Predict the Future |
| Leakage | ExAnte |
| Evaluation | Pitfalls in Evaluating LM Forecasters |
| Prompting | Prompt Engineering LLM Forecasting Capabilities |
| Backtesting | Bench to the Future |
| Calibration RL | Beyond Binary Rewards / RLCR |
| Training | Advancing Event Forecasting through Massive Training of LLMs |
| Live eval | FutureX |
| Calibration | ConfTuner |
| Markets | Prophet Arena |
| Markets | KalshiBench |
| Training | OpenForesight / OpenForecaster |
| Leakage | All Leaks Count / Shapley-DCLR |
| Markets | PolySwarm |
| Markets | TimeSeek |
| Markets | Prediction Arena |
| Markets | PolyBench |
| Agent | Agentic Forecasting using Sequential Bayesian Updating |
| Backtesting | BTF-2 |
| Leakage | Teaching LLMs When Not to Know |
| Markets | Multi-Agent AI Oracles for Prediction Market Resolution |
| Calibration | LLMs Are Overconfident |
| Markets | Evidence Markets |
| Markets | Prediction Market Accuracy in the Long Run |
| Scoring | Strictly Proper Scoring Rules, Prediction, and Estimation |
| Markets | Logarithmic Market Scoring Rules |
| Open-ended | OpenForecast |
| Markets | Prediction Markets for Economic Forecasting |