Training LLMs for Event Forecasting and Prediction Markets

Deep literature review, PDF library, and step-by-step research plan. Updated July 1, 2026. This page focuses on judgmental event forecasting for important events that can be traded in prediction markets.

Thesis

The literature has moved from "can an LLM answer forecasting questions?" to "can an agentic system, with timestamped retrieval, calibrated probabilities, and market-aware evaluation, add information beyond human crowds and prediction-market prices?" The answer is mixed. LLM systems now approach strong human aggregates in some benchmark settings, but prediction-market studies show that Brier score, calibration, and tradable profit are different targets.

Project recommendation: do not begin with a 30B-scale training run. First build the market data, leakage audit, evaluation, and baseline stack. Then reproduce an 8B OpenForesight-style training recipe. Train larger models only after small systems beat market prices, cheap ensembles, and OpenForecaster-8B on prospective questions after fees, spreads, slippage, and liquidity.

Compute Reality

OpenForesight SFT, Qwen3-8B: 40 H100-hr
OpenForesight final RL: ~1,000 H100-hr
OpenForesight ablations: ~20,000 H100-hr
Outcome-RL 14B setup: 8 H100 x ~3 days

The final run is not the budget. The ablation loop is the budget.

Market Reality

  • KalshiBench: frontier models are systematically overconfident.
  • Prediction Arena: live Kalshi agents lost money in the reported cohort.
  • PolyBench: only 2 of 7 models had positive simulated order-book returns.
  • TimeSeek: models are most useful early and on uncertain markets.

Failure Modes

  • Retrieval leakage and bad source timestamps.
  • Parametric memory of post-cutoff outcomes.
  • Outcome rewards reinforcing lucky reasoning.
  • Prompting that increases confidence without accuracy.
  • P&L overfitting to fees, depth, and market selection.

Best Public Starting Point

OpenForecaster-8B is the most relevant open model. It is Qwen3-8B post-trained on OpenForesight with GRPO and Accuracy+Brier reward. UniPat EchoZ is useful as a design reference, but I did not find public Hugging Face EchoZ weights.

What The Literature Has Done

1. Static and Dynamic Forecasting Benchmarks

Autocast established the original problem: real forecasting questions, a dated news corpus, and human crowd baselines. The key result was negative for LLMs: the best ML baseline was far below expert human aggregates, though larger models and retrieval helped. Autocast also set the methodological norm that backtests must simulate what was knowable at the forecast date.

Schoenegger and Park tested GPT-4 in a live Metaculus tournament and found it underperformed the median human crowd. Halawi et al. then showed the first serious system improvement: retrieval, reasoning, and aggregation brought an LM system close to the human crowd. Their data scale also matters: 48,754 questions and over 7 million user forecasts, with a human crowd Brier around 0.149 and the system around 0.179 on all questions; simple system+crowd aggregation was even stronger.
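For reference, the Brier score quoted above is the mean squared error between forecast probabilities and binary outcomes, so lower is better and an always-0.5 forecaster scores 0.25. A minimal sketch; the function name and the sample numbers are illustrative and are not data from Halawi et al.:

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Illustrative forecasts only (not taken from any paper).
probs = [0.9, 0.2, 0.6, 0.5]
outcomes = [1, 0, 0, 1]
print(round(brier_score(probs, outcomes), 4))
```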

ForecastBench moves evaluation from static backtests to dynamic unresolved questions. This is the right direction because any resolved test set becomes contaminated as model cutoffs advance. FutureX, BTF/BTF-2, MIRAI, and Prophet Arena extend this line by adding live updates, frozen corpora, agentic tool environments, and multi-horizon market events.

Benchmark | Main contribution | Implication for us
Autocast | 6,707 tournament questions plus dated news corpus. | Timestamped retrieval is mandatory.
Halawi et al. | RAG + reasoning + aggregation approaches human crowd. | System design beats raw prompting.
ForecastBench | Dynamic 1,000-question benchmark from future events. | Final claims need prospective tests.
OpenForecast | 43,417 complex events and 473,155 atomic events for open-ended forecasting. | Open-ended events require semantic evaluation, not just binary labels.
OpenForesight | 52,183 generated forecasting questions and a trained 8B model. | Most practical open recipe for post-training.

2. Forecasting Agents and Retrieval

The agent literature converges on a simple pattern: search, reason, update, aggregate. ReAct and Toolformer provide the general tool-use template. Search-R1 shows how to train search use directly with RL and, importantly, masks retrieved tokens so the policy loss is computed only on model-generated tokens. That matters for any forecasting model trained with retrieved context.
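The masking idea can be sketched in a few lines. This is an illustration of the principle only: it uses a plain negative log-likelihood to stay small, whereas Search-R1 applies the mask inside an RL objective, and the function and argument names here are hypothetical.

```python
def masked_nll(token_logprobs, is_model_token):
    """Average negative log-likelihood over model-generated tokens only.

    token_logprobs: log-probability the policy assigns to each token.
    is_model_token: True for tokens the model generated, False for tokens
    copied in from retrieved documents (excluded from the loss).
    """
    kept = [-lp for lp, keep in zip(token_logprobs, is_model_token) if keep]
    if not kept:
        return 0.0  # nothing to train on in this sequence
    return sum(kept) / len(kept)


# Tokens 1 and 2 stand in for a retrieved passage and are masked out.
logprobs = [-0.1, -2.0, -3.0, -0.3]
mask = [True, False, False, True]
print(masked_nll(logprobs, mask))
```

Without the mask, the two retrieved tokens would dominate the loss even though the policy never chose them.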

The newest systems go beyond appending search results. Bayesian Linguistic Forecaster maintains a structured belief state with probability, evidence for/against, and open questions, runs multiple independent trials, aggregates in logit space, and calibrates with hierarchical Platt scaling. TimeSeek shows search is helpful on average but not uniformly: web search improved the pooled Brier Skill Score of every model, but hurt in 6 of 50 model-checkpoint conditions. This argues for a selective search/defer/predict gate, not uniform tool use.
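Logit-space aggregation and Platt-style recalibration are both short to write down. The sketch below is a simplification: it uses a single scaler rather than the hierarchical version BLF describes, and the parameters a and b are placeholders that would have to be fit on held-out resolved questions.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1 to stay finite."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def aggregate_logit(probs):
    """Average independent trial forecasts in logit space."""
    return sigmoid(sum(logit(p) for p in probs) / len(probs))


def platt(p, a=1.0, b=0.0):
    """Platt-style recalibration: sigmoid(a * logit(p) + b).

    a < 1 pulls forecasts toward 0.5, which counteracts overconfidence.
    """
    return sigmoid(a * logit(p) + b)


trials = [0.55, 0.70, 0.62]  # illustrative per-trial probabilities
print(platt(aggregate_logit(trials), a=0.8, b=0.0))
```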

3. Training LLMs To Forecast

The training literature has three useful recipes and one caution.

The caution is that outcome rewards do not prove the model learned good reasoning. A well-reasoned miss on a tail event still receives a poor outcome reward, and a lucky guess still receives a good one. Any training result must be paired with calibration, reasoning-quality, and leakage audits.
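A small numeric illustration of the caution, assuming a reward of one minus the squared error (a hypothetical choice made only for this example). On a tail event with a true probability of 0.1, the careful forecaster earns the higher reward in expectation, yet loses badly on the single draw where the event happens.

```python
def brier_reward(p, outcome):
    """Hypothetical outcome reward: 1 minus squared error."""
    return 1.0 - (p - outcome) ** 2


def expected_reward(p, true_prob):
    """Reward averaged over the two possible outcomes."""
    return (true_prob * brier_reward(p, 1)
            + (1 - true_prob) * brier_reward(p, 0))


true_prob = 0.1
careful, lucky = 0.1, 0.9

# Single realised draw where the tail event occurs.
print(brier_reward(careful, 1), brier_reward(lucky, 1))
# Expectation over many such questions.
print(expected_reward(careful, true_prob), expected_reward(lucky, true_prob))
```

Per-sample rewards therefore carry a lot of noise, which is one reason the audits listed above are needed alongside the headline score.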

4. Calibration and Proper Scoring

Forecasting is probability elicitation. Gneiting and Raftery formalize why strictly proper scoring rules incentivize honest probabilities. Hanson connects scoring rules to markets: market scoring rules elicit group consensus, and LMSR gives the bridge between probabilistic forecasts and market prices. LLM-specific calibration papers show that models can verbalize confidence, but scaling and reasoning do not automatically produce calibrated probabilities. KalshiBench is the clearest recent warning: five frontier models were overconfident, and only Claude Opus 4.5 had positive Brier Skill Score in that sample.
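To make the LMSR bridge concrete, here is a minimal sketch of Hanson's cost function C(q) = b * log(sum_i exp(q_i / b)) and the prices it implies. The liquidity parameter b and the share quantities are illustrative, and the order-book venues discussed later do not price this way.

```python
import math


def lmsr_cost(quantities, b):
    """LMSR cost function C(q) = b * log(sum_i exp(q_i / b))."""
    m = max(q / b for q in quantities)  # shift by the max for stability
    return b * (m + math.log(sum(math.exp(q / b - m) for q in quantities)))


def lmsr_prices(quantities, b):
    """Instantaneous prices: a softmax of q / b, summing to 1."""
    m = max(q / b for q in quantities)
    weights = [math.exp(q / b - m) for q in quantities]
    total = sum(weights)
    return [w / total for w in weights]


def trade_cost(quantities, outcome_index, shares, b):
    """What a trader pays to buy `shares` of one outcome."""
    after = list(quantities)
    after[outcome_index] += shares
    return lmsr_cost(after, b) - lmsr_cost(quantities, b)


q = [0.0, 0.0]  # outstanding shares for YES and NO
print(lmsr_prices(q, b=100.0))
print(trade_cost(q, 0, 10.0, b=100.0))
```

Prices behave like probabilities, and a trader who believes the true probability differs from the current price is paid in expectation for moving the price toward their belief.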

5. Temporal Leakage

Leakage is not a side issue. It is the credibility issue. Autocast and Halawi et al. use dated retrieval. ForecastBench avoids leakage with unresolved questions. Pitfalls in Evaluating LM Forecasters argues that many claims still overstate real-world performance. Shapley-DCLR and TimeSPEC offer the most useful recent framework: decompose rationales into atomic claims, determine temporal provenance, and weight leaked claims by decision impact. Our project should implement a lighter version first and a full claim-level audit for paper-grade results.
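One possible shape for the lighter version is an impact-weighted share of claims whose evidence postdates the forecast. This is not the published Shapley-DCLR computation: the impact weights are hypothetical inputs that we would have to supply ourselves, and the field names are assumptions.

```python
from datetime import datetime, timezone


def leakage_score(claims, forecast_time):
    """Impact-weighted share of claims with post-forecast evidence.

    Each claim is a dict with 'evidence_time' (datetime, or None when
    provenance is unknown) and 'impact' (non-negative weight). Unknown
    provenance is counted as leaked, a deliberately conservative choice.
    """
    total = sum(c["impact"] for c in claims)
    if total == 0:
        return 0.0
    leaked = sum(
        c["impact"]
        for c in claims
        if c["evidence_time"] is None or c["evidence_time"] >= forecast_time
    )
    return leaked / total


forecast_time = datetime(2025, 6, 1, tzinfo=timezone.utc)
claims = [
    {"evidence_time": datetime(2025, 5, 20, tzinfo=timezone.utc), "impact": 3.0},
    {"evidence_time": datetime(2025, 6, 3, tzinfo=timezone.utc), "impact": 1.0},
]
print(leakage_score(claims, forecast_time))
```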

Prediction Markets: What Changes When Events Are Tradable

Prediction-market forecasting is not just another Brier benchmark. A market event has prices, bid/ask spreads, depth, fees, market makers, resolution rules, and trader selection. A model can improve Brier and still lose money if it trades too late, crosses wide spreads, overbets low-liquidity contracts, or follows public prices without adding information.
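The gap between forecast quality and tradable edge shows up in simple arithmetic. The sketch below assumes $1-payout binary contracts, a flat per-contract fee (real venue fee schedules differ), and no depth or slippage effects. It also assumes the NO side can be bought at one minus the YES bid.

```python
def expected_edge(p_model, bid, ask, fee_per_contract=0.0):
    """Expected profit per contract if p_model were the true probability.

    Buying YES costs `ask`; buying NO is assumed to cost `1 - bid`.
    Returns (side, edge), or (None, 0.0) when neither side clears costs.
    """
    buy_yes = p_model - ask - fee_per_contract
    buy_no = bid - p_model - fee_per_contract
    side, edge = max((("YES", buy_yes), ("NO", buy_no)), key=lambda t: t[1])
    return (side, edge) if edge > 0 else (None, 0.0)


# The model is 3 points above the 0.52 midpoint, but the spread eats the edge.
print(expected_edge(0.55, bid=0.48, ask=0.56, fee_per_contract=0.01))
# A larger disagreement survives spread and fee.
print(expected_edge(0.70, bid=0.48, ask=0.56, fee_per_contract=0.01))
```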

Paper | What it did | Lesson
KalshiBench | 300 temporally filtered Kalshi questions, five frontier models. | Overconfidence is widespread; calibration is a distinct capability.
Prophet Arena | 1,367 resolved events and 72,136 markets with common contexts. | Evaluate Brier, calibration, and economic value together.
Prediction Arena | Autonomous agents trading with real capital on Kalshi and Polymarket. | Live trading exposes model, execution, and venue weaknesses.
PolyBench | 38,666 Polymarket markets with CLOB snapshots and news. | Order-book simulation and slippage are core evaluation components.
TimeSeek | 150 Kalshi markets at five lifecycle checkpoints, with/without search. | LLMs add the most value early and in high-uncertainty regimes.
PolySwarm / Evidence Markets | Multi-agent trading and limits of evidence aggregation. | Reflexivity, manipulation, and resolution ambiguity are open problems.

Strategic implication: the differentiated project is not simply "train a better model." It is to measure when a model adds information beyond price, when it should defer to the market, and whether that edge survives realistic execution.
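One way to quantify "information beyond price" is a Brier Skill Score that uses the market price as the reference forecast. A minimal sketch with illustrative numbers; the function names are our own:

```python
def brier(probs, outcomes):
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def brier_skill_score(model_probs, market_prices, outcomes):
    """1 - BS_model / BS_market.

    Positive: the model beat the market price. Zero: no improvement.
    Negative: deferring to the price would have been better.
    """
    reference = brier(market_prices, outcomes)
    if reference == 0:
        raise ValueError("market Brier is zero; skill score is undefined")
    return 1.0 - brier(model_probs, outcomes) / reference


market = [0.6, 0.4]   # illustrative mid prices at forecast time
model = [0.8, 0.2]    # illustrative model probabilities
resolved = [1, 0]
print(brier_skill_score(model, market, resolved))
```

Computing this per horizon, category, and liquidity bucket would show where deference to the market is the better policy.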

Training Strategy

Training is feasible at 8B scale, but the decision should be gated by baselines. The right default is to build an agentic forecaster and evaluator first, then train only where the evaluator shows model-specific errors that post-training can plausibly fix.

Option | Pros | Cons | When to use
No training: agent + ensemble | Fast, cheap, strong baseline; can use BLF-style belief states and calibration. | Depends on API models; not a proprietary asset. | First milestone.
LoRA/SFT on 8B | Cheap and stable; teaches format, base-rate discipline, and calibration language. | Limited true forecasting gain if data are weak. | After evaluation stack exists.
8B GRPO/ReMax RL | Closest to OpenForesight and outcome-RL recipes. | ~1,000 H100-hr final run; reward and leakage risks. | After baselines show clear training target.
Market-specific RL | Targets tradable events directly. | P&L reward is noisy and overfit-prone; market impact issues. | Only after probability calibration works.
30B-A3B or larger | Potentially stronger reasoning and retrieval synthesis. | Multi-node complexity and six-figure ablation budget. | Only if 8B beats market/ensemble baselines cleanly.

What We Should Do, Step By Step

Step 0
Define the exact target.
Tradable binary/multi-outcome events on Kalshi and Polymarket, plus open-ended event questions for broader generalization.
Done when: we have a written schema for question, forecast date, resolution criteria, source cutoff, market snapshot, outcome, and execution assumptions.
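A possible starting point for that schema, written as frozen dataclasses. All field names and the two validation rules are assumptions for discussion, not a settled design.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class MarketSnapshot:
    venue: str
    market_id: str
    observed_at: datetime
    best_bid: float
    best_ask: float


@dataclass(frozen=True)
class ForecastRecord:
    question: str
    forecast_date: datetime
    resolution_criteria: str
    source_cutoff: datetime
    market_snapshot: MarketSnapshot
    outcome: Optional[int] = None   # None until the event resolves
    fee_per_contract: float = 0.0   # execution assumption

    def __post_init__(self):
        # Reject records that could see information from after the forecast.
        if self.source_cutoff > self.forecast_date:
            raise ValueError("source cutoff is after the forecast date")
        if self.market_snapshot.observed_at > self.forecast_date:
            raise ValueError("market snapshot postdates the forecast")


t = datetime(2025, 6, 1, tzinfo=timezone.utc)
snap = MarketSnapshot("kalshi", "EXAMPLE-ID", t, best_bid=0.48, best_ask=0.56)
record = ForecastRecord("Example question?", t, "Resolves YES if ...", t, snap)
print(record.outcome)
```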
Step 1
Build the market data repository.
Store market prices, bid/ask, CLOB depth, volume, fees, resolution text, close/resolution dates, and snapshots. Do not rely on final market pages.
Done when: every forecast can be replayed from a point-in-time state.
Step 2
Build the timestamped retrieval corpus.
Use static snapshots of news, official releases, SEC/EDGAR, FRED/BLS/BEA, sports APIs, and market pages. Hash documents and retain publication/retrieval timestamps.
Done when: a forecast on date T cannot retrieve documents after T.
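The hashing and cutoff rules can be sketched with the standard library alone. The record layout is an assumption; the one firm rule is that a forecast at time T only sees documents published strictly before T.

```python
import hashlib
from datetime import datetime, timezone


def make_record(text, published_at, retrieved_at):
    """Store the content hash alongside both timestamps."""
    return {
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "published_at": published_at,
        "retrieved_at": retrieved_at,
        "text": text,
    }


def verify(record):
    """True if the stored text still matches its recorded hash."""
    digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
    return digest == record["sha256"]


def retrievable(records, forecast_time):
    """Documents a forecast made at forecast_time is allowed to see."""
    return [r for r in records if r["published_at"] < forecast_time]


utc = timezone.utc
corpus = [
    make_record("early report", datetime(2025, 5, 1, tzinfo=utc),
                datetime(2025, 7, 1, tzinfo=utc)),
    make_record("late report", datetime(2025, 6, 10, tzinfo=utc),
                datetime(2025, 7, 1, tzinfo=utc)),
]
visible = retrievable(corpus, datetime(2025, 6, 1, tzinfo=utc))
print([r["text"] for r in visible])
```

Filtering on publication time rather than retrieval time matters because a snapshot crawled later can still contain a document that was knowable at T.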
Step 3
Implement evaluation before modeling.
Brier, log score, calibration curves, ECE, market-price Brier Skill Score, P&L with fees/spreads/slippage, drawdown, and coverage-risk curves.
Done when: market price and naive baselines can be reproduced automatically.
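Of the metrics listed, ECE is the one most often implemented inconsistently, so a reference version helps. This sketch uses equal-width bins on the predicted probability of the positive class; the bin count is a free choice and other binning schemes exist.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Equal-width-bin ECE for binary forecasts.

    Each bin contributes |mean forecast - observed frequency|, weighted
    by the share of forecasts that fall in it.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 joins the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(mean_p - freq)
    return ece


# Overconfident: says 0.9 four times, right only twice.
print(expected_calibration_error([0.9] * 4, [1, 0, 1, 0]))
```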
Step 4
Run no-training baselines.
Market price, market price plus calibration transforms, OpenForecaster-8B, frontier single models, frontier ensembles, and BLF-style agentic search with logit aggregation.
Done when: we know where models add value beyond price and where they should defer.
Step 5
Add leakage audit.
Start with source cutoff checks, answer alias filtering, and post-cutoff phrase detection. Add claim-level Shapley-DCLR-style audit for final experiments.
Done when: each result has a leakage score or clean prospective design.
Step 6
Train small.
Reproduce OpenForesight/OpenForecaster-style training on Qwen3-8B with structured probability outputs, random retrieved chunks, Accuracy+Brier reward, no std-normalized advantages, and strict formatting guardrails.
Done when: the model beats base Qwen3-8B, OpenForecaster-8B, and ensemble baselines on fresh held-out questions.
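Two pieces of this recipe are easy to pin down in code. The reward below is one plausible reading of "Accuracy+Brier" and is an assumption, not the published OpenForesight formula. The advantage function shows what dropping std-normalization means: rewards are centred on the group mean but not divided by the group standard deviation.

```python
def accuracy_brier_reward(is_correct, stated_prob):
    """Hypothetical combined reward: accuracy term plus a Brier-style term.

    stated_prob is the probability the model assigned to its own answer.
    """
    y = 1.0 if is_correct else 0.0
    return y + (1.0 - (stated_prob - y) ** 2)


def group_advantages(rewards):
    """Mean-centred advantages for one prompt's group of rollouts."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Four illustrative rollouts for the same question.
rollouts = [(True, 0.9), (True, 0.6), (False, 0.9), (False, 0.3)]
rewards = [accuracy_brier_reward(c, p) for c, p in rollouts]
print(group_advantages(rewards))
```

Under this reward, a confident wrong answer is penalised more than a hedged wrong answer, which is the behaviour the calibration goal asks for.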
Step 7
Train market-specific calibration, not direct trading first.
Use Brier/log rewards and calibration losses on resolved tradable events. Treat P&L as a holdout selection metric, not the first reward.
Done when: gains survive fees, spreads, liquidity, and out-of-time markets.
Step 8
Decide on large-model training.
Only scale to 30B-A3B if 8B results show robust leakage-clean alpha or if the scientific contribution requires open weights at larger scale.
Done when: expected value of compute exceeds cheap ensemble and agent alternatives.

Research Questions Worth Writing A Paper Around

  1. Calibration to profit: characterize when a calibrated probability forecaster produces positive expected return under CLOB fees, spreads, and depth.
  2. Selective deference: learn a gate that chooses market price, model, search-agent, or abstain by horizon, liquidity, category, and uncertainty.
  3. Leakage-robust backtesting: adapt Shapley-DCLR to prediction-market rationales and open-ended forecasts.
  4. Cross-venue price discovery: measure Kalshi vs Polymarket lead-lag and whether LLMs exploit or merely follow it.
  5. Reflexivity: test whether public AI forecasts alter prices and degrade future training labels.
  6. Aggregation vs training: compare OpenForesight-style training to a cost-matched ensemble of frontier models.

Hosted PDF Library

I downloaded and uploaded the open PDFs used for this review. Most are arXiv papers; the non-arXiv files are public PDFs from ACL, author/institution pages, or NBER. If a paper is not openly mirrorable, it should be linked rather than rehosted.

Area | Paper | PDF
Benchmark | Autocast: Forecasting Future World Events with Neural Networks | PDF
Calibration | Language Models (Mostly) Know What They Know | PDF
Agents | ReAct: Synergizing Reasoning and Acting | PDF
Process rewards | Solving Math Word Problems with Process- and Outcome-Based Feedback | PDF
Tools | Toolformer: Language Models Can Teach Themselves to Use Tools | PDF
Calibration | Just Ask for Calibration | PDF
Process rewards | Let's Verify Step by Step | PDF
Benchmark | Large Language Model Prediction Capabilities | PDF
Process rewards | Math-Shepherd | PDF
RL | DeepSeekMath and GRPO | PDF
Human + AI | AI-Augmented Predictions | PDF
System | Approaching Human-Level Forecasting with Language Models | PDF
Ensemble | Wisdom of the Silicon Crowd | PDF
Leakage | Freshbench / Is Your LLM Outdated? | PDF
Prompting | Can Language Models Use Forecasting Strategies? | PDF
Agents | MIRAI: Evaluating LLM Agents for Event Forecasting | PDF
Dynamic eval | ForecastBench | PDF
Benchmark | Navigating Tomorrow | PDF
RL | DeepSeek-R1 | PDF
Training | LLMs Can Teach Themselves to Better Predict the Future | PDF
Search RL | Search-R1 | PDF
RL | DAPO | PDF
Training | Outcome-Based RL to Predict the Future | PDF
Leakage | ExAnte | PDF
Evaluation | Pitfalls in Evaluating LM Forecasters | PDF
Prompting | Prompt Engineering LLM Forecasting Capabilities | PDF
Backtesting | Bench to the Future | PDF
Calibration RL | Beyond Binary Rewards / RLCR | PDF
Training | Advancing Event Forecasting through Massive Training of LLMs | PDF
Live eval | FutureX | PDF
Calibration | ConfTuner | PDF
Markets | Prophet Arena | PDF
Markets | KalshiBench | PDF
Training | OpenForesight / OpenForecaster | PDF
Leakage | All Leaks Count / Shapley-DCLR | PDF
Markets | PolySwarm | PDF
Markets | TimeSeek | PDF
Markets | Prediction Arena | PDF
Markets | PolyBench | PDF
Agent | Agentic Forecasting using Sequential Bayesian Updating | PDF
Backtesting | BTF-2 | PDF
Leakage | Teaching LLMs When Not to Know | PDF
Markets | Multi-Agent AI Oracles for Prediction Market Resolution | PDF
Calibration | LLMs Are Overconfident | PDF
Markets | Evidence Markets | PDF
Markets | Prediction Market Accuracy in the Long Run | PDF
Scoring | Strictly Proper Scoring Rules, Prediction, and Estimation | PDF
Markets | Logarithmic Market Scoring Rules | PDF
Open-ended | OpenForecast | PDF
Markets | Prediction Markets for Economic Forecasting | PDF