Training LLMs for Event Forecasting and Prediction Markets

Deep literature review, PDF library, and step-by-step research plan. Updated July 1, 2026. This page focuses on judgmental event forecasting for important events that can be traded in prediction markets.

Thesis

The literature has moved from "can an LLM answer forecasting questions?" to "can an agentic system, with timestamped retrieval, calibrated probabilities, and market-aware evaluation, add information beyond human crowds and prediction-market prices?" The answer is mixed. LLM systems now approach strong human aggregates in some benchmark settings, but prediction-market studies show that Brier score, calibration, and tradable profit are different targets.

Project recommendation: do not begin with a 30B-scale training run. First build the market data, leakage audit, evaluation, and baseline stack. Then reproduce an 8B OpenForesight-style training recipe. Train larger models only after small systems beat market prices, cheap ensembles, and OpenForecaster-8B on prospective questions after fees, spreads, slippage, and liquidity.

Compute Reality

OpenForesight SFT, Qwen3-8B: 40 H100-hr
OpenForesight final RL: ~1,000 H100-hr
OpenForesight ablations: ~20,000 H100-hr
Outcome-RL 14B setup: 8 H100 x ~3 days

The final run is not the budget. The ablation loop is the budget.

Market Reality

  • KalshiBench: frontier models are systematically overconfident.
  • Prediction Arena: live Kalshi agents lost money in the reported cohort.
  • PolyBench: only 2 of 7 models had positive simulated order-book returns.
  • TimeSeek: models are most useful early and on uncertain markets.

Failure Modes

  • Retrieval leakage and bad source timestamps.
  • Parametric memory of post-cutoff outcomes.
  • Outcome rewards reinforcing lucky reasoning.
  • Rubric rewards becoming circular outcome proxies.
  • Prompting that increases confidence without accuracy.
  • P&L overfitting to fees, depth, and market selection.

Best Public Starting Point

OpenForecaster-8B is the most relevant open model. It is Qwen3-8B post-trained on OpenForesight with GRPO and Accuracy+Brier reward. UniPat EchoZ is useful as a design reference, but I did not find public Hugging Face EchoZ weights.

Dataset Repository Status

I built a local dataset catalog and downloader, then pulled the manageable public forecasting and prediction-market datasets onto the server. Raw datasets are stored locally under data/raw/; the website hosts manifests and schema profiles, not the raw multi-gigabyte data.

Downloaded

HF datasets downloaded: 28
GitHub repos cloned: 5
Local footprint: 8.31 GiB
Failed downloads: 0

High-Value Sources Now Local

OpenForesight, ForecastBench, FutureX, KalshiBench v1/v2, Prophet Arena, Autocast, MIRAI, PolyBench, Kalshi trades, Polymarket samples, Metaculus snapshots

Schema signals: KalshiBench has ground_truth and market_probability; OpenForesight has questions, resolution criteria, answers, article timestamps, and generated forecasts; Kalshi/Polymarket mirrors provide prices, trades, bid/ask fields, and order-book-like samples.

Skipped By Policy

  • Polymarket full SII mirror: 159.11 GiB.
  • Polymarket on-chain mirror: 118.45 GiB.
  • Polymarket crypto up/down: 27.08 GiB.
  • Polymarket crypto derivatives: 17.90 GiB and 53k files.
  • OpenForecast: manual Google Drive download.

Artifacts: download inventory, schema profile, download JSON, profile JSON, and pipeline notes.

Server-Side Evaluation Pipeline

The first evaluation package is now in forecast_eval/. It scores binary event forecasts from CSV, JSON, JSONL, or parquet and outputs machine-readable metrics. This is the right first stage: before training, we need stable scoring, market baselines, and point-in-time replay.

Component | Implemented now | Next extension
Proper scoring | Brier score, log score, accuracy at 0.5. | Multiclass Brier/log scoring for Echo, Prophet Arena, and non-binary markets.
Calibration | Calibration bins and expected calibration error. | Reliability curves by horizon, source, liquidity, and model family.
Market comparison | Market-price Brier skill score. | Matched forecast/market snapshots with bid/ask and timestamp alignment.
Trading backtest | One-share edge-triggered P&L against market probability. | Fees, spreads, slippage, depth, partial fills, drawdown, and selective abstention.
Validation | Unit tests passed; sample metrics written to data/eval/. | Dataset adapters for KalshiBench, OpenForesight, ForecastBench, Prophet Arena, and Echo snapshots.

Sample command: python3 -m forecast_eval.cli --predictions examples/predictions_sample.csv --out data/eval/sample_metrics.json --calibration-out data/eval/sample_calibration.csv --bins 5 --edge-threshold 0.05.
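The scoring metrics named in the table can be sketched in a few lines. This is a minimal illustration, not the forecast_eval implementation: the function names, the equal-width binning, and the tie rule at 0.5 are assumptions made here.

```python
import math

def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_score(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood, so lower is better; eps avoids log(0)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

def accuracy_at_half(probs, outcomes):
    """Share of forecasts on the right side of 0.5 (assumption: ties count as 'yes')."""
    return sum((p >= 0.5) == (y == 1) for p, y in zip(probs, outcomes)) / len(probs)

def expected_calibration_error(probs, outcomes, bins=5):
    """Equal-width ECE: size-weighted gap between mean forecast and observed frequency."""
    buckets = [[] for _ in range(bins)]
    for p, y in zip(probs, outcomes):
        buckets[min(int(p * bins), bins - 1)].append((p, y))
    ece = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        freq = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / len(probs) * abs(mean_p - freq)
    return ece

# Illustrative toy data, not taken from any dataset on this page.
probs = [0.9, 0.2, 0.7, 0.4]
outcomes = [1, 0, 0, 1]
print(brier_score(probs, outcomes), accuracy_at_half(probs, outcomes))
```

Brier and log score are proper scoring rules, so they reward honest probabilities; accuracy at 0.5 and ECE are diagnostics and should not be optimized directly.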

GLM-5.2 OpenCode Pilot

I ran zai-coding-plan/glm-5.2 through the OpenCode CLI on a partial set of 480 resolved Kalshi markets. This was a first operational test, not a full leakage-clean benchmark. The model was prompted closed-book, without market prices, and scored against the Kalshi cutoff market price.

Metric | GLM-5.2 | Market baseline | Readout
Brier score | 0.2475 | 0.1931 | GLM worse
Log score | 0.7214 | 0.5725 | GLM worse
Accuracy at 0.5 | 61.46% | 69.79% | GLM worse
ECE | 0.1857 | 0.1656 | GLM worse
Brier skill vs market | -28.15% | baseline | Negative skill

The naive one-share edge backtest showed +17.57 gross units over 406 trades before fees/spreads/slippage, but the proper scores and ECE above show the model was both less accurate and less calibrated than the market. Treat the P&L as a hypothesis to audit, not as evidence of tradable alpha.
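To make the two readouts concrete, here is a hedged sketch of a Brier skill score and a one-share edge-triggered backtest. The trading rule is an assumed reading of "one-share edge-triggered P&L": buy one YES share at the market probability when the model is above it by more than the threshold, buy one NO share when it is below, and ignore fees, spreads, and slippage. The function names are illustrative, not the forecast_eval API.

```python
def brier_skill_score(model_brier, market_brier):
    """Relative improvement over the market baseline; negative means worse than market."""
    return 1.0 - model_brier / market_brier

def one_share_edge_backtest(model_probs, market_probs, outcomes, edge_threshold=0.05):
    """Gross P&L in share units, trading one share whenever |model - market| > threshold."""
    pnl, trades = 0.0, 0
    for p, m, y in zip(model_probs, market_probs, outcomes):
        edge = p - m
        if edge > edge_threshold:        # model thinks YES is underpriced
            pnl += y - m                 # YES share bought at m pays y
            trades += 1
        elif edge < -edge_threshold:     # model thinks YES is overpriced
            pnl += m - y                 # NO share bought at 1 - m pays 1 - y
            trades += 1
    return pnl, trades

# Rounded Brier scores from the pilot table; prints roughly -0.28.
print(round(brier_skill_score(0.2475, 0.1931), 4))

# Illustrative toy markets: two trades fire, the third edge is below threshold.
pnl, trades = one_share_edge_backtest([0.8, 0.3, 0.52], [0.6, 0.5, 0.5], [1, 0, 1])
print(pnl, trades)
```

The skill score computed from the rounded table entries lands near -28%; a small gap from the reported -28.15% would be consistent with rounding of the inputs. Positive gross P&L alongside negative skill is possible because the backtest only counts markets where the model disagrees with the price, which is why it needs an audit before it means anything.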

Report and artifacts: GLM-5.2 Kalshi pilot report, metrics JSON, predictions CSV, and calibration CSV.

Echo / UniPat Deep Dive

Key distinction: UniPat AI is the company. Echo is UniPat AI's prediction/forecasting system. EchoZ-1.0 is the forecasting model reported inside Echo. UniScientist is a different UniPat AI model for scientific research tasks, not the Echo forecasting model.

Key finding: EchoZ-1.0 is visible through the public Echo leaderboard and API, but I did not find public Hugging Face EchoZ weights. The public UniPat Hugging Face model is UniScientist-30B-A3B.

Sources checked: UniPat Echo blog, Echo leaderboard, Echo predictions/questions, QbitAI article, UniPat Hugging Face org, and UniScientist GitHub.

Live Echo Snapshot

Snapshot date: Jul 1 2026
EchoZ rank: #1
EchoZ Elo: 1024.1
EchoZ battles: 83,606
EchoZ resolved cases: 2,396
First prediction date: Mar 4 2026

Market Baseline

Polymarket baseline rank: #3
Market Elo: 1011.0
Market battles: 74,191
Resolved cases: 1,759

The Echo public API says the Market baseline uses Polymarket prices and only includes questions that exist on Polymarket.

What Echo Is Doing

  • Dynamic question synthesis instead of static train-on-past data.
  • Aligned prediction points so models are compared at the same question/time state.
  • Brier-score differences converted into soft pairwise wins for a Bradley-Terry/Elo-style ranking.
  • Rubrics Search: process rubrics are selected by correlation with held-out outcome-based Elo rankings.
  • Map-Reduce agent architecture for parallel evidence gathering and final probability aggregation.

How The Leaderboard Works

Step | Mechanism | Why it matters
Question pool | Prediction-market questions, synthesized trend questions, and expert questions. | Broadens beyond one market/source.
Prediction schedule | Questions are sampled at multiple pre-resolution points. | Measures forecasts across the event lifecycle.
Point alignment | Models are compared only on the same question at the same prediction point. | Reduces unfair timing asymmetry.
Scoring | Brier-style probability scores and log scores are shown on question pages. | Rewards calibrated probabilities, not just top-label accuracy.
Ranking | Brier-score differences become soft pairwise battles, then a Bradley-Terry/Elo-style rating is fit. | Produces a live leaderboard that is less sensitive to missing predictions than average Brier alone.
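The ranking step can be sketched as follows. Echo does not publish its exact mapping, so everything specific here is an assumption: the logistic soft-win function, its scale, the MM-style Bradley-Terry fit, and the 1000-centered Elo conversion are illustrative choices, and the scores are made up.

```python
import math
from itertools import combinations

def soft_win(brier_a, brier_b, scale=10.0):
    """Soft win share for A in (0, 1); lower Brier is better (assumed logistic map)."""
    return 1.0 / (1.0 + math.exp(-scale * (brier_b - brier_a)))

def pairwise_soft_wins(scores):
    """scores[model][point_id] = Brier score; pairs are compared only on shared points."""
    wins = {}
    for a, b in combinations(sorted(scores), 2):
        shared = scores[a].keys() & scores[b].keys()
        wins[(a, b)] = sum(soft_win(scores[a][k], scores[b][k]) for k in shared)
        wins[(b, a)] = len(shared) - wins[(a, b)]
    return wins

def fit_bradley_terry(wins, models, iters=200):
    """MM updates for Bradley-Terry strengths, normalized to geometric mean 1."""
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            won = sum(wins.get((i, j), 0.0) for j in models if j != i)
            denom = sum(
                (wins.get((i, j), 0.0) + wins.get((j, i), 0.0)) / (strength[i] + strength[j])
                for j in models if j != i
            )
            updated[i] = max(won / denom, 1e-12) if denom > 0 else strength[i]
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}
    return strength

def to_elo(strength, base=1000.0):
    """Map strengths to an Elo-like scale centered on `base`."""
    return {m: base + 400.0 * math.log10(s) for m, s in strength.items()}

# Hypothetical point-aligned Brier scores, keyed by question@prediction-point.
scores = {
    "model_a": {"q1@t1": 0.10, "q1@t2": 0.20, "q2@t1": 0.05},
    "model_b": {"q1@t1": 0.20, "q1@t2": 0.25, "q2@t1": 0.30},
    "market":  {"q1@t1": 0.15, "q1@t2": 0.20, "q2@t1": 0.10},
}
models = sorted(scores)
elo = to_elo(fit_bradley_terry(pairwise_soft_wins(scores), models))
print(sorted(elo, key=elo.get, reverse=True))
```

Because battles are counted only on shared prediction points, a model with missing predictions simply enters fewer battles, rather than being penalized or flattered by an average over a different question set.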

UniScientist, the Public Hugging Face Model

Item | Finding | Why it matters
HF org | UnipatAI has one public model and seven public datasets. | No public EchoZ model or Echo dataset found there.
Model | UniScientist-30B-A3B; 31B params, BF16 safetensors, Qwen3 MoE tag, Apache 2.0. | Adjacent rubric/agentic research work, but not event forecasting weights.
Remote size | About 61.08 GB across 25 files. | I downloaded metadata/config/tokenizer files and cloned code, not weights.
Training claim | UniScientist blog reports ~1,200 H200 GPU-hours, 128k context, up to 100 tool calls, 4,700+ rubric-checked scientific instances. | Useful compute reference for rubric-trained agentic models.
Public Echo API | Rankings, question lists, question detail, model detail, and model cases expose probabilities and outcomes. | Enough to benchmark against Echo-style outputs, not enough to reproduce Echo training.

Public Leaderboard Archive

I downloaded the public Echo leaderboard API into a SQLite snapshot and static exports. This includes the active question list plus resolved questions discovered from public model-case pages, then expanded through per-question detail records.

Archive item | Count | What it gives us
Models | 19 | Public model metadata and leaderboard identities.
Unique questions | 3,600 | Active and resolved public questions available through the API.
Model cases | 41,980 | Correct/model answers and correctness by model.
Prediction rows | 462,784 | Timestamped probability vectors expanded to one row per option.
Ranking history | 1,330 rows | Last 10 public ranking-history batches by category.

The archive does not include EchoZ weights, prompts, retrieval logs, reasoning traces, rubric scores, or private training data, so it supports leaderboard analysis but not a full temporal-leakage audit.

Artifacts: deep-dive note, leaderboard archive report, SQLite database, snapshot manifest, rankings CSV, questions CSV, model cases CSV, prediction time series CSV, earlier Echo public API snapshot, and UniScientist HF model metadata.

Date note: the March 2026 Echo blog and QbitAI article reported EchoZ-1.0 at Elo 1034.2. The live public API on July 1, 2026 reports 1024.1, so this page uses the newer snapshot for current status and treats the March number as historical.

QbitAI-Style Synthesis: Three Main Streams

The forecasting literature is best read as three converging streams. The first stream builds leakage-controlled datasets and benchmarks; the second stream tries to reward reasoning quality instead of only final outcomes; the third stream uses societies of models, personas, humans, or markets to improve forecasts without relying on one model's raw answer.

The shared goal is model improvement, but "improvement" does not always mean weight updates. For open models, the path is SFT, DPO, GRPO/ReMax, or rubric-based RL. For closed API models, the path is system-level improvement: better retrieval, frozen point-in-time context, persona/society aggregation, calibration layers, market-price gating, and rubric feedback loops. Those API systems can later generate datasets and rubrics for post-training an open model.

Stream | Core object | How it handles leakage | How it improves the model/system
Dataset Construction | Point-in-time forecasting tasks, frozen corpora, frozen model versions, market snapshots, and resolution criteria. | Backcasting uses only documents available before the simulated forecast date; prospective benchmarks avoid resolved outcomes entirely. | Creates clean train/eval data so gains are not just memorization or retrieval leakage.
Process-Based Rewards | Reasoning traces, step-level validity, rubrics, temporal admissibility, and claim-level leakage weights. | Rewards can penalize leaked claims or down-weight outcomes whose rationale depends on post-cutoff facts. | Gives denser credit assignment than final outcomes, reducing reinforcement of lucky guesses.
Social Forecasting | Human crowds, model ensembles, personas, debate/society systems, market prices, and aggregation rules. | Diversity and aggregation reduce single-model hallucination; point-aligned comparisons keep forecasters in the same information state. | Improves API models without fine-tuning and provides strong baselines any trained model must beat.

What This Implies

Dataset Construction is the foundation. A frozen corpus is not enough by itself; we also need frozen model versions, logged prompts, retrieval snapshots, market prices at forecast time, and clear resolution rules. Otherwise backcasting cannot distinguish forecasting skill from the model remembering or retrieving the answer.

Process-Based Rewards are the central scientific bet. Echo/QbitAI's Train-on-Future framing says the model should be rewarded for forecast reasoning quality, not just whether a stochastic outcome happened to resolve correctly. The right implementation is not rubric-only. It is an ablation: outcome-only vs process-only vs hybrid process+Brier, with a leakage audit of the rationale itself.

Social Forecasting is the strongest near-term API route. Personas, independent model samples, market-price anchors, human forecasts, and calibrated aggregation can improve closed frontier APIs without touching weights. This is not a distraction from training; it is the baseline and data generator for later training.

Scientific Principles Across The Literature

1. How Papers Deal With Temporal Leakage

The formal problem is ex-ante inference: at forecast date t_f, the model may only use information publicly knowable at or before t_f, while the event resolves at t_r > t_f. Leakage occurs when the prompt, retrieval corpus, benchmark, model memory, or evaluation process contains information from after t_f. The literature uses five broad controls.

Leakage control | Scientific idea | Representative papers | Remaining weakness
Prospective evaluation | Ask questions whose outcomes are genuinely unknown at submission time, then wait for resolution. | ForecastBench, FutureX, Metaculus AI Benchmarking, Prediction Arena. | Slow feedback loop; small samples until enough events resolve.
Hermetic pastcasting | Backtest resolved events against a frozen corpus that contains only documents available before the simulated forecast date. | Autocast, Halawi et al., Bench to the Future, BTF-2. | Document metadata can be wrong; the base model may already know later outcomes.
Model-cutoff exploitation | Use events after the base model's training cutoff as "future" labels for training or evaluation. | OpenForesight, self-play/DPO, outcome-RL. | Model cutoff is approximate, model vendors rarely expose full data lineage, and future base models will contaminate old tests.
Source filtering and answer filtering | Remove post-resolution documents, filter aliases of the gold answer, and impose a buffer such as one month before resolution. | OpenForesight, KalshiBench, TimeSeek. | String filters miss indirect clues and cannot remove parametric-memory leakage.
Claim-level audit | Decompose the rationale into atomic claims, verify each claim's earliest public date, and weight leaked claims by decision impact. | Shapley-DCLR / TimeSPEC, ExAnte, Teaching LLMs When Not to Know. | Expensive; requires reliable source dating and claim verification.
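The simplest of these controls, a source cutoff with a pre-resolution buffer, can be sketched as a filter over dated documents. The Document type, the field names, and the choice to reject undated documents are assumptions for illustration; the dates are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass(frozen=True)
class Document:
    doc_id: str
    published_at: Optional[datetime]  # None when the source date is unknown

def admissible(doc, forecast_time, resolution_time=None, buffer=timedelta(0)):
    """True only if the document was public at t_f and outside the pre-resolution buffer."""
    if doc.published_at is None:
        return False  # assumption: undated sources are treated as potential leakage
    if doc.published_at > forecast_time:
        return False  # published after the forecast date t_f
    if resolution_time is not None and doc.published_at > resolution_time - buffer:
        return False  # too close to resolution t_r
    return True

docs = [
    Document("a", datetime(2026, 2, 10)),
    Document("b", datetime(2026, 2, 25)),  # before t_f but inside the 30-day buffer
    Document("c", datetime(2026, 3, 5)),   # after t_f
    Document("d", None),                   # unknown publication date
]
t_f = datetime(2026, 3, 1)
t_r = datetime(2026, 3, 20)
kept = [d.doc_id for d in docs if admissible(d, t_f, t_r, buffer=timedelta(days=30))]
print(kept)
```

A filter like this only addresses retrieval leakage. It does nothing about wrong metadata or parametric memory, which is why the table lists claim-level audits as a separate control.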

The important scientific shift is from binary dataset hygiene to causal attribution: not just "did any post-cutoff text appear?" but "did post-cutoff information materially drive the probability?" That is why Shapley-DCLR is conceptually important. It treats leakage as decision-critical information contamination, not merely a dataset property.

2. How Papers Handle Forecasting Agents Given What The LLM Already Knows

Most systems treat the base LLM as a noisy prior: it has broad parametric knowledge up to an uncertain cutoff, but it is stale, overconfident, and may contain forbidden post-cutoff facts in backtests. The agent layer is designed to discipline that prior using timestamped evidence, repeated sampling, aggregation, and calibration.

Agent mechanism | Principle | Used by | What it fixes
Dated retrieval | Condition the model on external evidence available at the forecast date, instead of relying only on parametric memory. | Autocast, Halawi et al., OpenForesight, TimeSeek. | Stale model knowledge and missing current events.
Search as an action | Let the model choose queries and iterate between reasoning and evidence acquisition. | ReAct, Search-R1, MIRAI, BLF-style agents. | One-shot retrieval misses relevant evidence and cannot adapt to uncertainty.
Retrieved-token masking | During RL, optimize the model's generated reasoning/search tokens, not copied retrieved text. | Search-R1 and search-RL descendants. | Prevents the training loss from treating retrieved passages as model behavior.
Belief state | Maintain a structured state with probability, evidence for/against, open questions, and update history. | Bayesian Linguistic Forecaster. | Avoids unbounded context stuffing and makes evidence updates auditable.
Multi-sample aggregation | Run multiple independent forecasts and average in probability or logit space. | Halawi et al., Silicon Crowd, BLF, many benchmark agents. | Reduces variance and idiosyncratic model errors.
Calibration layer | Map raw model probabilities to calibrated probabilities with Platt scaling, Brier training, or post-hoc reliability correction. | BLF, RLCR, ConfTuner, KalshiBench analysis. | Overconfidence and probability misreporting.
Selective deference | Choose whether to use the model, search, market price, ensemble, or abstain depending on regime. | TimeSeek motivates this; ForecastBench and market papers imply it. | Search and models are not uniformly useful across time, category, or market certainty.

The key design principle is separation of roles. The LLM should not be trusted as both evidence store and judge. It should generate hypotheses, queries, and probability estimates, while the system controls timestamped evidence, market snapshots, calibration, aggregation, and leakage audit.
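Two of the system-side mechanisms, multi-sample aggregation in logit space and a Platt-style calibration layer, are small enough to sketch. The parameters a and b below are hypothetical; in practice they would be fit on held-out resolved questions, and applying the affine map to the logit of a probability is one common variant rather than the method of any specific paper cited here.

```python
import math

def logit(p, eps=1e-6):
    """Log-odds, with clipping so 0 and 1 do not produce infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def aggregate_logit_mean(probs):
    """Average independent forecasts in log-odds space, then map back to a probability."""
    return sigmoid(sum(logit(p) for p in probs) / len(probs))

def platt_calibrate(p, a, b):
    """Affine map in logit space; a < 1 shrinks overconfident forecasts toward 0.5."""
    return sigmoid(a * logit(p) + b)

samples = [0.9, 0.1]                 # two symmetric, disagreeing samples
print(aggregate_logit_mean(samples))

# With a = 0.5 and b = 0, a raw 0.9 is pulled in to 0.75.
print(round(platt_calibrate(0.9, a=0.5, b=0.0), 4))
```

Keeping these steps outside the model is one way to implement the separation of roles: the LLM proposes probabilities, while aggregation and calibration stay deterministic, logged, and auditable.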

3. Core Issues And How People Are Solving Them

Core issue | Why it matters | Solutions across papers | What we should do
Labels live in the future | True online learning is slow because outcomes resolve later. | Pastcasting on frozen corpora; model-cutoff exploitation; dynamic prospective benchmarks. | Use pastcasting for iteration, but reserve final claims for prospective market splits.
Temporal leakage | Backtests can reward memorization instead of forecasting. | Dated retrieval, cutoff filtering, alias filtering, unresolved questions, Shapley-DCLR audits. | Implement source cutoff checks first; add claim-level audit for publication-grade results.
Overconfidence | Prediction markets punish miscalibration and overbetting. | Brier/log scores, RLCR, ConfTuner, Platt scaling, reliability diagrams. | Require structured probabilities and report calibration before P&L.
Outcome reward noise | Lucky guesses can look better than good reasoning on tail events. | Accuracy+Brier rewards, self-play pair ranking, process/rubric rewards, guardrails. | Use Brier-based rewards but evaluate reasoning quality and tail-event behavior separately.
Retrieval is useful but risky | Search can add current evidence or introduce noise/leakage. | Offline corpora, retrieved-token masking, date-filtered web search, selective tool use. | Train/evaluate both with and without search, then learn a gating policy.
Market price is a strong baseline | Profit requires beating the crowd, not merely forecasting decently. | Brier Skill Score vs market, CLOB simulation, live trading, market-price anchors. | Always compare to price and price transforms; evaluate after fees and liquidity.
Training may not beat ensembling | Silicon-crowd aggregation is cheap and strong. | LLM ensembles, human+machine averages, logit aggregation, hierarchical calibration. | Make cost-matched ensemble the baseline before any large RL run.
Open-ended answers are hard to score | Important events are not always binary markets. | LLM semantic equivalence grading, LRAE, answer-type filtering, resolution criteria. | Use binary markets for first training/evaluation; add open-ended questions after grading is reliable.

Reasoning-Process Rewards, Rubric Rewards, and Echo Clarification

This is the distinction that matters for Echo/EchoZ. Most forecasting papers train on the realized outcome: accuracy, Brier score, log score, or distance to the resolved answer. Echo's public claim is different: it tries to make the reasoning trajectory itself part of the training signal through automated rubric search. That is closer to the process-supervision literature than to ordinary outcome-RL, but it is not free of outcome dependence.

1. Are there papers like this?

Yes, but they fall into two buckets. The first bucket is forecasting-specific: Echo/UniPat is the clearest public example of rubric-scored forecasting reasoning, and FutureWorld explicitly discusses Echo as a live predictive-agent system using rubric-based process rewards. The second bucket is general reasoning: math, science, medicine, instruction following, and open-ended tasks where researchers reward intermediate steps, checklists, or rubrics rather than only final correctness.

Line of work | Representative papers or systems | What they do | How close to prediction markets?
Forecasting-specific rubric/process reward | UniPat Echo / EchoZ, QbitAI/WeChat article, FutureWorld discussion of UniPat | Score forecast reasoning trajectories with domain rubrics; Echo searches rubrics whose model rankings agree with outcome-based Elo. | Directly relevant, but Echo is a company blog/system claim rather than peer-reviewed evidence.
Forecasting outcome RL | Self-play/DPO, Outcome-Based RL, OpenForesight, FutureWorld | Generate reasoning and forecasts, then train on realized outcomes using Brier-style or preference rewards. | Directly relevant, but can reward lucky reasoning unless audited.
Step-level process supervision | Uesato et al.; Let's Verify Step by Step; Math-Shepherd | Judge whether each intermediate reasoning step is valid, either with human labels or automatic rollouts. | Conceptually important; mostly math rather than forecasting.
Rubrics as RL rewards | Rubrics as Rewards; Rubric Anchors; RGR-GRPO; OpenRubrics; Checklists Are Better Than Reward Models | Generate task-specific criteria, score outputs against criteria, and use the scores as dense RL rewards. | Useful for designing forecasting rubrics; not yet validated on tradable event forecasting.
Anti-false-positive process rewards | Curing Miracle Steps; Step-wise Rubric Rewards | Penalize correct answers reached by broken reasoning and avoid applying one scalar reward to all steps. | Highly relevant to avoiding lucky forecasts, but still needs forecasting-specific adaptation.

2. How do these methods work?

Mechanism | Training signal | Scientific principle | Forecasting risk
Outcome reward | Reward = function of final outcome, such as negative Brier, log score, or correctness. | Directly optimizes the target objective and is compatible with proper scoring rules. | High variance; a lucky forecast with poor reasoning is rewarded, and a good tail-risk analysis can be punished.
Process reward model | Reward each reasoning step for local validity. | Credit assignment is easier because the model receives feedback on where reasoning went wrong. | For forecasting, "valid reasoning" is not enough; the world can still resolve against a valid argument.
Rubrics as rewards | Use an LLM judge or verifier to score criteria such as base rates, evidence quality, causal mechanisms, uncertainty, and counterarguments. | Dense, interpretable reward can guide open-ended reasoning where exact verification is impossible. | Judge bias, rubric gaming, and over-rewarding visible reasoning style.
Automated rubric search | Generate candidate rubrics and select those whose score rankings correlate with outcome-based model rankings. | Turns "good reasoning" into a data-driven search problem rather than relying only on hand-written rubrics. | It is process-shaped but outcome-calibrated; if the search target is outcome Elo, the rubric can become a proxy for historical outcomes.
Step-wise rubric reward | Attribute rubric items to specific steps and combine per-step rewards with final outcome reward. | Fixes the error of giving every token in a response the same scalar reward. | Requires high-quality step segmentation and reliable judge attribution.
Delayed live outcome RL | Store prediction-time agent trajectories, wait for resolution, backfill rewards, then replay for policy updates. | Prevents leakage by construction and aligns training to real outcomes. | Slow labels, operational complexity, and still vulnerable to noisy outcome rewards.

3. What Echo appears to add

Echo combines four ideas that were previously scattered: live/future questions, prediction-point aligned Elo evaluation, automated rubric search over forecasting trajectories, and Map-Reduce agent decomposition. The WeChat/QbitAI article describes rubrics such as precursor/external-catalyst evaluation and multi-factor causal synthesis: the system checks whether the model identifies concrete forward-looking catalysts, links them to historical correlations, and integrates independent factors into a causal probability judgment.

The important caveat is circularity. Echo's public description says candidate rubrics are selected by maximizing the correlation between rubric-based rankings and outcome-based Elo rankings. That means the final training signal is not purely independent process truth. It is a process proxy tuned against realized outcomes. This may be useful, but it must be tested against outcome-only RL, rubric-only RL, and hybrid process-plus-Brier RL on fresh markets.
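The three reward arms in that ablation can be written down as a toy sketch. The rubric dimensions come from the mechanisms table above; the 0-to-1 rubric scale, the equal weighting, the mixing weight alpha, and the example scores are assumptions for illustration, not Echo's or OpenForesight's actual reward.

```python
def outcome_reward(prob, outcome):
    """Outcome-only arm: 1 - Brier, so it lies in [0, 1] and higher is better."""
    return 1.0 - (prob - outcome) ** 2

def rubric_reward(scores):
    """Process-only arm: unweighted mean of judge scores in [0, 1] per rubric item."""
    return sum(scores.values()) / len(scores)

def hybrid_reward(prob, outcome, scores, alpha=0.5):
    """Hybrid arm: convex mix of process quality and the proper-scoring outcome term."""
    return alpha * rubric_reward(scores) + (1.0 - alpha) * outcome_reward(prob, outcome)

RUBRIC_ITEMS = ["base_rates", "evidence_quality", "causal_mechanisms",
                "uncertainty", "counterarguments"]

# Hypothetical trajectories on an event that resolved YES (outcome = 1).
lucky = {"prob": 0.9, "scores": {k: 0.2 for k in RUBRIC_ITEMS}}    # confident, weak reasoning
careful = {"prob": 0.6, "scores": {k: 0.9 for k in RUBRIC_ITEMS}}  # hedged, strong reasoning

for name, t in (("lucky", lucky), ("careful", careful)):
    print(name,
          round(outcome_reward(t["prob"], 1), 3),
          round(hybrid_reward(t["prob"], 1, t["scores"]), 3))
```

The outcome-only arm prefers the lucky trajectory and the hybrid arm prefers the careful one. Whether that preference improves forecasts on fresh markets is exactly what the ablation has to test, because the rubric scores themselves may be tuned against past outcomes.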

What Our Paper Could Add Beyond The Existing Literature

The clean contribution is not just another trained forecaster. The publishable gap is to test whether reasoning-process rewards actually improve tradable event forecasting once leakage, market prices, and realistic execution are controlled.

Existing literature | What it has done | Remaining gap | Our contribution
Forecasting systems | RAG, aggregation, dynamic benchmarks, OpenForesight-style outcome RL. | Mostly evaluates Brier/accuracy, not whether process rewards improve market alpha. | Run outcome-only, rubric-only, and hybrid models on the same timestamped Kalshi/Polymarket split.
Echo / Train-on-Future | Proposes live future questions, point-aligned Elo, automated rubric search, and Map-Reduce agents. | Not peer-reviewed; no public EchoZ weights found; rubric search is outcome-calibrated. | Reproduce the idea transparently with hosted data, held-out prospective questions, and ablations.
Process-supervision papers | Show that step-level or rubric rewards can reduce false-positive reasoning in math/open-ended tasks. | They do not solve stochastic world outcomes, market prices, or temporal leakage. | Design forecasting-specific rubrics for base rates, catalysts, causal synthesis, source dating, and calibrated uncertainty.
Leakage papers | Use dated corpora, prospective questions, and claim-level audits. | Leakage audits rarely apply to the training reward itself. | Audit whether the rubric rewards leaked claims, not just whether final answers were contaminated.
Prediction-market papers | Show models can be calibrated poorly and can lose money despite plausible forecasts. | They rarely ask which reasoning failures cause bad trades. | Connect reasoning-rubric dimensions to Brier Skill Score, expected value, P&L, drawdown, and deference to market price.

Recommended experimental design: start with Qwen3-8B/OpenForecaster-8B, build forecasting rubrics from resolved training markets, then compare four systems: base agent, outcome-RL agent, process-rubric agent, and hybrid process+Brier agent. Final claims should be made only on prospective or strictly point-in-time market questions, with market price as a mandatory baseline.

Prediction Markets: What Changes When Events Are Tradable

Prediction-market forecasting is not just another Brier benchmark. A market event has prices, bid/ask spreads, depth, fees, market makers, resolution rules, and trader selection. A model can improve Brier and still lose money if it trades too late, crosses wide spreads, overbets low-liquidity contracts, or follows public prices without adding information.
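A small expected-value calculation shows how a forecast that disagrees with the mid price can still be untradable. This is a sketch under simplifying assumptions: a binary contract paying 1, a flat hypothetical per-share fee, fills at the quoted bid/ask, and no depth or slippage. Real venue fee schedules differ.

```python
def expected_value_per_share(model_p, bid, ask, fee=0.0):
    """Expected profit per share if the model probability is taken as true."""
    return {
        "buy_yes": model_p - ask - fee,  # pay the ask, receive 1 with probability p
        "buy_no": bid - model_p - fee,   # NO costs 1 - bid, pays 1 with probability 1 - p
    }

def best_trade(model_p, bid, ask, fee=0.0):
    """Return (side, ev) for the best positive-EV side, or None to abstain."""
    side, ev = max(expected_value_per_share(model_p, bid, ask, fee).items(),
                   key=lambda kv: kv[1])
    return (side, ev) if ev > 0 else None

# Model says 0.62 against a 0.58 mid: a 4-point disagreement, but no trade survives costs.
print(best_trade(model_p=0.62, bid=0.55, ask=0.61, fee=0.02))

# A larger disagreement clears the spread and fee.
print(best_trade(model_p=0.70, bid=0.55, ask=0.61, fee=0.02))
```

The first call abstains because crossing the spread and paying the fee costs more than the perceived edge. Even the positive case assumes the model probability is correct, which the calibration results above give little reason to take for granted.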

Paper | What it did | Lesson
KalshiBench | 300 temporally filtered Kalshi questions, five frontier models. | Overconfidence is widespread; calibration is a distinct capability.
Prophet Arena | 1,367 resolved events and 72,136 markets with common contexts. | Evaluate Brier, calibration, and economic value together.
Prediction Arena | Autonomous agents trading with real capital on Kalshi and Polymarket. | Live trading exposes model, execution, and venue weaknesses.
PolyBench | 38,666 Polymarket markets with CLOB snapshots and news. | Order-book simulation and slippage are core evaluation components.
TimeSeek | 150 Kalshi markets at five lifecycle checkpoints, with/without search. | LLMs add the most value early and in high-uncertainty regimes.
PolySwarm / Evidence Markets | Multi-agent trading and limits of evidence aggregation. | Reflexivity, manipulation, and resolution ambiguity are open problems.

Strategic implication: the differentiated project is not simply "train a better model." It is to measure when a model adds information beyond price, when it should defer to the market, and whether that edge survives realistic execution.
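One way to measure where a model adds information beyond price is to compute Brier skill against the market separately per regime, for example by lifecycle stage. The sketch below assumes rows with hypothetical keys bucket, model_p, market_p, and outcome; the data are invented to show the mechanics, not results.

```python
from collections import defaultdict

def brier_skill_by_bucket(rows):
    """Per-bucket skill: 1 - (sum of model squared error / sum of market squared error)."""
    sums = defaultdict(lambda: [0.0, 0.0])
    for r in rows:
        acc = sums[r["bucket"]]
        acc[0] += (r["model_p"] - r["outcome"]) ** 2
        acc[1] += (r["market_p"] - r["outcome"]) ** 2
    # Skill is undefined when the market made no error in a bucket.
    return {b: (1.0 - m / k if k > 0 else None) for b, (m, k) in sums.items()}

rows = [
    {"bucket": "early", "model_p": 0.70, "market_p": 0.50, "outcome": 1},
    {"bucket": "early", "model_p": 0.20, "market_p": 0.40, "outcome": 0},
    {"bucket": "late",  "model_p": 0.80, "market_p": 0.95, "outcome": 1},
    {"bucket": "late",  "model_p": 0.30, "market_p": 0.05, "outcome": 0},
]
skill = brier_skill_by_bucket(rows)
print({b: round(s, 3) for b, s in skill.items()})
```

Positive skill in a bucket marks a regime where using the model may be justified; negative skill marks a regime where deferring to the market price is the safer default. A learned gate would replace the hand-picked buckets, but it should be validated on out-of-time markets for the same reason P&L is.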

Training Strategy

Training is feasible at 8B scale, but the decision should be gated by baselines. The right default is to build an agentic forecaster and evaluator first, then train only where the evaluator shows model-specific errors that post-training can plausibly fix.

Option | Pros | Cons | When to use
No training: agent + ensemble | Fast, cheap, strong baseline; can use BLF-style belief states and calibration. | Depends on API models; not a proprietary asset. | First milestone.
Closed/API model improvement | Improves forecasts through frozen retrieval, personas, self-consistency, market-price gating, calibration, and rubric feedback without changing weights. | Harder to own as a model artifact; API providers can change model behavior. | Run in parallel with dataset construction and use outputs to create training data.
LoRA/SFT on 8B | Cheap and stable; teaches format, base-rate discipline, and calibration language. | Limited true forecasting gain if data are weak. | After evaluation stack exists.
8B GRPO/ReMax RL | Closest to OpenForesight and outcome-RL recipes. | ~1,000 H100-hr final run; reward and leakage risks. | After baselines show clear training target.
Market-specific RL | Targets tradable events directly. | P&L reward is noisy and overfit-prone; market impact issues. | Only after probability calibration works.
30B-A3B or larger | Potentially stronger reasoning and retrieval synthesis. | Multi-node complexity and six-figure ablation budget. | Only if 8B beats market/ensemble baselines cleanly.

What We Should Do, Step By Step

Step 0
Define the exact target.
Tradable binary/multi-outcome events on Kalshi and Polymarket, plus open-ended event questions for broader generalization.
Done when: we have a written schema for question, forecast date, resolution criteria, source cutoff, market snapshot, outcome, and execution assumptions.
Step 1
Build the market data repository.
Store market prices, bid/ask, CLOB depth, volume, fees, resolution text, close/resolution dates, and snapshots. Do not rely on final market pages.
Done when: every forecast can be replayed from a point-in-time state.
Step 2
Build the timestamped retrieval corpus.
Use static snapshots of news, official releases, SEC/EDGAR, FRED/BLS/BEA, sports APIs, and market pages. Hash documents and retain publication/retrieval timestamps.
Done when: a forecast on date T cannot retrieve documents after T.
Step 3
Implement evaluation before modeling.
Brier, log score, calibration curves, ECE, market-price Brier Skill Score, P&L with fees/spreads/slippage, drawdown, and coverage-risk curves.
Done when: market price and naive baselines can be reproduced automatically.
Step 4
Run no-training baselines.
Market price, market price plus calibration transforms, OpenForecaster-8B, frontier single models, frontier ensembles, and BLF-style agentic search with logit aggregation.
Done when: we know where models add value beyond price and where they should defer.
Step 5
Add leakage audit.
Start with source cutoff checks, answer alias filtering, and post-cutoff phrase detection. Add a claim-level Shapley-DCLR-style audit for final experiments.
Done when: each result has a leakage score or comes from a clean prospective design.
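The first-pass checks can be approximated with simple string heuristics. The sketch below is a rough illustration: the phrase list is hypothetical and far from complete, and a year later than the cutoff is only a flag for review, because legitimate forecasting text often mentions future dates.

```python
import re
from datetime import date

# Hypothetical hindsight phrases; a real audit would need a curated, tested list.
HINDSIGHT_PHRASES = ["as we now know", "in hindsight", "turned out", "ended up"]
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def leakage_flags(text, cutoff, answer_aliases=()):
    """Return heuristic leakage flags for a retrieved chunk or a model rationale."""
    flags = []
    lowered = text.lower()
    # Answer alias filtering: the resolved answer appears verbatim in the text.
    for alias in answer_aliases:
        if alias.lower() in lowered:
            flags.append(f"alias:{alias}")
    # Post-cutoff phrase detection: wording that implies knowledge of the outcome.
    for phrase in HINDSIGHT_PHRASES:
        if phrase in lowered:
            flags.append(f"phrase:{phrase}")
    # Source cutoff check: explicit years later than the cutoff year.
    for match in YEAR_RE.finditer(text):
        if int(match.group(0)) > cutoff.year:
            flags.append(f"future_year:{match.group(0)}")
    return flags
```

A simple leakage score is then the share of chunks or rationales with at least one flag.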
Step 6
Train small.
Reproduce OpenForesight/OpenForecaster-style training on Qwen3-8B with structured probability outputs, random retrieved chunks, Accuracy+Brier reward, no std-normalized advantages, and strict formatting guardrails.
Done when: the model beats base Qwen3-8B, OpenForecaster-8B, and ensemble baselines on fresh held-out questions.
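A sketch of the reward shaping named above, under stated assumptions. The reward form (correctness minus squared calibration error) is one plausible reading of an Accuracy+Brier reward and may differ from the published recipe; the format penalty of -1.0 is a hypothetical guardrail value. The advantage function shows the mean-centered, non-std-normalized variant of the usual GRPO group advantage.

```python
def accuracy_brier_reward(is_correct, confidence, format_ok=True):
    """Correctness term plus a Brier-style penalty on the stated confidence.

    Assumed form: y - (confidence - y)^2 with y = 1 if correct else 0.
    """
    if not format_ok or not (0.0 <= confidence <= 1.0):
        return -1.0  # hypothetical penalty for malformed outputs
    y = 1.0 if is_correct else 0.0
    return y - (confidence - y) ** 2


def group_advantages(rewards):
    """Mean-centered advantages for one prompt's rollouts, with no std division."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

Skipping the division by the group standard deviation means a group of rollouts with nearly identical rewards contributes a small gradient instead of an amplified one.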
Step 7
Train market-specific calibration first, not direct trading.
Use Brier/log rewards and calibration losses on resolved tradable events. Treat P&L as a holdout selection metric, not the first reward.
Done when: gains survive fees, spreads, liquidity, and out-of-time markets.
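To use P&L as a holdout metric, each forecast has to be mapped to a trade at executable prices. The sketch below assumes a one-contract taker trade on a binary market paying 1 on YES, a flat per-contract fee, and no depth or slippage model; real venue fee schedules differ, and the function and parameter names are hypothetical.

```python
def trade_pnl(prob, bid, ask, outcome, fee=0.0, min_edge=0.0):
    """Realized P&L per contract for a one-shot taker trade, or 0.0 if no trade.

    prob is the model's YES probability; bid and ask are YES prices in [0, 1].
    """
    if prob - ask - fee > min_edge:
        # Buy YES at the ask: pays 1 if the event happens.
        return outcome - ask - fee
    if bid - prob - fee > min_edge:
        # Sell YES (equivalently buy NO) at the bid.
        return bid - outcome - fee
    # The forecast sits inside the fee-adjusted spread: defer to the market.
    return 0.0
```

A forecast can improve the Brier score and still never trade under this rule, which is why calibration gains and tradable gains have to be reported separately.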
Step 8
Decide on large-model training.
Only scale to 30B-A3B if 8B results show robust leakage-clean alpha or if the scientific contribution requires open weights at larger scale.
Done when: expected value of compute exceeds cheap ensemble and agent alternatives.

Research Questions Worth Writing A Paper Around

  1. Calibration to profit: characterize when a calibrated probability forecaster produces positive expected return under CLOB fees, spreads, and depth.
  2. Selective deference: learn a gate that chooses market price, model, search-agent, or abstain by horizon, liquidity, category, and uncertainty.
  3. Leakage-robust backtesting: adapt Shapley-DCLR to prediction-market rationales and open-ended forecasts.
  4. Cross-venue price discovery: measure Kalshi vs Polymarket lead-lag and whether LLMs exploit or merely follow it.
  5. Reflexivity: test whether public AI forecasts alter prices and degrade future training labels.
  6. Aggregation vs training: compare OpenForesight-style training to a cost-matched ensemble of frontier models.

Chronological Stream Matrix and Hosted PDF Library

The table is ordered chronologically by first public paper date. It classifies each paper into a stream and states what the paper solved on top of the previous literature. I removed the older pre-LLM market-theory papers from this library so that the matrix focuses on modern LLM forecasting, datasets, rewards, social forecasting, and API/system improvement.

Stream | Date | Venue / publication | Paper | What it solves on top of previous work | PDF
Dataset Construction | Jun 2022 | NeurIPS 2022 Datasets and Benchmarks | Autocast | Introduces dated event-forecasting questions and a time-indexed news corpus, making backcasting possible instead of ordinary QA on resolved facts. | PDF
API/System Improvement | Jul 2022 | arXiv preprint | Language Models (Mostly) Know What They Know | Adds the idea that models may expose usable uncertainty, which later supports API-level calibration and confidence elicitation. | PDF
API/System Improvement | Oct 2022 | ICLR 2023 | ReAct | Moves from static prompting to interleaved reasoning and action, the basic pattern for search-based forecasting agents. | PDF
Process-Based Rewards | Nov 2022 | arXiv preprint | Solving Math Word Problems with Process- and Outcome-Based Feedback | Separates process supervision from outcome supervision and frames why final-answer reward is weak credit assignment. | PDF
API/System Improvement | Feb 2023 | NeurIPS 2023 | Toolformer | Shows models can learn tool use, extending ReAct from prompting into trainable tool-augmented behavior. | PDF
API/System Improvement | May 2023 | arXiv preprint | Just Ask for Calibration | Shows verbalized probabilities can improve calibration, a direct method for closed API forecasting systems. | PDF
Process-Based Rewards | May 2023 | arXiv preprint | Let's Verify Step by Step | Demonstrates that step-level reward models can outperform outcome-only reward models, strengthening the case for reasoning rewards. | PDF
Social Forecasting | Oct 2023 | arXiv preprint | Large Language Model Prediction Capabilities | Tests GPT-4 against a live human forecasting tournament, establishing that raw frontier APIs underperform human crowds. | PDF
Process-Based Rewards | Dec 2023 | arXiv preprint | Math-Shepherd | Automates process labels with rollouts, reducing dependence on expensive human step annotations. | PDF
Process-Based Rewards | Feb 2024 | arXiv technical report | DeepSeekMath / GRPO | Introduces GRPO-style critic-free RL, later reused by forecasting post-training recipes. | PDF
Social Forecasting | Feb 2024 | ACM TOIIS; final Feb 2025 | AI-Augmented Predictions | Shows LLM assistance can improve human forecasting, moving beyond model-only evaluation. | PDF
Social Forecasting | Feb 2024 | NeurIPS 2024 | Approaching Human-Level Forecasting with Language Models | Combines retrieval, reasoning, and aggregation to approach human crowd performance, showing system design beats raw prompting. | PDF
Social Forecasting | Feb 2024 | Science Advances; final Nov 2024 | Wisdom of the Silicon Crowd | Shows model ensembles can rival human crowds and that human/model aggregation is a powerful low-compute baseline. | PDF
Dataset Construction | May 2024 | arXiv preprint | Freshbench / Is Your LLM Outdated? | Pushes evaluation toward fresh, post-cutoff questions so benchmark results are not stale memorization tests. | PDF
API/System Improvement | Jun 2024 | arXiv preprint | Can Language Models Use Forecasting Strategies? | Tests whether prompting can elicit superforecasting heuristics, clarifying limits of API-only improvement. | PDF
Dataset Construction | Jul 2024 | arXiv preprint | MIRAI | Adds agentic event forecasting with structured historical events and news, broadening datasets beyond static text questions. | PDF
Dataset Construction | Sep 2024 | ICLR 2025 | ForecastBench | Introduces a dynamic unresolved-question benchmark, solving the core leakage problem by waiting for future resolutions. | PDF
Dataset Construction | Jan 2025 | arXiv preprint | Navigating Tomorrow | Studies pre/post-cutoff behavior, making model cutoff itself an explicit evaluation variable. | PDF
Process-Based Rewards | Jan 2025 | arXiv technical report | DeepSeek-R1 | Shows large-scale RLVR can improve reasoning while warning that reward models and verifiers can be gamed. | PDF
Dataset Construction | Jan 2025 | COLING 2025 | OpenForecast | Expands forecasting from binary questions to large-scale open-ended event decomposition and semantic resolution. | PDF
Process-Based Rewards | Feb 2025 | arXiv preprint | LLMs Can Teach Themselves to Better Predict the Future | Uses self-play reasoning traces ranked by eventual outcomes, showing post-training can improve forecasting without human-written rationales. | PDF
API/System Improvement | Mar 2025 | COLM 2025 | Search-R1 | Trains search as part of reasoning and masks retrieved tokens, separating model behavior from copied evidence. | PDF
Process-Based Rewards | Mar 2025 | arXiv open RL report | DAPO | Improves GRPO-style RL infrastructure with stability tricks later applicable to forecasting RL. | PDF
Process-Based Rewards | May 2025 | TMLR; published Nov 2025 | Outcome-Based RL to Predict the Future | Adapts RL to noisy delayed binary outcomes and shows calibration/profit can improve, but still depends on outcome rewards. | PDF
Dataset Construction | May 2025 | arXiv preprint | ExAnte | Formalizes ex-ante inference and evaluates whether models recall post-cutoff outcomes rather than forecasting. | PDF
Dataset Construction | Jun 2025 | arXiv preprint | Pitfalls in Evaluating LM Forecasters | Audits evaluation methodology and explains why static benchmark gains can overstate real forecasting skill. | PDF
API/System Improvement | Jun 2025 | arXiv preprint | Prompt Engineering LLM Forecasting Capabilities | Systematically tests prompt variants, showing API-level gains are modest and must be measured rather than assumed. | PDF
Dataset Construction | Jun 2025 | arXiv preprint | Bench to the Future | Builds frozen-corpus pastcasting so researchers can iterate faster than prospective benchmarks allow. | PDF
Process-Based Rewards | Jul 2025 | ICLR 2026 | Beyond Binary Rewards / RLCR | Adds calibrated Brier-style reward to correctness, addressing overconfident outcome-only RL. | PDF
Process-Based Rewards | Jul 2025 | arXiv preprint | Rubrics as Rewards | Uses structured rubrics as RL rewards for open-ended tasks, directly motivating Echo-style process scoring. | PDF
Process-Based Rewards | Jul 2025 | arXiv preprint | Checklists Are Better Than Reward Models | Shows instruction-specific checklists can outperform fixed scalar reward models, supporting task-specific forecasting rubrics. | PDF
Process-Based Rewards | Jul 2025 | arXiv preprint | Advancing Event Forecasting through Massive Training of LLMs | Scales event-forecasting post-training, shifting the literature from evaluation to direct model improvement. | PDF
Dataset Construction | Aug 2025 | arXiv preprint | FutureX | Operationalizes live, continuously updated agent evaluation over many models and event sources. | PDF
Process-Based Rewards | Aug 2025 | arXiv technical report | Reinforcement Learning with Rubric Anchors | Shows rubric-based RL can train a Qwen-30B-A3B open-ended reasoning model, relevant to UniScientist-style training. | PDF
API/System Improvement | Aug 2025 | NeurIPS 2025 | ConfTuner | Turns verbalized confidence into a trainable Brier-style objective, bridging API confidence and post-training. | PDF
Process-Based Rewards | Oct 2025 | arXiv preprint | OpenRubrics | Scales synthetic rubric generation, solving the manual-rubric bottleneck for process-reward training. | PDF
Process-Based Rewards | Oct 2025 | arXiv preprint | Curing Miracle Steps | Targets false positives where correct answers arise from invalid reasoning, the exact analogue of lucky forecasts. | PDF
Dataset Construction | Oct 2025 | arXiv preprint | Prophet Arena | Adds common-context prediction-market evaluation with market prices, linking forecasting benchmarks to tradable events. | PDF
Process-Based Rewards | Nov 2025 | arXiv preprint | Reward and Guidance through Rubrics / RGR-GRPO | Combines rubric reward with offline guidance, improving exploration beyond ordinary verifiable-reward RL. | PDF
Dataset Construction | Dec 2025 | arXiv preprint | KalshiBench | Uses resolved Kalshi questions after model cutoffs to measure overconfidence on regulated prediction-market events. | PDF
Dataset Construction | Dec 2025 | arXiv preprint | OpenForesight / OpenForecaster | Combines synthetic open-ended forecast data, frozen retrieval, Accuracy+Brier reward, and a public 8B trained model. | PDF
Process-Based Rewards | Feb 2026 | arXiv preprint | All Leaks Count / Shapley-DCLR | Moves leakage analysis from dataset-level hygiene to claim-level decision impact, enabling leakage-weighted rewards/evaluation. | PDF
Social Forecasting | Apr 2026 | arXiv preprint | PolySwarm | Tests multi-agent LLM trading behavior, expanding social forecasting into interacting market agents. | PDF
Dataset Construction | Apr 2026 | 2nd ICLR Workshop on Advances in Financial AI | TimeSeek | Evaluates Kalshi markets at multiple lifecycle checkpoints and shows when search/API models help or hurt. | PDF
Dataset Construction | Apr 2026 | arXiv preprint | Prediction Arena | Uses real capital on live Kalshi/Polymarket agents, making execution and realized returns part of evaluation. | PDF
Dataset Construction | Apr 2026 | arXiv preprint | PolyBench | Adds CLOB snapshots, news, and simulated returns, solving the gap between Brier evaluation and tradable performance. | PDF
Social Forecasting | Apr 2026 | arXiv preprint | Agentic Forecasting using Sequential Bayesian Updating | Introduces belief-state updates and aggregation as a principled alternative to one-shot model forecasts. | PDF
Dataset Construction | Apr 2026 | arXiv preprint | BTF-2 | Extends frozen-corpus pastcasting with a larger corpus and reasoning traces, improving reproducibility. | PDF
Process-Based Rewards | Apr 2026 | arXiv preprint | FutureWorld | Closes the loop from live questions to delayed real-world rewards, contrasting outcome backfill with process-only rewards. | PDF
Process-Based Rewards | May 2026 | arXiv preprint | Teaching LLMs When Not to Know | Trains/prompt-critiques models to reject post-cutoff knowledge, supporting temporal admissibility in reasoning rewards. | PDF
Process-Based Rewards | May 2026 | arXiv preprint | Step-wise Rubric Rewards | Attributes rubric items to individual reasoning steps, fixing the problem of applying one scalar rubric reward to all tokens. | PDF
Social Forecasting | May 2026 | arXiv preprint | Multi-Agent AI Oracles for Prediction Market Resolution | Uses multiple agents for market resolution, extending social forecasting to adjudication and settlement. | PDF
API/System Improvement | Jun 2026 | arXiv preprint | LLMs Are Overconfident | Consolidates evidence that reasoning/API models remain overconfident, reinforcing calibration as a separate improvement target. | PDF
Social Forecasting | Jun 2026 | arXiv preprint | Evidence Markets | Analyzes limits of markets as evidence aggregators, warning that social forecasting systems also need structure and governance. | PDF