LLM Forecasting Literature Review and Research Plan

Thesis

The literature has moved from "can an LLM answer forecasting questions?" to "can an agentic system, with timestamped retrieval, calibrated probabilities, and market-aware evaluation, add information beyond human crowds and prediction-market prices?" The answer is mixed. LLM systems now approach strong human aggregates in some benchmark settings, but prediction-market studies show that Brier score, calibration, and tradable profit are different targets.

Project recommendation: do not begin with a 30B-scale training run. First build the market data, leakage audit, evaluation, and baseline stack. Then reproduce an 8B OpenForesight-style training recipe. Train larger models only after small systems beat market prices, cheap ensembles, and OpenForecaster-8B on prospective questions, net of fees, spreads, slippage, and liquidity constraints.

Compute Reality

OpenForesight SFT, Qwen3-8B: 40 H100-hr
OpenForesight final RL: ~1,000 H100-hr
OpenForesight ablations: ~20,000 H100-hr
Outcome-RL 14B setup: 8 H100 x ~3 days

The final run is not the budget. The ablation loop is the budget.

Market Reality

KalshiBench: frontier models are systematically overconfident.
Prediction Arena: live Kalshi agents lost money in the reported cohort.
PolyBench: only 2 of 7 models had positive simulated order-book returns.
TimeSeek: models are most useful early and on uncertain markets.

Failure Modes

Retrieval leakage and bad source timestamps.
Parametric memory of post-cutoff outcomes.
Outcome rewards reinforcing lucky reasoning.
Rubric rewards becoming circular outcome proxies.
Prompting that increases confidence without accuracy.
P&L overfitting to fees, depth, and market selection.

Best Public Starting Point

OpenForecaster-8B is the most relevant open model. It is Qwen3-8B post-trained on OpenForesight with GRPO and Accuracy+Brier reward. UniPat EchoZ is useful as a design reference, but I did not find public Hugging Face EchoZ weights.

Dataset Repository Status

I built a local dataset catalog and downloader, then pulled the manageable public forecasting and prediction-market datasets onto the server. Raw datasets are stored locally under data/raw/; the website hosts manifests and schema profiles, not the raw multi-gigabyte data.

Downloaded

HF datasets downloaded: 28
GitHub repos cloned: 5
Local footprint: 8.31 GiB
Failed downloads: 0

High-Value Sources Now Local

OpenForesight, ForecastBench, FutureX, KalshiBench v1/v2, Prophet Arena, Autocast, MIRAI, PolyBench, Kalshi trades, Polymarket samples, Metaculus snapshots

Schema signals: KalshiBench has ground_truth and market_probability; OpenForesight has questions, resolution criteria, answers, article timestamps, and generated forecasts; Kalshi/Polymarket mirrors provide prices, trades, bid/ask fields, and order-book-like samples.

Skipped By Policy

Polymarket full SII mirror: 159.11 GiB.
Polymarket on-chain mirror: 118.45 GiB.
Polymarket crypto up/down: 27.08 GiB.
Polymarket crypto derivatives: 17.90 GiB and 53k files.
OpenForecast: manual Google Drive download.

Artifacts: download inventory, schema profile, download JSON, profile JSON, and pipeline notes.

Server-Side Evaluation Pipeline

The first evaluation package is now in forecast_eval/. It scores binary event forecasts from CSV, JSON, JSONL, or parquet and outputs machine-readable metrics. This is the right first stage: before training, we need stable scoring, market baselines, and point-in-time replay.

Component	Implemented now	Next extension
Proper scoring	Brier score, log score, accuracy at 0.5.	Multiclass Brier/log scoring for Echo, Prophet Arena, and non-binary markets.
Calibration	Calibration bins and expected calibration error.	Reliability curves by horizon, source, liquidity, and model family.
Market comparison	Market-price Brier skill score.	Matched forecast/market snapshots with bid/ask and timestamp alignment.
Trading backtest	One-share edge-triggered P&L against market probability.	Fees, spreads, slippage, depth, partial fills, drawdown, and selective abstention.
Validation	Unit tests passed; sample metrics written to `data/eval/`.	Dataset adapters for KalshiBench, OpenForesight, ForecastBench, Prophet Arena, and Echo snapshots.

Sample command: `python3 -m forecast_eval.cli --predictions examples/predictions_sample.csv --out data/eval/sample_metrics.json --calibration-out data/eval/sample_calibration.csv --bins 5 --edge-threshold 0.05`
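The binary metrics listed in the table can be sketched in a few lines. This is a minimal illustration, not the forecast_eval/ implementation: it assumes the log score is reported as mean negative log-likelihood (lower is better), that ECE uses equal-width probability bins, and that the Brier skill score is one minus the ratio of model Brier to market Brier. All function names are hypothetical.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_score(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood; eps clips probabilities away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def accuracy_at_half(probs, outcomes):
    """Share of forecasts on the correct side of 0.5 (0.5 itself counts as YES)."""
    return sum((p >= 0.5) == bool(y) for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(probs, outcomes, bins=5):
    """Weighted mean gap between average forecast and observed frequency per bin."""
    buckets = [[] for _ in range(bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * bins), bins - 1)  # p == 1.0 falls in the last bin
        buckets[idx].append((p, y))
    ece = 0.0
    for bucket in buckets:
        if bucket:
            avg_p = sum(p for p, _ in bucket) / len(bucket)
            freq = sum(y for _, y in bucket) / len(bucket)
            ece += len(bucket) / len(probs) * abs(avg_p - freq)
    return ece


def brier_skill_score(model_brier, market_brier):
    """Positive when the model beats the market baseline, negative otherwise."""
    return 1.0 - model_brier / market_brier
```

As a usage example, Brier scores of 0.2475 for a model and 0.1931 for the market give a skill of about -0.28, meaning the model is roughly 28% worse than the market baseline on Brier.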

GLM-5.2 OpenCode Pilot

I ran zai-coding-plan/glm-5.2 through the OpenCode CLI on a partial set of 480 resolved Kalshi markets. This was a first operational test, not a full leakage-clean benchmark. The model was prompted closed-book, without market prices, and scored against the Kalshi cutoff market price.

Metric	GLM-5.2	Market baseline	Readout
Brier score	0.2475	0.1931	GLM worse
Log score	0.7214	0.5725	GLM worse
Accuracy at 0.5	61.46%	69.79%	GLM worse
ECE	0.1857	0.1656	GLM worse
Brier skill vs market	-28.15%	baseline	Negative skill

The naive one-share edge backtest showed +17.57 gross units over 406 trades before fees, spreads, and slippage, but both proper scores and the ECE say the model was less accurate and less calibrated than the market. Treat the P&L as a hypothesis to audit, not as evidence of tradable alpha.
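For readers who want to see what a one-share edge-triggered backtest looks like, here is a minimal sketch. The exact rule inside forecast_eval/ is not reproduced here; the sketch assumes that a trade fires when the model and market probabilities differ by more than the threshold, that YES is bought at the market probability, and that NO is bought at one minus the market probability, with no fees, spread, or slippage. The function name is hypothetical.

```python
def one_share_edge_backtest(model_probs, market_probs, outcomes, edge_threshold=0.05):
    """Gross P&L of buying one share whenever |model - market| exceeds the threshold."""
    pnl = 0.0
    trades = 0
    for p, m, y in zip(model_probs, market_probs, outcomes):
        edge = p - m
        if edge > edge_threshold:
            # Buy one YES share at price m; it pays 1 if the event occurs.
            pnl += y - m
            trades += 1
        elif edge < -edge_threshold:
            # Buy one NO share at price 1 - m; it pays 1 - y, so profit is m - y.
            pnl += m - y
            trades += 1
    return pnl, trades


pnl, trades = one_share_edge_backtest(
    model_probs=[0.80, 0.30, 0.52],
    market_probs=[0.60, 0.50, 0.50],
    outcomes=[1, 0, 1],
)
print(round(pnl, 2), trades)
```

Because this rule only looks at the direction of the disagreement, a handful of lucky directional calls can produce positive gross P&L even when the probabilities themselves score worse than the market, which is why the result needs an audit before it is read as alpha.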

Report and artifacts: GLM-5.2 Kalshi pilot report, metrics JSON, predictions CSV, and calibration CSV.

Echo / UniPat Deep Dive

Key distinction: UniPat AI is the company. Echo is UniPat AI's prediction/forecasting system. EchoZ-1.0 is the forecasting model reported inside Echo. UniScientist is a different UniPat AI model for scientific research tasks, not the Echo forecasting model.

Key finding: EchoZ-1.0 is visible through the public Echo leaderboard and API, but I did not find public Hugging Face EchoZ weights. The public UniPat Hugging Face model is UniScientist-30B-A3B.

Sources checked: UniPat Echo blog, Echo leaderboard, Echo predictions/questions, QbitAI article, UniPat Hugging Face org, and UniScientist GitHub.

Live Echo Snapshot

Snapshot date: Jul 1 2026
EchoZ rank: #1
EchoZ Elo: 1024.1
EchoZ battles: 83,606
EchoZ resolved cases: 2,396
First prediction date: Mar 4 2026

Market Baseline

Polymarket baseline rank: #3
Market Elo: 1011.0
Market battles: 74,191
Resolved cases: 1,759

The Echo public API says the Market baseline uses Polymarket prices and only includes questions that exist on Polymarket.

What Echo Is Doing

Dynamic question synthesis instead of static train-on-past data.
Aligned prediction points so models are compared at the same question/time state.
Brier-score differences converted into soft pairwise wins for a Bradley-Terry/Elo-style ranking.
Rubrics Search: process rubrics are selected by correlation with held-out outcome-based Elo rankings.
Map-Reduce agent architecture for parallel evidence gathering and final probability aggregation.

How The Leaderboard Works

Step	Mechanism	Why it matters
Question pool	Prediction-market questions, synthesized trend questions, and expert questions.	Broadens beyond one market/source.
Prediction schedule	Questions are sampled at multiple pre-resolution points.	Measures forecasts across the event lifecycle.
Point alignment	Models are compared only on the same question at the same prediction point.	Reduces unfair timing asymmetry.
Scoring	Brier-style probability scores and log scores are shown on question pages.	Rewards calibrated probabilities, not just top-label accuracy.
Ranking	Brier-score differences become soft pairwise battles, then a Bradley-Terry/Elo-style rating is fit.	Produces a live leaderboard that is less sensitive to missing predictions than average Brier alone.
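The scoring-to-ranking step in the last row can be illustrated with a small sketch. This is not Echo's published procedure: the logistic mapping from a Brier difference to a soft win, the `scale` parameter, the conversion to an Elo-like scale around a base of 1000, and all function names are assumptions made for illustration. The fit uses the standard minorization-maximization update for the Bradley-Terry model, applied to fractional wins.

```python
import math


def soft_win(brier_a, brier_b, scale=0.1):
    """Soft probability that A beats B on one aligned prediction point (lower Brier wins)."""
    return 1.0 / (1.0 + math.exp(-(brier_b - brier_a) / scale))


def fit_bradley_terry(battles, models, iters=200):
    """battles: list of (a, b, w) where w in (0, 1) is the soft win credited to a."""
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for m in models:
            wins = 0.0
            denom = 0.0
            for a, b, w in battles:
                if m == a or m == b:
                    wins += w if m == a else 1.0 - w
                    denom += 1.0 / (strength[a] + strength[b])
            # Models without battles keep their current strength.
            updated[m] = wins / denom if denom > 0 else strength[m]
        # Fix the scale by normalizing the geometric mean of strengths to 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}
    return strength


def to_elo(strength, base=1000.0):
    """Map Bradley-Terry strengths to an Elo-like scale (assumed 400 * log10 convention)."""
    return {m: base + 400.0 * math.log10(s) for m, s in strength.items()}


# Model "A" has the lower Brier score on three aligned prediction points.
battles = [("A", "B", soft_win(0.10, 0.30)) for _ in range(3)]
print(to_elo(fit_bradley_terry(battles, ["A", "B"])))
```

Because each battle is defined per question and prediction point, a model that skips some questions simply contributes fewer battles, which is one way to see why a pairwise rating is less sensitive to missing predictions than a plain average.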

UniScientist, the Public Hugging Face Model

Item	Finding	Why it matters
HF org	UnipatAI has one public model and seven public datasets.	No public EchoZ model or Echo dataset found there.
Model	UniScientist-30B-A3B; 31B params, BF16 safetensors, Qwen3 MoE tag, Apache 2.0.	Adjacent rubric/agentic research work, but not event forecasting weights.
Remote size	About 61.08 GB across 25 files.	I downloaded metadata/config/tokenizer files and cloned code, not weights.
Training claim	UniScientist blog reports ~1,200 H200 GPU-hours, 128k context, up to 100 tool calls, 4,700+ rubric-checked scientific instances.	Useful compute reference for rubric-trained agentic models.
Public Echo API	Rankings, question lists, question detail, model detail, and model cases expose probabilities and outcomes.	Enough to benchmark against Echo-style outputs, not enough to reproduce Echo training.

Public Leaderboard Archive

I downloaded the public Echo leaderboard API into a SQLite snapshot and static exports. This includes the active question list, resolved questions discovered from public model-case pages, and a follow-up sweep of the full question-id range (1–10,260) that surfaced 1,523 additional questions invisible to case discovery. Browse everything interactively in the Echo Explorer.

Archive item	Count	What it gives us
Models	19	Public model metadata and leaderboard identities.
Unique questions	5,123	Active, waiting, and resolved public questions: 3,600 from list/case discovery plus 1,523 found by probing the question-detail endpoint across the full id range.
Model cases	41,980	Correct/model answers and correctness by model.
Prediction rows	489,124	Timestamped probability vectors expanded to one row per option.
Ranking history	1,330 rows	Last 10 public ranking-history batches by category.

The archive does not include EchoZ weights, prompts, retrieval logs, reasoning traces, rubric scores, or private training data, so it supports leaderboard analysis but not a full temporal-leakage audit.

Artifacts: deep-dive note, leaderboard archive report, SQLite database, snapshot manifest, rankings CSV, questions CSV, model cases CSV, prediction time series CSV, earlier Echo public API snapshot, and UniScientist HF model metadata.

Date note: the March 2026 Echo blog and QbitAI article reported EchoZ-1.0 at Elo 1034.2. The live public API on July 1, 2026 reports 1024.1, so this page uses the newer snapshot for current status and treats the March number as historical.

QbitAI-Style Synthesis: Three Main Streams

The forecasting literature is best read as three converging streams. The first stream builds leakage-controlled datasets and benchmarks; the second stream tries to reward reasoning quality instead of only final outcomes; the third stream uses societies of models, personas, humans, or markets to improve forecasts without relying on one model's raw answer.

The shared goal is model improvement, but "improvement" does not always mean weight updates. For open models, the path is SFT, DPO, GRPO/ReMax, or rubric-based RL. For closed API models, the path is system-level improvement: better retrieval, frozen point-in-time context, persona/society aggregation, calibration layers, market-price gating, and rubric feedback loops. Those API systems can later generate datasets and rubrics for post-training an open model.

Stream	Core object	How it handles leakage	How it improves the model/system
Dataset Construction	Point-in-time forecasting tasks, frozen corpora, frozen model versions, market snapshots, and resolution criteria.	Backcasting uses only documents available before the simulated forecast date; prospective benchmarks avoid resolved outcomes entirely.	Creates clean train/eval data so gains are not just memorization or retrieval leakage.
Process-Based Rewards	Reasoning traces, step-level validity, rubrics, temporal admissibility, and claim-level leakage weights.	Rewards can penalize leaked claims or down-weight outcomes whose rationale depends on post-cutoff facts.	Gives denser credit assignment than final outcomes, reducing reinforcement of lucky guesses.
Social Forecasting	Human crowds, model ensembles, personas, debate/society systems, market prices, and aggregation rules.	Diversity and aggregation reduce single-model hallucination; point-aligned comparisons keep forecasters in the same information state.	Improves API models without fine-tuning and provides strong baselines any trained model must beat.

What This Implies

Dataset Construction is the foundation. A frozen corpus is not enough by itself; we also need frozen model versions, logged prompts, retrieval snapshots, market prices at forecast time, and clear resolution rules. Otherwise backcasting cannot distinguish forecasting skill from the model remembering or retrieving the answer.

Process-Based Rewards are the central scientific bet. Echo/QbitAI's Train-on-Future framing says the model should be rewarded for forecast reasoning quality, not just whether a stochastic outcome happened to resolve correctly. The right implementation is not rubric-only. It is an ablation: outcome-only vs process-only vs hybrid process+Brier, with a leakage audit of the rationale itself.

Social Forecasting is the strongest near-term API route. Personas, independent model samples, market-price anchors, human forecasts, and calibrated aggregation can improve closed frontier APIs without touching weights. This is not a distraction from training; it is the baseline and data generator for later training.

Scientific Principles Across The Literature

1. How Papers Deal With Temporal Leakage

The formal problem is ex-ante inference: at forecast date t_f, the model may only use information publicly knowable at or before t_f, while the event resolves at t_r > t_f. Leakage occurs when the prompt, retrieval corpus, benchmark, model memory, or evaluation process contains information from after t_f. The literature uses five broad controls.

Leakage control	Scientific idea	Representative papers	Remaining weakness
Prospective evaluation	Ask questions whose outcomes are genuinely unknown at submission time, then wait for resolution.	ForecastBench, FutureX, Metaculus AI Benchmarking, Prediction Arena.	Slow feedback loop; small samples until enough events resolve.
Hermetic pastcasting	Backtest resolved events against a frozen corpus that contains only documents available before the simulated forecast date.	Autocast, Halawi et al., Bench to the Future, BTF-2.	Document metadata can be wrong; the base model may already know later outcomes.
Model-cutoff exploitation	Use events after the base model's training cutoff as "future" labels for training or evaluation.	OpenForesight, self-play/DPO, outcome-RL.	Model cutoff is approximate, model vendors rarely expose full data lineage, and future base models will contaminate old tests.
Source filtering and answer filtering	Remove post-resolution documents, filter aliases of the gold answer, and impose a buffer such as one month before resolution.	OpenForesight, KalshiBench, TimeSeek.	String filters miss indirect clues and cannot remove parametric-memory leakage.
Claim-level audit	Decompose the rationale into atomic claims, verify each claim's earliest public date, and weight leaked claims by decision impact.	Shapley-DCLR / TimeSPEC, ExAnte, Teaching LLMs When Not to Know.	Expensive; requires reliable source dating and claim verification.

The important scientific shift is from binary dataset hygiene to causal attribution: not just "did any post-cutoff text appear?" but "did post-cutoff information materially drive the probability?" That is why Shapley-DCLR is conceptually important. It treats leakage as decision-critical information contamination, not merely a dataset property.
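The difference between "any leaked text" and "decision-critical leaked text" can be made concrete with a toy calculation. The sketch below is a strong simplification in the spirit of the claim-level audits in the table, not the Shapley-DCLR algorithm: it takes each claim's earliest public date and a non-negative decision-impact weight as given inputs, and reports the share of total impact carried by claims that became public after the forecast date. The `Claim` type, its fields, and the function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Claim:
    text: str
    earliest_public_date: date
    impact: float  # non-negative weight; how it is estimated is outside this sketch


def impact_weighted_leakage(claims, forecast_date):
    """Share of decision impact carried by claims first public after forecast_date."""
    total = sum(c.impact for c in claims)
    if total == 0:
        return 0.0
    leaked = sum(c.impact for c in claims if c.earliest_public_date > forecast_date)
    return leaked / total


claims = [
    Claim("Base rate for similar events is about 30%.", date(2025, 1, 10), impact=0.2),
    Claim("Regulator announced the decision.", date(2025, 3, 5), impact=0.8),
]
# One of two claims is leaked, but it carries most of the decision weight.
print(impact_weighted_leakage(claims, forecast_date=date(2025, 2, 1)))
```

A binary hygiene check would flag this rationale as contaminated either way; the weighted view adds that the contamination drove most of the probability.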

2. How Papers Handle Forecasting Agents Given What The LLM Already Knows

Most systems treat the base LLM as a noisy prior: it has broad parametric knowledge up to an uncertain cutoff, but it is stale, overconfident, and may contain forbidden post-cutoff facts in backtests. The agent layer is designed to discipline that prior using timestamped evidence, repeated sampling, aggregation, and calibration.

Agent mechanism	Principle	Used by	What it fixes
Dated retrieval	Condition the model on external evidence available at the forecast date, instead of relying only on parametric memory.	Autocast, Halawi et al., OpenForesight, TimeSeek.	Stale model knowledge and missing current events.
Search as an action	Let the model choose queries and iterate between reasoning and evidence acquisition.	ReAct, Search-R1, MIRAI, BLF-style agents.	One-shot retrieval misses relevant evidence and cannot adapt to uncertainty.
Retrieved-token masking	During RL, optimize the model's generated reasoning/search tokens, not copied retrieved text.	Search-R1 and search-RL descendants.	Prevents the training loss from treating retrieved passages as model behavior.
Belief state	Maintain a structured state with probability, evidence for/against, open questions, and update history.	Bayesian Linguistic Forecaster.	Avoids unbounded context stuffing and makes evidence updates auditable.
Multi-sample aggregation	Run multiple independent forecasts and average in probability or logit space.	Halawi et al., Silicon Crowd, BLF, many benchmark agents.	Reduces variance and idiosyncratic model errors.
Calibration layer	Map raw model probabilities to calibrated probabilities with Platt scaling, Brier training, or post-hoc reliability correction.	BLF, RLCR, ConfTuner, KalshiBench analysis.	Overconfidence and probability misreporting.
Selective deference	Choose whether to use the model, search, market price, ensemble, or abstain depending on regime.	TimeSeek motivates this; ForecastBench and market papers imply it.	Search and models are not uniformly useful across time, category, or market certainty.

The key design principle is separation of roles. The LLM should not be trusted as both evidence store and judge. It should generate hypotheses, queries, and probability estimates, while the system controls timestamped evidence, market snapshots, calibration, aggregation, and leakage audit.
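Two of the system-side mechanisms in the table, multi-sample aggregation and a calibration layer, are simple enough to sketch. The code below assumes aggregation by averaging in logit space and a Platt-style correction of the form sigmoid(a * logit(p) + b), with `a` and `b` fitted elsewhere on resolved questions. The function names and the example parameter values are illustrative only.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of p, with clipping so that 0 and 1 stay finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def aggregate_logit(probs):
    """Average independent forecasts in logit space and map back to a probability."""
    return sigmoid(sum(logit(p) for p in probs) / len(probs))


def platt_apply(p, a, b):
    """Platt-style recalibration; a < 1 shrinks overconfident forecasts toward 0.5."""
    return sigmoid(a * logit(p) + b)


samples = [0.62, 0.70, 0.55, 0.81]   # hypothetical repeated forecasts for one question
pooled = aggregate_logit(samples)
print(pooled, platt_apply(pooled, a=0.5, b=0.0))
```

Keeping these steps outside the model is what the separation of roles means in practice: the LLM proposes probabilities, and the system decides how they are pooled and corrected.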

3. Core Issues And How People Are Solving Them

Core issue	Why it matters	Solutions across papers	What we should do
Labels live in the future	True online learning is slow because outcomes resolve later.	Pastcasting on frozen corpora; model-cutoff exploitation; dynamic prospective benchmarks.	Use pastcasting for iteration, but reserve final claims for prospective market splits.
Temporal leakage	Backtests can reward memorization instead of forecasting.	Dated retrieval, cutoff filtering, alias filtering, unresolved questions, Shapley-DCLR audits.	Implement source cutoff checks first; add claim-level audit for publication-grade results.
Overconfidence	Prediction markets punish miscalibration and overbetting.	Brier/log scores, RLCR, ConfTuner, Platt scaling, reliability diagrams.	Require structured probabilities and report calibration before P&L.
Outcome reward noise	Lucky guesses can look better than good reasoning on tail events.	Accuracy+Brier rewards, self-play pair ranking, process/rubric rewards, guardrails.	Use Brier-based rewards but evaluate reasoning quality and tail-event behavior separately.
Retrieval is useful but risky	Search can add current evidence or introduce noise/leakage.	Offline corpora, retrieved-token masking, date-filtered web search, selective tool use.	Train/evaluate both with and without search, then learn a gating policy.
Market price is a strong baseline	Profit requires beating the crowd, not merely forecasting decently.	Brier Skill Score vs market, CLOB simulation, live trading, market-price anchors.	Always compare to price and price transforms; evaluate after fees and liquidity.
Training may not beat ensembling	Silicon-crowd aggregation is cheap and strong.	LLM ensembles, human+machine averages, logit aggregation, hierarchical calibration.	Make cost-matched ensemble the baseline before any large RL run.
Open-ended answers are hard to score	Important events are not always binary markets.	LLM semantic equivalence grading, LRAE, answer-type filtering, resolution criteria.	Use binary markets for first training/evaluation; add open-ended questions after grading is reliable.

Reasoning-Process Rewards, Rubric Rewards, and Echo Clarification

This is the distinction that matters for Echo/EchoZ. Most forecasting papers train on the realized outcome: accuracy, Brier score, log score, or distance to the resolved answer. Echo's public claim is different: it tries to make the reasoning trajectory itself part of the training signal through automated rubric search. That is closer to the process-supervision literature than to ordinary outcome-RL, but it is not free of outcome dependence.

1. Are there papers like this?

Yes, but they fall into two buckets. The first bucket is forecasting-specific: Echo/UniPat is the clearest public example of rubric-scored forecasting reasoning, and FutureWorld explicitly discusses Echo as a live predictive-agent system using rubric-based process rewards. The second bucket is general reasoning: math, science, medicine, instruction following, and open-ended tasks where researchers reward intermediate steps, checklists, or rubrics rather than only final correctness.

Line of work	Representative papers or systems	What they do	How close to prediction markets?
Forecasting-specific rubric/process reward	UniPat Echo / EchoZ, QbitAI/WeChat article, FutureWorld discussion of UniPat	Score forecast reasoning trajectories with domain rubrics; Echo searches rubrics whose model rankings agree with outcome-based Elo.	Directly relevant, but Echo is a company blog/system claim rather than peer-reviewed evidence.
Forecasting outcome RL	Self-play/DPO, Outcome-Based RL, OpenForesight, FutureWorld	Generate reasoning and forecasts, then train on realized outcomes using Brier-style or preference rewards.	Directly relevant, but can reward lucky reasoning unless audited.
Step-level process supervision	Uesato et al.; Let's Verify Step by Step; Math-Shepherd	Judge whether each intermediate reasoning step is valid, either with human labels or automatic rollouts.	Conceptually important; mostly math rather than forecasting.
Rubrics as RL rewards	Rubrics as Rewards; Rubric Anchors; RGR-GRPO; OpenRubrics; Checklists Are Better Than Reward Models	Generate task-specific criteria, score outputs against criteria, and use the scores as dense RL rewards.	Useful for designing forecasting rubrics; not yet validated on tradable event forecasting.
Anti-false-positive process rewards	Curing Miracle Steps; Step-wise Rubric Rewards	Penalize correct answers reached by broken reasoning and avoid applying one scalar reward to all steps.	Highly relevant to avoiding lucky forecasts, but still needs forecasting-specific adaptation.

2. How do these methods work?

Mechanism	Training signal	Scientific principle	Forecasting risk
Outcome reward	Reward = function of final outcome, such as negative Brier, log score, or correctness.	Directly optimizes the target objective and is compatible with proper scoring rules.	High variance; a lucky forecast with poor reasoning is rewarded, and a good tail-risk analysis can be punished.
Process reward model	Reward each reasoning step for local validity.	Credit assignment is easier because the model receives feedback on where reasoning went wrong.	For forecasting, "valid reasoning" is not enough; the world can still resolve against a valid argument.
Rubrics as rewards	Use an LLM judge or verifier to score criteria such as base rates, evidence quality, causal mechanisms, uncertainty, and counterarguments.	Dense, interpretable reward can guide open-ended reasoning where exact verification is impossible.	Judge bias, rubric gaming, and over-rewarding visible reasoning style.
Automated rubric search	Generate candidate rubrics and select those whose score rankings correlate with outcome-based model rankings.	Turns "good reasoning" into a data-driven search problem rather than relying only on hand-written rubrics.	It is process-shaped but outcome-calibrated; if the search target is outcome Elo, the rubric can become a proxy for historical outcomes.
Step-wise rubric reward	Attribute rubric items to specific steps and combine per-step rewards with final outcome reward.	Fixes the error of giving every token in a response the same scalar reward.	Requires high-quality step segmentation and reliable judge attribution.
Delayed live outcome RL	Store prediction-time agent trajectories, wait for resolution, backfill rewards, then replay for policy updates.	Prevents leakage by construction and aligns training to real outcomes.	Slow labels, operational complexity, and still vulnerable to noisy outcome rewards.
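The contrast between an outcome-only reward and a hybrid reward can be shown numerically. This is a minimal sketch under stated assumptions: the outcome reward is taken to be 1 - (p - y)^2, the rubric score is assumed to be already normalized to [0, 1], and the mixing weight `w` is hypothetical. It is not the OpenForesight Accuracy+Brier reward and not Echo's training signal.

```python
def brier_reward(p, outcome):
    """Outcome-only reward in [0, 1]; 1 means a perfect forecast."""
    return 1.0 - (p - outcome) ** 2


def hybrid_reward(p, outcome, rubric_score, w=0.5):
    """Convex mix of a process (rubric) score and the Brier-based outcome reward."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be in [0, 1]")
    return w * rubric_score + (1.0 - w) * brier_reward(p, outcome)


# A well-reasoned 90% forecast that resolves NO is punished hard by the outcome reward...
print(brier_reward(0.9, 0))
# ...while a hybrid reward keeps some credit for the reasoning process.
print(hybrid_reward(0.9, 0, rubric_score=1.0, w=0.5))
```

The same mechanism cuts the other way: a high rubric score can mask a badly calibrated probability, which is why the table lists judge bias and rubric gaming as risks.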

3. What Echo appears to add

Echo combines four ideas that were previously scattered: live/future questions, prediction-point aligned Elo evaluation, automated rubric search over forecasting trajectories, and Map-Reduce agent decomposition. The WeChat/QbitAI article describes rubrics such as precursor/external-catalyst evaluation and multi-factor causal synthesis: the system checks whether the model identifies concrete forward-looking catalysts, links them to historical correlations, and integrates independent factors into a causal probability judgment.

The important caveat is circularity. Echo's public description says candidate rubrics are selected by maximizing the correlation between rubric-based rankings and outcome-based Elo rankings. That means the final training signal is not purely independent process truth. It is a process proxy tuned against realized outcomes. This may be useful, but it must be tested against outcome-only RL, rubric-only RL, and hybrid process-plus-Brier RL on fresh markets.
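The selection step described here, and the circularity it creates, is easier to see in code. The sketch below is an assumption-laden illustration rather than Echo's implementation: it scores each candidate rubric by the Spearman rank correlation between per-model mean rubric scores and outcome-based Elo, then keeps the top candidates. The model names, rubric names, scores, and function names are all hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def select_rubrics(outcome_elo, rubric_scores, top_k=1):
    """Keep the rubrics whose model ranking agrees best with outcome-based Elo."""
    models = sorted(outcome_elo)
    elo = [outcome_elo[m] for m in models]
    scored = [
        (spearman([per_model[m] for m in models], elo), name)
        for name, per_model in rubric_scores.items()
    ]
    scored.sort(reverse=True)
    return [(name, rho) for rho, name in scored[:top_k]]


outcome_elo = {"m1": 1030.0, "m2": 1010.0, "m3": 990.0, "m4": 970.0}
rubric_scores = {
    "catalysts": {"m1": 0.9, "m2": 0.7, "m3": 0.5, "m4": 0.2},
    "verbosity": {"m1": 0.1, "m2": 0.4, "m3": 0.6, "m4": 0.8},
}
print(select_rubrics(outcome_elo, rubric_scores))
```

The selection target is the outcome ranking itself, so the surviving rubric is by construction the one that best reproduces past outcomes. Computing the correlation on a held-out set of questions reduces that dependence but does not remove it.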

What Our Paper Could Add Beyond The Existing Literature

The clean contribution is not just another trained forecaster. The publishable gap is to test whether reasoning-process rewards actually improve tradable event forecasting once leakage, market prices, and realistic execution are controlled.

Existing literature	What it has done	Remaining gap	Our contribution
Forecasting systems	RAG, aggregation, dynamic benchmarks, OpenForesight-style outcome RL.	Mostly evaluates Brier/accuracy, not whether process rewards improve market alpha.	Run outcome-only, rubric-only, and hybrid models on the same timestamped Kalshi/Polymarket split.
Echo / Train-on-Future	Proposes live future questions, point-aligned Elo, automated rubric search, and Map-Reduce agents.	Not peer-reviewed; no public EchoZ weights found; rubric search is outcome-calibrated.	Reproduce the idea transparently with hosted data, held-out prospective questions, and ablations.
Process-supervision papers	Show that step-level or rubric rewards can reduce false-positive reasoning in math/open-ended tasks.	They do not solve stochastic world outcomes, market prices, or temporal leakage.	Design forecasting-specific rubrics for base rates, catalysts, causal synthesis, source dating, and calibrated uncertainty.
Leakage papers	Use dated corpora, prospective questions, and claim-level audits.	Leakage audits rarely apply to the training reward itself.	Audit whether the rubric rewards leaked claims, not just whether final answers were contaminated.
Prediction-market papers	Show models can be calibrated poorly and can lose money despite plausible forecasts.	They rarely ask which reasoning failures cause bad trades.	Connect reasoning-rubric dimensions to Brier Skill Score, expected value, P&L, drawdown, and deference to market price.

Recommended experimental design: start with Qwen3-8B/OpenForecaster-8B, build forecasting rubrics from resolved training markets, then compare four systems: base agent, outcome-RL agent, process-rubric agent, and hybrid process+Brier agent. Final claims should be made only on prospective or strictly point-in-time market questions, with market price as a mandatory baseline.

Prediction Markets: What Changes When Events Are Tradable

Prediction-market forecasting is not just another Brier benchmark. A market event has prices, bid/ask spreads, depth, fees, market makers, resolution rules, and trader selection. A model can improve Brier and still lose money if it trades too late, crosses wide spreads, overbets low-liquidity contracts, or follows public prices without adding information.
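A small expected-value calculation shows how a forecast that disagrees with the mid price can still be untradable. The sketch assumes one contract paying 1 on resolution, a flat per-contract fee, and that buying NO costs one minus the best YES bid; real venue fee schedules and order books differ, and the function names are hypothetical.

```python
def ev_buy_yes(p, ask, fee=0.0):
    """Expected profit of buying one YES contract at the ask, given belief p."""
    return p - ask - fee


def ev_buy_no(p, bid, fee=0.0):
    """Expected profit of buying one NO contract at 1 - bid, given belief p."""
    return (1.0 - p) - (1.0 - bid) - fee


def best_action(p, bid, ask, fee=0.0):
    """Pick the side with positive expected value, or abstain if neither has one."""
    ev_yes = ev_buy_yes(p, ask, fee)
    ev_no = ev_buy_no(p, bid, fee)
    if max(ev_yes, ev_no) <= 0:
        return "abstain", 0.0
    return ("buy_yes", ev_yes) if ev_yes >= ev_no else ("buy_no", ev_no)


# The model says 0.55 and the mid price is 0.53, but the spread and fee absorb the edge.
print(best_action(p=0.55, bid=0.50, ask=0.56, fee=0.01))
# A larger disagreement on a tighter book clears the costs.
print(best_action(p=0.70, bid=0.58, ask=0.60, fee=0.01))
```

Even this calculation is optimistic: it ignores depth, partial fills, and the possibility that the model's probability is itself miscalibrated.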

Paper	What it did	Lesson
KalshiBench	300 temporally filtered Kalshi questions, five frontier models.	Overconfidence is widespread; calibration is a distinct capability.
Prophet Arena	1,367 resolved events and 72,136 markets with common contexts.	Evaluate Brier, calibration, and economic value together.
Prediction Arena	Autonomous agents trading with real capital on Kalshi and Polymarket.	Live trading exposes model, execution, and venue weaknesses.
PolyBench	38,666 Polymarket markets with CLOB snapshots and news.	Order-book simulation and slippage are core evaluation components.
TimeSeek	150 Kalshi markets at five lifecycle checkpoints, with/without search.	LLMs add the most value early and in high-uncertainty regimes.
PolySwarm / Evidence Markets	Multi-agent trading and limits of evidence aggregation.	Reflexivity, manipulation, and resolution ambiguity are open problems.

Strategic implication: the differentiated project is not simply "train a better model." It is to measure when a model adds information beyond price, when it should defer to the market, and whether that edge survives realistic execution.

Training Strategy

Training is feasible at 8B scale, but the decision should be gated by baselines. The right default is to build an agentic forecaster and evaluator first, then train only where the evaluator shows model-specific errors that post-training can plausibly fix.

Option	Pros	Cons	When to use
No training: agent + ensemble	Fast, cheap, strong baseline; can use BLF-style belief states and calibration.	Depends on API models; not a proprietary asset.	First milestone.
Closed/API model improvement	Improves forecasts through frozen retrieval, personas, self-consistency, market-price gating, calibration, and rubric feedback without changing weights.	Harder to own as a model artifact; API providers can change model behavior.	Run in parallel with dataset construction and use outputs to create training data.
LoRA/SFT on 8B	Cheap and stable; teaches format, base-rate discipline, and calibration language.	Limited true forecasting gain if data are weak.	After evaluation stack exists.
8B GRPO/ReMax RL	Closest to OpenForesight and outcome-RL recipes.	~1,000 H100-hr final run; reward and leakage risks.	After baselines show clear training target.
Market-specific RL	Targets tradable events directly.	P&L reward is noisy and overfit-prone; market impact issues.	Only after probability calibration works.
30B-A3B or larger	Potentially stronger reasoning and retrieval synthesis.	Multi-node complexity and six-figure ablation budget.	Only if 8B beats market/ensemble baselines cleanly.

What We Should Do, Step By Step

Step 0

Define the exact target.
Tradable binary/multi-outcome events on Kalshi and Polymarket, plus open-ended event questions for broader generalization.

Done when: we have a written schema for question, forecast date, resolution criteria, source cutoff, market snapshot, outcome, and execution assumptions.
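One way to make the written schema executable is a pair of frozen dataclasses with basic point-in-time checks. This is a sketch of what such a record could look like, not an agreed format: the class names, field names, and the single `fee_per_contract` execution assumption are all hypothetical placeholders for whatever the final schema specifies.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class MarketSnapshot:
    venue: str
    captured_at: datetime
    bid: float
    ask: float


@dataclass(frozen=True)
class ForecastRecord:
    question: str
    resolution_criteria: str
    forecast_date: datetime
    source_cutoff: datetime
    market_snapshot: MarketSnapshot
    fee_per_contract: float          # execution assumption (illustrative)
    outcome: Optional[int] = None    # stays None until the event resolves

    def __post_init__(self):
        # Evidence and prices must both predate the forecast.
        if self.source_cutoff > self.forecast_date:
            raise ValueError("source cutoff is after the forecast date")
        if self.market_snapshot.captured_at > self.forecast_date:
            raise ValueError("market snapshot is after the forecast date")
        if not 0.0 <= self.market_snapshot.bid <= self.market_snapshot.ask <= 1.0:
            raise ValueError("bid/ask must satisfy 0 <= bid <= ask <= 1")
```

Rejecting inconsistent records at construction time means a leakage-prone row fails loudly when it is ingested instead of silently entering a backtest.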

Step 1

Build the market data repository.
Store market prices, bid/ask, CLOB depth, volume, fees, resolution text, close/resolution dates, and snapshots. Do not rely on final market pages.

Done when: every forecast can be replayed from a point-in-time state.

Step 2

Build the timestamped retrieval corpus.
Use static snapshots of news, official releases, SEC/EDGAR, FRED/BLS/BEA, sports APIs, and market pages. Hash documents and retain publication/retrieval timestamps.

Done when: a forecast on date T cannot retrieve documents published after T.
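The hashing and cutoff requirements in this step can be sketched with the standard library. The example assumes documents are filtered on their publication timestamp and hashed with SHA-256 over the UTF-8 text; the `Document` type, its fields, and the function name are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Document:
    url: str
    text: str
    published_at: datetime
    retrieved_at: datetime

    @property
    def sha256(self):
        """Content hash, so a later edit to the source page is detectable."""
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()


def admissible_documents(docs, forecast_time):
    """Return only documents published at or before the forecast time."""
    return [d for d in docs if d.published_at <= forecast_time]


docs = [
    Document("https://example.com/a", "early report",
             datetime(2026, 1, 5), datetime(2026, 1, 6)),
    Document("https://example.com/b", "result announced",
             datetime(2026, 2, 1), datetime(2026, 2, 1)),
]
for doc in admissible_documents(docs, forecast_time=datetime(2026, 1, 15)):
    print(doc.url, doc.sha256[:12])
```

Publication timestamps from news metadata can be wrong, so the filter is only as good as the dating. Keeping the retrieval timestamp alongside it gives an independent upper bound on when the text existed.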

Step 3

Implement evaluation before modeling.
Brier, log score, calibration curves, ECE, market-price Brier Skill Score, P&L with fees/spreads/slippage, drawdown, and coverage-risk curves.

Done when: market price and naive baselines can be reproduced automatically.
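Drawdown is the least standard metric in this list, so a short sketch may help. It assumes drawdown is measured in the same units as P&L, as the largest peak-to-trough fall of cumulative P&L, with the starting equity of zero counted as the first peak. The function name is hypothetical.

```python
def max_drawdown(trade_pnls):
    """Largest peak-to-trough decline of cumulative P&L over a sequence of trades."""
    equity = 0.0
    peak = 0.0
    worst = 0.0
    for pnl in trade_pnls:
        equity += pnl
        peak = max(peak, equity)           # highest cumulative P&L seen so far
        worst = max(worst, peak - equity)  # deepest fall below that peak
    return worst


# Cumulative P&L runs 1, -1, -0.5, -1.5, 1.5: the worst fall is from 1 down to -1.5.
print(max_drawdown([1.0, -2.0, 0.5, -1.0, 3.0]))
```

Reporting drawdown next to total P&L shows whether a positive result depended on surviving a long losing streak first.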

Step 4

Run no-training baselines.
Market price, market price plus calibration transforms, OpenForecaster-8B, frontier single models, frontier ensembles, and BLF-style agentic search with logit aggregation.

Done when: we know where models add value beyond price and where they should defer.

Step 5

Add leakage audit.
Start with source cutoff checks, answer alias filtering, and post-cutoff phrase detection. Add a claim-level Shapley-DCLR-style audit for final experiments.

Done when: each result has a leakage score or a clean prospective design.
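
The three cheap checks (source cutoff, answer aliases, post-cutoff phrasing) can be sketched as a single flagging function. The phrase patterns below are illustrative placeholders rather than a validated list, and leakage_flags is a hypothetical helper. This does not implement Shapley-DCLR; it only marks documents for review or exclusion.

```python
import re
from datetime import date

# Hypothetical hindsight phrases; a real list would be tuned on audited examples.
POST_CUTOFF_PATTERNS = [
    r"\bas it turned out\b",
    r"\bin hindsight\b",
    r"\bwas later (?:confirmed|announced|revealed)\b",
    r"\bultimately (?:won|lost|passed|failed)\b",
]


def leakage_flags(text, doc_date, cutoff, answer_aliases):
    """Return a list of leakage flags for one retrieved document."""
    flags = []
    if doc_date > cutoff:
        flags.append("source_after_cutoff")
    lowered = text.lower()
    for alias in answer_aliases:
        # Whole-word match on each known alias of the resolved answer.
        if re.search(r"\b" + re.escape(alias.lower()) + r"\b", lowered):
            flags.append("answer_alias:" + alias)
    if any(re.search(pat, lowered) for pat in POST_CUTOFF_PATTERNS):
        flags.append("post_cutoff_phrase")
    return flags
```

An alias match on a document dated before the cutoff is not proof of leakage, since candidates are often named in advance, so alias flags are better treated as a trigger for inspection than as an automatic exclusion.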

Step 6

Train small.
Reproduce OpenForesight/OpenForecaster-style training on Qwen3-8B with structured probability outputs, random retrieved chunks, Accuracy+Brier reward, no std-normalized advantages, and strict formatting guardrails.

Done when: the model beats base Qwen3-8B, OpenForecaster-8B, and ensemble baselines on fresh held-out questions.
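
The reward and advantage pieces of this step can be sketched as follows. Several details are assumptions, not the published recipe: the reward form (correctness minus squared error of the stated probability), the tag-based output format, the exact-match answer check, and the format penalty value. The advantage function subtracts the group mean and does not divide by the group standard deviation, which is what "no std-normalized advantages" is taken to mean here.

```python
import re

# Hypothetical structured output format: <answer>...</answer><probability>p</probability>
_FORECAST = re.compile(
    r"<answer>(.+?)</answer>\s*<probability>([01](?:\.\d+)?)</probability>\s*$",
    re.S,
)


def parse_forecast(completion):
    """Return (answer, probability), or None if the output breaks the format."""
    m = _FORECAST.search(completion)
    if not m:
        return None
    p = float(m.group(2))
    if not 0.0 <= p <= 1.0:
        return None
    return m.group(1).strip(), p


def accuracy_brier_reward(is_correct, prob):
    """Correctness term minus the squared error of the stated probability."""
    c = 1.0 if is_correct else 0.0
    return c - (prob - c) ** 2


def score_completion(completion, gold_answer, format_penalty=-2.0):
    """Malformed outputs score below any valid forecast (whose minimum is -1.0)."""
    parsed = parse_forecast(completion)
    if parsed is None:
        return format_penalty
    answer, p = parsed
    # Exact match stands in for whatever semantic matcher the pipeline uses.
    return accuracy_brier_reward(answer.lower() == gold_answer.lower(), p)


def group_advantages(rewards):
    """Group-relative advantages: subtract the group mean, no division by std."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

Under this reward, a confident correct forecast scores near 1, a confident wrong forecast near -1, and a hedged 0.5 forecast scores 0.75 when right and -0.25 when wrong.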

Step 7

Train market-specific calibration, not direct trading first.
Use Brier/log rewards and calibration losses on resolved tradable events. Treat P&L as a holdout selection metric, not the first reward.

Done when: gains survive fees, spreads, liquidity, and out-of-time markets.
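
Using P&L as a holdout metric requires a fixed trading rule applied to the model's probabilities. The sketch below assumes one contract per trade, a flat per-contract fee, fills at the quoted bid or ask with no depth limit, and a minimum edge threshold. All of these are simplifying assumptions, and the flat fee is not a real venue's fee schedule.

```python
def trade_pnl(p, bid, ask, outcome, fee=0.01, min_edge=0.02):
    """Per-contract P&L of a threshold rule on a binary market.

    Buy YES at the ask if the model is above it by more than fee + min_edge;
    buy NO (at 1 - bid) if the model is below the bid by the same margin;
    otherwise do not trade.
    """
    if p - ask > fee + min_edge:
        return outcome - ask - fee
    if bid - p > fee + min_edge:
        return (1 - outcome) - (1 - bid) - fee
    return 0.0


def max_drawdown(pnls):
    """Largest peak-to-trough drop of cumulative P&L, starting from zero."""
    equity = peak = worst = 0.0
    for x in pnls:
        equity += x
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst
```

Because the rule only trades when the model disagrees with the quote by more than spread plus fee, a model that merely tracks the market price produces zero trades and zero P&L, which is the desired behavior for a deferring system.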

Step 8

Decide on large-model training.
Only scale to 30B-A3B if 8B results show robust leakage-clean alpha or if the scientific contribution requires open weights at larger scale.

Done when: the expected value of the additional compute exceeds that of cheap ensemble and agent alternatives.

Research Questions Worth Writing A Paper Around

Calibration to profit: characterize when a calibrated probability forecaster produces positive expected return under CLOB fees, spreads, and depth.
Selective deference: learn a gate that chooses among market price, model, search agent, and abstention by horizon, liquidity, category, and uncertainty.
Leakage-robust backtesting: adapt Shapley-DCLR to prediction-market rationales and open-ended forecasts.
Cross-venue price discovery: measure Kalshi vs Polymarket lead-lag and whether LLMs exploit or merely follow it.
Reflexivity: test whether public AI forecasts alter prices and degrade future training labels.
Aggregation vs training: compare OpenForesight-style training to a cost-matched ensemble of frontier models.
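
As one illustration, the cross-venue lead-lag question above could start from a lagged correlation of price changes on matched markets. The sketch assumes two equal-length series of price changes sampled on a shared clock; best_lag and pearson are hypothetical helpers, and a real study would need to handle asynchronous ticks and significance testing.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5


def best_lag(x_changes, y_changes, max_lag=3):
    """Correlate x[t] with y[t + lag]; a positive best lag suggests x leads y."""
    n = len(x_changes)
    scores = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x_changes[: n - lag], y_changes[lag:]
        else:
            a, b = x_changes[-lag:], y_changes[: n + lag]
        scores[lag] = pearson(a, b)
    return max(scores, key=scores.get), scores
```

If the changes on one venue are simply a one-step delayed copy of the other, the function returns a lag of 1 with correlation close to 1.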

Chronological Stream Matrix and Hosted PDF Library

The table is ordered chronologically by first public paper date. It classifies each paper into a stream and states what the paper solved on top of the previous literature. I removed the older pre-LLM market-theory papers from this visible ranked library so the matrix focuses on modern LLM forecasting, datasets, rewards, social forecasting, and API/system improvement.

Stream	Date	Venue / publication	Paper	What it solves on top of previous work	PDF
Dataset Construction	Jun 2022	NeurIPS 2022 Datasets and Benchmarks	Autocast	Introduces dated event-forecasting questions and a time-indexed news corpus, making backcasting possible instead of ordinary QA on resolved facts.	PDF
API/System Improvement	Jul 2022	arXiv preprint	Language Models (Mostly) Know What They Know	Adds the idea that models may expose usable uncertainty, which later supports API-level calibration and confidence elicitation.	PDF
API/System Improvement	Oct 2022	ICLR 2023	ReAct	Moves from static prompting to interleaved reasoning and action, the basic pattern for search-based forecasting agents.	PDF
Process-Based Rewards	Nov 2022	arXiv preprint	Solving Math Word Problems with Process- and Outcome-Based Feedback	Separates process supervision from outcome supervision and frames why final-answer reward is weak credit assignment.	PDF
API/System Improvement	Feb 2023	NeurIPS 2023	Toolformer	Shows models can learn tool use, extending ReAct from prompting into trainable tool-augmented behavior.	PDF
API/System Improvement	May 2023	arXiv preprint	Just Ask for Calibration	Shows verbalized probabilities can improve calibration, a direct method for closed API forecasting systems.	PDF
Process-Based Rewards	May 2023	arXiv preprint	Let's Verify Step by Step	Demonstrates that step-level reward models can outperform outcome-only reward models, strengthening the case for reasoning rewards.	PDF
Social Forecasting	Oct 2023	arXiv preprint	Large Language Model Prediction Capabilities	Tests GPT-4 against a live human forecasting tournament, establishing that raw frontier APIs underperform human crowds.	PDF
Process-Based Rewards	Dec 2023	arXiv preprint	Math-Shepherd	Automates process labels with rollouts, reducing dependence on expensive human step annotations.	PDF
Process-Based Rewards	Feb 2024	arXiv technical report	DeepSeekMath / GRPO	Introduces GRPO-style critic-free RL, later reused by forecasting post-training recipes.	PDF
Social Forecasting	Feb 2024	ACM TiiS; final Feb 2025	AI-Augmented Predictions	Shows LLM assistance can improve human forecasting, moving beyond model-only evaluation.	PDF
Social Forecasting	Feb 2024	NeurIPS 2024	Approaching Human-Level Forecasting with Language Models	Combines retrieval, reasoning, and aggregation to approach human crowd performance, showing system design beats raw prompting.	PDF
Social Forecasting	Feb 2024	Science Advances; final Nov 2024	Wisdom of the Silicon Crowd	Shows model ensembles can rival human crowds and that human/model aggregation is a powerful low-compute baseline.	PDF
Dataset Construction	May 2024	arXiv preprint	Freshbench / Is Your LLM Outdated?	Pushes evaluation toward fresh, post-cutoff questions so benchmark results are not stale memorization tests.	PDF
API/System Improvement	Jun 2024	arXiv preprint	Can Language Models Use Forecasting Strategies?	Tests whether prompting can elicit superforecasting heuristics, clarifying limits of API-only improvement.	PDF
Dataset Construction	Jul 2024	arXiv preprint	MIRAI	Adds agentic event forecasting with structured historical events and news, broadening datasets beyond static text questions.	PDF
Dataset Construction	Sep 2024	ICLR 2025	ForecastBench	Introduces a dynamic unresolved-question benchmark, solving the core leakage problem by waiting for future resolutions.	PDF
Dataset Construction	Jan 2025	arXiv preprint	Navigating Tomorrow	Studies pre/post-cutoff behavior, making model cutoff itself an explicit evaluation variable.	PDF
Process-Based Rewards	Jan 2025	arXiv technical report	DeepSeek-R1	Shows large-scale RLVR can improve reasoning while warning that reward models and verifiers can be gamed.	PDF
Dataset Construction	Jan 2025	COLING 2025	OpenForecast	Expands forecasting from binary questions to large-scale open-ended event decomposition and semantic resolution.	PDF
Process-Based Rewards	Feb 2025	arXiv preprint	LLMs Can Teach Themselves to Better Predict the Future	Uses self-play reasoning traces ranked by eventual outcomes, showing post-training can improve forecasting without human-written rationales.	PDF
API/System Improvement	Mar 2025	COLM 2025	Search-R1	Trains search as part of reasoning and masks retrieved tokens, separating model behavior from copied evidence.	PDF
Process-Based Rewards	Mar 2025	arXiv open RL report	DAPO	Improves GRPO-style RL infrastructure with stability tricks later applicable to forecasting RL.	PDF
Process-Based Rewards	May 2025	TMLR; published Nov 2025	Outcome-Based RL to Predict the Future	Adapts RL to noisy delayed binary outcomes and shows calibration/profit can improve, but still depends on outcome rewards.	PDF
Dataset Construction	May 2025	arXiv preprint	ExAnte	Formalizes ex-ante inference and evaluates whether models recall post-cutoff outcomes rather than forecasting.	PDF
Dataset Construction	Jun 2025	arXiv preprint	Pitfalls in Evaluating LM Forecasters	Audits evaluation methodology and explains why static benchmark gains can overstate real forecasting skill.	PDF
API/System Improvement	Jun 2025	arXiv preprint	Prompt Engineering LLM Forecasting Capabilities	Systematically tests prompt variants, showing API-level gains are modest and must be measured rather than assumed.	PDF
Dataset Construction	Jun 2025	arXiv preprint	Bench to the Future	Builds frozen-corpus pastcasting so researchers can iterate faster than prospective benchmarks allow.	PDF
Process-Based Rewards	Jul 2025	ICLR 2026	Beyond Binary Rewards / RLCR	Adds calibrated Brier-style reward to correctness, addressing overconfident outcome-only RL.	PDF
Process-Based Rewards	Jul 2025	arXiv preprint	Rubrics as Rewards	Uses structured rubrics as RL rewards for open-ended tasks, directly motivating Echo-style process scoring.	PDF
Process-Based Rewards	Jul 2025	arXiv preprint	Checklists Are Better Than Reward Models	Shows instruction-specific checklists can outperform fixed scalar reward models, supporting task-specific forecasting rubrics.	PDF
Process-Based Rewards	Jul 2025	arXiv preprint	Advancing Event Forecasting through Massive Training of LLMs	Scales event-forecasting post-training, shifting the literature from evaluation to direct model improvement.	PDF
Dataset Construction	Aug 2025	arXiv preprint	FutureX	Operationalizes live, continuously updated agent evaluation over many models and event sources.	PDF
Process-Based Rewards	Aug 2025	arXiv technical report	Reinforcement Learning with Rubric Anchors	Shows rubric-based RL can train a Qwen-30B-A3B open-ended reasoning model, relevant to UniScientist-style training.	PDF
API/System Improvement	Aug 2025	NeurIPS 2025	ConfTuner	Turns verbalized confidence into a trainable Brier-style objective, bridging API confidence and post-training.	PDF
Process-Based Rewards	Oct 2025	arXiv preprint	OpenRubrics	Scales synthetic rubric generation, solving the manual-rubric bottleneck for process-reward training.	PDF
Process-Based Rewards	Oct 2025	arXiv preprint	Curing Miracle Steps	Targets false positives where correct answers arise from invalid reasoning, the exact analogue of lucky forecasts.	PDF
Dataset Construction	Oct 2025	arXiv preprint	Prophet Arena	Adds common-context prediction-market evaluation with market prices, linking forecasting benchmarks to tradable events.	PDF
Process-Based Rewards	Nov 2025	arXiv preprint	Reward and Guidance through Rubrics / RGR-GRPO	Combines rubric reward with offline guidance, improving exploration beyond ordinary verifiable-reward RL.	PDF
Dataset Construction	Dec 2025	arXiv preprint	KalshiBench	Uses resolved Kalshi questions after model cutoffs to measure overconfidence on regulated prediction-market events.	PDF
Dataset Construction	Dec 2025	arXiv preprint	OpenForesight / OpenForecaster	Combines synthetic open-ended forecast data, frozen retrieval, Accuracy+Brier reward, and a public 8B trained model.	PDF
Process-Based Rewards	Feb 2026	arXiv preprint	All Leaks Count / Shapley-DCLR	Moves leakage analysis from dataset-level hygiene to claim-level decision impact, enabling leakage-weighted rewards/evaluation.	PDF
Social Forecasting	Apr 2026	arXiv preprint	PolySwarm	Tests multi-agent LLM trading behavior, expanding social forecasting into interacting market agents.	PDF
Dataset Construction	Apr 2026	2nd ICLR Workshop on Advances in Financial AI	TimeSeek	Evaluates Kalshi markets at multiple lifecycle checkpoints and shows when search/API models help or hurt.	PDF
Dataset Construction	Apr 2026	arXiv preprint	Prediction Arena	Uses real capital on live Kalshi/Polymarket agents, making execution and realized returns part of evaluation.	PDF
Dataset Construction	Apr 2026	arXiv preprint	PolyBench	Adds CLOB snapshots, news, and simulated returns, addressing the gap between Brier evaluation and tradable performance.	PDF
Social Forecasting	Apr 2026	arXiv preprint	Agentic Forecasting using Sequential Bayesian Updating	Introduces belief-state updates and aggregation as a principled alternative to one-shot model forecasts.	PDF
Dataset Construction	Apr 2026	arXiv preprint	BTF-2	Extends frozen-corpus pastcasting with a larger corpus and reasoning traces, improving reproducibility.	PDF
Process-Based Rewards	Apr 2026	arXiv preprint	FutureWorld	Closes the loop from live questions to delayed real-world rewards, contrasting outcome backfill with process-only rewards.	PDF
Process-Based Rewards	May 2026	arXiv preprint	Teaching LLMs When Not to Know	Uses training and prompt-based critique to make models reject post-cutoff knowledge, supporting temporal admissibility in reasoning rewards.	PDF
Process-Based Rewards	May 2026	arXiv preprint	Step-wise Rubric Rewards	Attributes rubric items to individual reasoning steps, fixing the problem of applying one scalar rubric reward to all tokens.	PDF
Social Forecasting	May 2026	arXiv preprint	Multi-Agent AI Oracles for Prediction Market Resolution	Uses multiple agents for market resolution, extending social forecasting to adjudication and settlement.	PDF
API/System Improvement	Jun 2026	arXiv preprint	LLMs Are Overconfident	Consolidates evidence that reasoning/API models remain overconfident, reinforcing calibration as a separate improvement target.	PDF
Social Forecasting	Jun 2026	arXiv preprint	Evidence Markets	Analyzes limits of markets as evidence aggregators, warning that social forecasting systems also need structure and governance.	PDF