Thesis
The literature has moved from "can an LLM answer forecasting questions?" to "can an agentic system, with timestamped retrieval, calibrated probabilities, and market-aware evaluation, add information beyond human crowds and prediction-market prices?" The answer is mixed. LLM systems now approach strong human aggregates in some benchmark settings, but prediction-market studies show that Brier score, calibration, and tradable profit are different targets.
Project recommendation: do not begin with a 30B-scale training run. First build the market data, leakage audit, evaluation, and baseline stack. Then reproduce an 8B OpenForesight-style training recipe. Train larger models only after small systems beat market prices, cheap ensembles, and OpenForecaster-8B on prospective questions after fees, spreads, slippage, and liquidity.
Compute Reality
The final run is not the budget. The ablation loop is the budget.
Market Reality
- KalshiBench: frontier models are systematically overconfident.
- Prediction Arena: live Kalshi agents lost money in the reported cohort.
- PolyBench: only 2 of 7 models had positive simulated order-book returns.
- TimeSeek: models are most useful early and on uncertain markets.
Failure Modes
- Retrieval leakage and bad source timestamps.
- Parametric memory of post-cutoff outcomes.
- Outcome rewards reinforcing lucky reasoning.
- Rubric rewards becoming circular outcome proxies.
- Prompting that increases confidence without accuracy.
- P&L overfitting to fees, depth, and market selection.
Best Public Starting Point
OpenForecaster-8B is the most relevant open model. It is Qwen3-8B post-trained on OpenForesight with GRPO and an Accuracy+Brier reward. UniPat EchoZ is useful as a design reference, but I did not find public Hugging Face EchoZ weights.
Dataset Repository Status
I built a local dataset catalog and downloader, then pulled the manageable public forecasting and prediction-market datasets onto the server. Raw datasets are stored locally under data/raw/; the website hosts manifests and schema profiles, not the raw multi-gigabyte data.
Downloaded
High-Value Sources Now Local
- OpenForesight
- ForecastBench
- FutureX
- KalshiBench v1/v2
- Prophet Arena
- Autocast
- MIRAI
- PolyBench
- Kalshi trades
- Polymarket samples
- Metaculus snapshots
Schema signals: KalshiBench has ground_truth and market_probability; OpenForesight has questions, resolution criteria, answers, article timestamps, and generated forecasts; Kalshi/Polymarket mirrors provide prices, trades, bid/ask fields, and order-book-like samples.
Skipped By Policy
- Polymarket full SII mirror: 159.11 GiB.
- Polymarket on-chain mirror: 118.45 GiB.
- Polymarket crypto up/down: 27.08 GiB.
- Polymarket crypto derivatives: 17.90 GiB and 53k files.
- OpenForecast: manual Google Drive download.
Artifacts: download inventory, schema profile, download JSON, profile JSON, and pipeline notes.
Server-Side Evaluation Pipeline
The first evaluation package is now in forecast_eval/. It scores binary event forecasts from CSV, JSON, JSONL, or parquet and outputs machine-readable metrics. This is the right first stage: before training, we need stable scoring, market baselines, and point-in-time replay.
| Component | Implemented now | Next extension |
|---|---|---|
| Proper scoring | Brier score, log score, accuracy at 0.5. | Multiclass Brier/log scoring for Echo, Prophet Arena, and non-binary markets. |
| Calibration | Calibration bins and expected calibration error. | Reliability curves by horizon, source, liquidity, and model family. |
| Market comparison | Market-price Brier skill score. | Matched forecast/market snapshots with bid/ask and timestamp alignment. |
| Trading backtest | One-share edge-triggered P&L against market probability. | Fees, spreads, slippage, depth, partial fills, drawdown, and selective abstention. |
| Validation | Unit tests passed; sample metrics written to data/eval/. | Dataset adapters for KalshiBench, OpenForesight, ForecastBench, Prophet Arena, and Echo snapshots. |
Sample command: python3 -m forecast_eval.cli --predictions examples/predictions_sample.csv --out data/eval/sample_metrics.json --calibration-out data/eval/sample_calibration.csv --bins 5 --edge-threshold 0.05.
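The metrics in the table above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the code in forecast_eval/: the function names, the equal-width binning, and the sample numbers are all assumptions made for the example.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_score(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood (lower is better); clips to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p if y == 1 else 1.0 - p)
    return total / len(probs)


def expected_calibration_error(probs, outcomes, bins=5):
    """Equal-width ECE: weighted mean of |average forecast - observed rate|."""
    buckets = [[] for _ in range(bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * bins), bins - 1)  # p == 1.0 falls in the last bin
        buckets[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(mean_p - rate)
    return ece


def brier_skill_score(model_probs, market_probs, outcomes):
    """1 - BS_model / BS_market; positive means the model beats the price."""
    reference = brier_score(market_probs, outcomes)
    if reference == 0:
        raise ValueError("market Brier score is zero; skill score undefined")
    return 1.0 - brier_score(model_probs, outcomes) / reference


# Hypothetical sample data, for illustration only.
model = [0.9, 0.2, 0.7, 0.4]
market = [0.8, 0.3, 0.6, 0.5]
resolved = [1, 0, 1, 0]
print(brier_score(model, resolved), brier_skill_score(model, market, resolved))
```

A positive skill score here only says the model's probabilities were closer to the outcomes than the market price on these rows. It says nothing yet about whether that edge survives execution costs.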
Echo / UniPat Deep Dive
Key finding: EchoZ-1.0 is visible through the public Echo leaderboard and API, but I did not find public Hugging Face EchoZ weights. The public UniPat Hugging Face model is UniScientist-30B-A3B, a rubric-trained scientific research model, not the Echo forecasting model.
Sources checked: UniPat Echo blog, Echo public leaderboard/API, QbitAI article, UniPat Hugging Face org, and UniScientist GitHub.
Live Echo Snapshot
Market Baseline
The Echo public API says the Market baseline uses Polymarket prices and only includes questions that exist on Polymarket.
What Echo Is Doing
- Dynamic question synthesis instead of static train-on-past data.
- Aligned prediction points so models are compared at the same question/time state.
- Brier-score differences converted into soft pairwise wins for a Bradley-Terry/Elo-style ranking.
- Rubrics Search: process rubrics are selected by correlation with held-out outcome-based Elo rankings.
- Map-Reduce agent architecture for parallel evidence gathering and final probability aggregation.
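The ranking step in the list above can be illustrated with a small sketch. Echo's actual mapping from Brier differences to soft wins is not given here, so the logistic form, the `scale` parameter, and the K-factor below are assumptions; only the Elo expected-score formula is standard.

```python
import math


def soft_win(brier_a, brier_b, scale=10.0):
    """Soft win for A over B in (0, 1); lower Brier is better.

    The logistic mapping and `scale` are hypothetical choices.
    """
    return 1.0 / (1.0 + math.exp(-scale * (brier_b - brier_a)))


def elo_update(rating_a, rating_b, score_a, k=16.0):
    """Standard Elo update where score_a may be fractional (a soft win)."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


# Two models forecast the same question at the same prediction point.
score = soft_win(brier_a=0.04, brier_b=0.25)
print(elo_update(1000.0, 1000.0, score))
```

Because both models are scored at the same question/time state, the soft win reflects forecast quality rather than a difference in available information.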
UniScientist, the Public Hugging Face Model
| Item | Finding | Why it matters |
|---|---|---|
| HF org | UnipatAI has one public model and seven public datasets. | No public EchoZ model or Echo dataset found there. |
| Model | UniScientist-30B-A3B; 31B params, BF16 safetensors, Qwen3 MoE tag, Apache 2.0. | Adjacent rubric/agentic research work, but not event forecasting weights. |
| Remote size | About 61.08 GB across 25 files. | I downloaded metadata/config/tokenizer files and cloned code, not weights. |
| Training claim | UniScientist blog reports ~1,200 H200 GPU-hours, 128k context, up to 100 tool calls, 4,700+ rubric-checked scientific instances. | Useful compute reference for rubric-trained agentic models. |
| Public Echo API | Rankings, question lists, question detail, model detail, and model cases expose probabilities and outcomes. | Enough to benchmark against Echo-style outputs, not enough to reproduce Echo training. |
Artifacts: deep-dive note, Echo public API snapshot, and UniScientist HF model metadata.
Date note: the March 2026 Echo blog and QbitAI article reported EchoZ-1.0 at Elo 1034.2. The live public API on July 1, 2026 reports 1024.1, so this page uses the newer snapshot for current status and treats the March number as historical.
QbitAI-Style Synthesis: Three Main Streams
The forecasting literature is best read as three converging streams. The first stream builds leakage-controlled datasets and benchmarks; the second stream tries to reward reasoning quality instead of only final outcomes; the third stream uses societies of models, personas, humans, or markets to improve forecasts without relying on one model's raw answer.
The shared goal is model improvement, but "improvement" does not always mean weight updates. For open models, the path is SFT, DPO, GRPO/ReMax, or rubric-based RL. For closed API models, the path is system-level improvement: better retrieval, frozen point-in-time context, persona/society aggregation, calibration layers, market-price gating, and rubric feedback loops. Those API systems can later generate datasets and rubrics for post-training an open model.
| Stream | Core object | How it handles leakage | How it improves the model/system |
|---|---|---|---|
| Dataset Construction | Point-in-time forecasting tasks, frozen corpora, frozen model versions, market snapshots, and resolution criteria. | Backcasting uses only documents available before the simulated forecast date; prospective benchmarks avoid resolved outcomes entirely. | Creates clean train/eval data so gains are not just memorization or retrieval leakage. |
| Process-Based Rewards | Reasoning traces, step-level validity, rubrics, temporal admissibility, and claim-level leakage weights. | Rewards can penalize leaked claims or down-weight outcomes whose rationale depends on post-cutoff facts. | Gives denser credit assignment than final outcomes, reducing reinforcement of lucky guesses. |
| Social Forecasting | Human crowds, model ensembles, personas, debate/society systems, market prices, and aggregation rules. | Diversity and aggregation reduce single-model hallucination; point-aligned comparisons keep forecasters in the same information state. | Improves API models without fine-tuning and provides strong baselines any trained model must beat. |
What This Implies
Dataset Construction is the foundation. A frozen corpus is not enough by itself; we also need frozen model versions, logged prompts, retrieval snapshots, market prices at forecast time, and clear resolution rules. Otherwise backcasting cannot distinguish forecasting skill from the model remembering or retrieving the answer.
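One way to make a retrieval snapshot auditable is to store a content hash next to the publication and retrieval timestamps. The sketch below is a minimal illustration under assumed names (`SnapshotRecord`, `make_snapshot_record`, and the example URL are hypothetical), not a description of the existing pipeline.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SnapshotRecord:
    source_url: str
    sha256: str        # hash of the exact bytes shown to the model
    published_at: str  # ISO 8601, UTC
    retrieved_at: str  # ISO 8601, UTC


def make_snapshot_record(source_url, content, published_at, retrieved_at):
    """Build an immutable record for one retrieved document."""
    for ts in (published_at, retrieved_at):
        if ts.tzinfo is None:
            # Naive timestamps make point-in-time comparisons ambiguous.
            raise ValueError("timestamps must be timezone-aware")
    return SnapshotRecord(
        source_url=source_url,
        sha256=hashlib.sha256(content).hexdigest(),
        published_at=published_at.astimezone(timezone.utc).isoformat(),
        retrieved_at=retrieved_at.astimezone(timezone.utc).isoformat(),
    )


record = make_snapshot_record(
    "https://example.com/release",  # placeholder URL
    b"official release text",
    datetime(2025, 1, 2, tzinfo=timezone.utc),
    datetime(2025, 1, 3, 12, 0, tzinfo=timezone.utc),
)
print(record)
```

Storing the hash lets a later audit confirm that the document replayed in a backtest is byte-identical to the one originally retrieved.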
Process-Based Rewards are the central scientific bet. Echo/QbitAI's Train-on-Future framing says the model should be rewarded for forecast reasoning quality, not just whether a stochastic outcome happened to resolve correctly. The right implementation is not rubric-only. It is an ablation: outcome-only vs process-only vs hybrid process+Brier, with a leakage audit of the rationale itself.
Social Forecasting is the strongest near-term API route. Personas, independent model samples, market-price anchors, human forecasts, and calibrated aggregation can improve closed frontier APIs without touching weights. This is not a distraction from training; it is the baseline and data generator for later training.
Scientific Principles Across The Literature
1. How Papers Deal With Temporal Leakage
The formal problem is ex-ante inference: at forecast date t_f, the model may only use information publicly knowable at or before t_f, while the event resolves at t_r > t_f. Leakage occurs when the prompt, retrieval corpus, benchmark, model memory, or evaluation process contains information from after t_f. The literature uses five broad controls.
| Leakage control | Scientific idea | Representative papers | Remaining weakness |
|---|---|---|---|
| Prospective evaluation | Ask questions whose outcomes are genuinely unknown at submission time, then wait for resolution. | ForecastBench, FutureX, Metaculus AI Benchmarking, Prediction Arena. | Slow feedback loop; small samples until enough events resolve. |
| Hermetic pastcasting | Backtest resolved events against a frozen corpus that contains only documents available before the simulated forecast date. | Autocast, Halawi et al., Bench to the Future, BTF-2. | Document metadata can be wrong; the base model may already know later outcomes. |
| Model-cutoff exploitation | Use events after the base model's training cutoff as "future" labels for training or evaluation. | OpenForesight, self-play/DPO, outcome-RL. | Model cutoff is approximate, model vendors rarely expose full data lineage, and future base models will contaminate old tests. |
| Source filtering and answer filtering | Remove post-resolution documents, filter aliases of the gold answer, and impose a buffer such as one month before resolution. | OpenForesight, KalshiBench, TimeSeek. | String filters miss indirect clues and cannot remove parametric-memory leakage. |
| Claim-level audit | Decompose the rationale into atomic claims, verify each claim's earliest public date, and weight leaked claims by decision impact. | Shapley-DCLR / TimeSPEC, ExAnte, Teaching LLMs When Not to Know. | Expensive; requires reliable source dating and claim verification. |
The important scientific shift is from binary dataset hygiene to causal attribution: not just "did any post-cutoff text appear?" but "did post-cutoff information materially drive the probability?" That is why Shapley-DCLR is conceptually important. It treats leakage as decision-critical information contamination, not merely a dataset property.
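The simplest of these controls, a source cutoff at t_f with an optional buffer before t_r, can be sketched as a filter over dated documents. The function name and the dictionary layout are assumptions for illustration. As the table notes, a filter like this cannot catch wrong metadata or parametric-memory leakage.

```python
from datetime import datetime, timedelta, timezone


def admissible_documents(docs, forecast_time, resolution_time=None,
                         buffer=timedelta(0)):
    """Keep documents published at or before the effective cutoff.

    The cutoff is t_f, tightened to (t_r - buffer) when that is earlier.
    Documents with no publication timestamp are dropped, which is the
    conservative choice for a leakage audit.
    """
    cutoff = forecast_time
    if resolution_time is not None:
        cutoff = min(cutoff, resolution_time - buffer)
    kept = []
    for doc in docs:
        published = doc.get("published_at")
        if published is not None and published <= cutoff:
            kept.append(doc)
    return kept


utc = timezone.utc
docs = [
    {"id": "a", "published_at": datetime(2025, 2, 1, tzinfo=utc)},
    {"id": "b", "published_at": datetime(2025, 3, 2, tzinfo=utc)},
    {"id": "c", "published_at": None},
]
t_f = datetime(2025, 3, 1, tzinfo=utc)
print([d["id"] for d in admissible_documents(docs, t_f)])
```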
2. How Papers Handle Forecasting Agents Given What The LLM Already Knows
Most systems treat the base LLM as a noisy prior: it has broad parametric knowledge up to an uncertain cutoff, but it is stale, overconfident, and may contain forbidden post-cutoff facts in backtests. The agent layer is designed to discipline that prior using timestamped evidence, repeated sampling, aggregation, and calibration.
| Agent mechanism | Principle | Used by | What it fixes |
|---|---|---|---|
| Dated retrieval | Condition the model on external evidence available at the forecast date, instead of relying only on parametric memory. | Autocast, Halawi et al., OpenForesight, TimeSeek. | Stale model knowledge and missing current events. |
| Search as an action | Let the model choose queries and iterate between reasoning and evidence acquisition. | ReAct, Search-R1, MIRAI, BLF-style agents. | One-shot retrieval misses relevant evidence and cannot adapt to uncertainty. |
| Retrieved-token masking | During RL, optimize the model's generated reasoning/search tokens, not copied retrieved text. | Search-R1 and search-RL descendants. | Prevents the training loss from treating retrieved passages as model behavior. |
| Belief state | Maintain a structured state with probability, evidence for/against, open questions, and update history. | Bayesian Linguistic Forecaster. | Avoids unbounded context stuffing and makes evidence updates auditable. |
| Multi-sample aggregation | Run multiple independent forecasts and average in probability or logit space. | Halawi et al., Silicon Crowd, BLF, many benchmark agents. | Reduces variance and idiosyncratic model errors. |
| Calibration layer | Map raw model probabilities to calibrated probabilities with Platt scaling, Brier training, or post-hoc reliability correction. | BLF, RLCR, ConfTuner, KalshiBench analysis. | Overconfidence and probability misreporting. |
| Selective deference | Choose whether to use the model, search, market price, ensemble, or abstain depending on regime. | TimeSeek motivates this; ForecastBench and market papers imply it. | Search and models are not uniformly useful across time, category, or market certainty. |
The key design principle is separation of roles. The LLM should not be trusted as both evidence store and judge. It should generate hypotheses, queries, and probability estimates, while the system controls timestamped evidence, market snapshots, calibration, aggregation, and leakage audit.
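Two of the mechanisms in the table, multi-sample aggregation in logit space and a post-hoc calibration layer, are small enough to sketch directly. The Platt-style map below is applied to the logit of the raw probability, and the coefficients `a` and `b` are placeholders; in practice they would be fitted on held-out resolved questions.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of p, clipped so that 0 and 1 stay finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def aggregate_logit(probs):
    """Average independent forecasts in logit space, then map back."""
    return sigmoid(sum(logit(p) for p in probs) / len(probs))


def platt_scale(p, a=1.0, b=0.0):
    """Platt-style recalibration: sigmoid(a * logit(p) + b).

    a < 1 shrinks forecasts toward 0.5, which counteracts overconfidence.
    """
    return sigmoid(a * logit(p) + b)


samples = [0.70, 0.85, 0.60, 0.90]  # hypothetical repeated forecasts
pooled = aggregate_logit(samples)
print(pooled, platt_scale(pooled, a=0.7))
```

Keeping aggregation and calibration outside the model matches the separation of roles described above: the LLM proposes probabilities and the system decides how much to trust them.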
3. Core Issues And How People Are Solving Them
| Core issue | Why it matters | Solutions across papers | What we should do |
|---|---|---|---|
| Labels live in the future | True online learning is slow because outcomes resolve later. | Pastcasting on frozen corpora; model-cutoff exploitation; dynamic prospective benchmarks. | Use pastcasting for iteration, but reserve final claims for prospective market splits. |
| Temporal leakage | Backtests can reward memorization instead of forecasting. | Dated retrieval, cutoff filtering, alias filtering, unresolved questions, Shapley-DCLR audits. | Implement source cutoff checks first; add claim-level audit for publication-grade results. |
| Overconfidence | Prediction markets punish miscalibration and overbetting. | Brier/log scores, RLCR, ConfTuner, Platt scaling, reliability diagrams. | Require structured probabilities and report calibration before P&L. |
| Outcome reward noise | Lucky guesses can look better than good reasoning on tail events. | Accuracy+Brier rewards, self-play pair ranking, process/rubric rewards, guardrails. | Use Brier-based rewards but evaluate reasoning quality and tail-event behavior separately. |
| Retrieval is useful but risky | Search can add current evidence or introduce noise/leakage. | Offline corpora, retrieved-token masking, date-filtered web search, selective tool use. | Train/evaluate both with and without search, then learn a gating policy. |
| Market price is a strong baseline | Profit requires beating the crowd, not merely forecasting decently. | Brier Skill Score vs market, CLOB simulation, live trading, market-price anchors. | Always compare to price and price transforms; evaluate after fees and liquidity. |
| Training may not beat ensembling | Silicon-crowd aggregation is cheap and strong. | LLM ensembles, human+machine averages, logit aggregation, hierarchical calibration. | Make cost-matched ensemble the baseline before any large RL run. |
| Open-ended answers are hard to score | Important events are not always binary markets. | LLM semantic equivalence grading, LRAE, answer-type filtering, resolution criteria. | Use binary markets for first training/evaluation; add open-ended questions after grading is reliable. |
Reasoning-Process Rewards, Rubric Rewards, and Echo Clarification
This is the distinction that matters for Echo/EchoZ. Most forecasting papers train on the realized outcome: accuracy, Brier score, log score, or distance to the resolved answer. Echo's public claim is different: it tries to make the reasoning trajectory itself part of the training signal through automated rubric search. That is closer to the process-supervision literature than to ordinary outcome-RL, but it is not free of outcome dependence.
1. Are there papers like this?
Yes, but they fall into two buckets. The first bucket is forecasting-specific: Echo/UniPat is the clearest public example of rubric-scored forecasting reasoning, and FutureWorld explicitly discusses Echo as a live predictive-agent system using rubric-based process rewards. The second bucket is general reasoning: math, science, medicine, instruction following, and open-ended tasks where researchers reward intermediate steps, checklists, or rubrics rather than only final correctness.
| Line of work | Representative papers or systems | What they do | How close to prediction markets? |
|---|---|---|---|
| Forecasting-specific rubric/process reward | UniPat Echo / EchoZ, QbitAI/WeChat article, FutureWorld discussion of UniPat | Score forecast reasoning trajectories with domain rubrics; Echo searches rubrics whose model rankings agree with outcome-based Elo. | Directly relevant, but Echo is a company blog/system claim rather than peer-reviewed evidence. |
| Forecasting outcome RL | Self-play/DPO, Outcome-Based RL, OpenForesight, FutureWorld | Generate reasoning and forecasts, then train on realized outcomes using Brier-style or preference rewards. | Directly relevant, but can reward lucky reasoning unless audited. |
| Step-level process supervision | Uesato et al.; Let's Verify Step by Step; Math-Shepherd | Judge whether each intermediate reasoning step is valid, either with human labels or automatic rollouts. | Conceptually important; mostly math rather than forecasting. |
| Rubrics as RL rewards | Rubrics as Rewards; Rubric Anchors; RGR-GRPO; OpenRubrics; Checklists Are Better Than Reward Models | Generate task-specific criteria, score outputs against criteria, and use the scores as dense RL rewards. | Useful for designing forecasting rubrics; not yet validated on tradable event forecasting. |
| Anti-false-positive process rewards | Curing Miracle Steps; Step-wise Rubric Rewards | Penalize correct answers reached by broken reasoning and avoid applying one scalar reward to all steps. | Highly relevant to avoiding lucky forecasts, but still needs forecasting-specific adaptation. |
2. How do these methods work?
| Mechanism | Training signal | Scientific principle | Forecasting risk |
|---|---|---|---|
| Outcome reward | Reward = function of final outcome, such as negative Brier, log score, or correctness. | Directly optimizes the target objective and is compatible with proper scoring rules. | High variance; a lucky forecast with poor reasoning is rewarded, and a good tail-risk analysis can be punished. |
| Process reward model | Reward each reasoning step for local validity. | Credit assignment is easier because the model receives feedback on where reasoning went wrong. | For forecasting, "valid reasoning" is not enough; the world can still resolve against a valid argument. |
| Rubrics as rewards | Use an LLM judge or verifier to score criteria such as base rates, evidence quality, causal mechanisms, uncertainty, and counterarguments. | Dense, interpretable reward can guide open-ended reasoning where exact verification is impossible. | Judge bias, rubric gaming, and over-rewarding visible reasoning style. |
| Automated rubric search | Generate candidate rubrics and select those whose score rankings correlate with outcome-based model rankings. | Turns "good reasoning" into a data-driven search problem rather than relying only on hand-written rubrics. | It is process-shaped but outcome-calibrated; if the search target is outcome Elo, the rubric can become a proxy for historical outcomes. |
| Step-wise rubric reward | Attribute rubric items to specific steps and combine per-step rewards with final outcome reward. | Fixes the error of giving every token in a response the same scalar reward. | Requires high-quality step segmentation and reliable judge attribution. |
| Delayed live outcome RL | Store prediction-time agent trajectories, wait for resolution, backfill rewards, then replay for policy updates. | Prevents leakage by construction and aligns training to real outcomes. | Slow labels, operational complexity, and still vulnerable to noisy outcome rewards. |
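The contrast between outcome reward and rubric reward can be made concrete with a toy hybrid reward. The linear blend, the weight `w`, and the assumption that the rubric score lies in [0, 1] are illustrative choices, not a recipe taken from any of the papers above.

```python
def hybrid_reward(prob_yes, outcome, rubric_score, w=0.5):
    """Blend a process (rubric) score with a Brier-based outcome score.

    w = 0 gives outcome-only reward, w = 1 gives process-only reward.
    rubric_score is assumed to be in [0, 1], e.g. from an LLM judge.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be in [0, 1]")
    outcome_score = 1.0 - (prob_yes - outcome) ** 2  # 1 minus Brier loss
    return w * rubric_score + (1.0 - w) * outcome_score


# A confident forecast with weak reasoning vs a measured forecast with
# strong reasoning; the event resolved YES in both cases.
lucky = dict(prob_yes=0.95, outcome=1, rubric_score=0.1)
careful = dict(prob_yes=0.70, outcome=1, rubric_score=0.9)
for w in (0.0, 0.5, 1.0):
    print(w, hybrid_reward(w=w, **lucky), hybrid_reward(w=w, **careful))
```

With `w = 0` the lucky forecast earns the higher reward, and with `w = 0.5` the ordering flips. That is the behavior an outcome-only vs process-only vs hybrid ablation would need to test on real data.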
3. What Echo appears to add
Echo combines four ideas that were previously scattered: live/future questions, prediction-point aligned Elo evaluation, automated rubric search over forecasting trajectories, and Map-Reduce agent decomposition. The WeChat/QbitAI article describes rubrics such as precursor/external-catalyst evaluation and multi-factor causal synthesis: the system checks whether the model identifies concrete forward-looking catalysts, links them to historical correlations, and integrates independent factors into a causal probability judgment.
The important caveat is circularity. Echo's public description says candidate rubrics are selected by maximizing the correlation between rubric-based rankings and outcome-based Elo rankings. That means the final training signal is not purely independent process truth. It is a process proxy tuned against realized outcomes. This may be useful, but it must be tested against outcome-only RL, rubric-only RL, and hybrid process-plus-Brier RL on fresh markets.
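The selection step described above can be sketched as a rank-correlation search. Everything below is a hypothetical reconstruction: Echo's actual correlation measure and data layout are not public in this material, so Spearman correlation and the dictionary format are assumptions.

```python
import math


def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # a constant vector carries no ranking information
    return cov / math.sqrt(vx * vy)


def select_rubrics(rubric_scores, outcome_elo, top_k=1):
    """Pick rubrics whose per-model scores best track held-out outcome Elo."""
    scored = sorted(
        ((spearman(scores, outcome_elo), name)
         for name, scores in rubric_scores.items()),
        reverse=True,
    )
    return [name for _, name in scored[:top_k]]


elo = [1000.0, 1020.0, 1050.0]  # hypothetical Elo, one entry per model
candidates = {"cites_base_rates": [0.2, 0.5, 0.9],
              "long_answers": [0.9, 0.5, 0.2]}
print(select_rubrics(candidates, elo))
```

The sketch makes the circularity visible: the only thing that distinguishes a selected rubric from a rejected one is agreement with outcome-based rankings.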
What Our Paper Could Add Beyond The Existing Literature
The clean contribution is not just another trained forecaster. The publishable gap is to test whether reasoning-process rewards actually improve tradable event forecasting once leakage, market prices, and realistic execution are controlled.
| Existing literature | What it has done | Remaining gap | Our contribution |
|---|---|---|---|
| Forecasting systems | RAG, aggregation, dynamic benchmarks, OpenForesight-style outcome RL. | Mostly evaluates Brier/accuracy, not whether process rewards improve market alpha. | Run outcome-only, rubric-only, and hybrid models on the same timestamped Kalshi/Polymarket split. |
| Echo / Train-on-Future | Proposes live future questions, point-aligned Elo, automated rubric search, and Map-Reduce agents. | Not peer-reviewed; no public EchoZ weights found; rubric search is outcome-calibrated. | Reproduce the idea transparently with hosted data, held-out prospective questions, and ablations. |
| Process-supervision papers | Show that step-level or rubric rewards can reduce false-positive reasoning in math/open-ended tasks. | They do not solve stochastic world outcomes, market prices, or temporal leakage. | Design forecasting-specific rubrics for base rates, catalysts, causal synthesis, source dating, and calibrated uncertainty. |
| Leakage papers | Use dated corpora, prospective questions, and claim-level audits. | Leakage audits rarely apply to the training reward itself. | Audit whether the rubric rewards leaked claims, not just whether final answers were contaminated. |
| Prediction-market papers | Show models can be calibrated poorly and can lose money despite plausible forecasts. | They rarely ask which reasoning failures cause bad trades. | Connect reasoning-rubric dimensions to Brier Skill Score, expected value, P&L, drawdown, and deference to market price. |
Recommended experimental design: start with Qwen3-8B/OpenForecaster-8B, build forecasting rubrics from resolved training markets, then compare four systems: base agent, outcome-RL agent, process-rubric agent, and hybrid process+Brier agent. Final claims should be made only on prospective or strictly point-in-time market questions, with market price as a mandatory baseline.
Prediction Markets: What Changes When Events Are Tradable
Prediction-market forecasting is not just another Brier benchmark. A market event has prices, bid/ask spreads, depth, fees, market makers, resolution rules, and trader selection. A model can improve Brier and still lose money if it trades too late, crosses wide spreads, overbets low-liquidity contracts, or follows public prices without adding information.
| Paper | What it did | Lesson |
|---|---|---|
| KalshiBench | 300 temporally filtered Kalshi questions, five frontier models. | Overconfidence is widespread; calibration is a distinct capability. |
| Prophet Arena | 1,367 resolved events and 72,136 markets with common contexts. | Evaluate Brier, calibration, and economic value together. |
| Prediction Arena | Autonomous agents trading with real capital on Kalshi and Polymarket. | Live trading exposes model, execution, and venue weaknesses. |
| PolyBench | 38,666 Polymarket markets with CLOB snapshots and news. | Order-book simulation and slippage are core evaluation components. |
| TimeSeek | 150 Kalshi markets at five lifecycle checkpoints, with/without search. | LLMs add the most value early and in high-uncertainty regimes. |
| PolySwarm / Evidence Markets | Multi-agent trading and limits of evidence aggregation. | Reflexivity, manipulation, and resolution ambiguity are open problems. |
Strategic implication: the differentiated project is not simply "train a better model." It is to measure when a model adds information beyond price, when it should defer to the market, and whether that edge survives realistic execution.
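A toy one-share trade shows how spread and fees can erase an apparent edge. The flat per-share fee and the rule that the edge net of fees must exceed a threshold are simplifying assumptions; real venue fee schedules and partial fills are more complicated.

```python
def trade_pnl(model_p, bid, ask, outcome, fee=0.0, edge_threshold=0.05):
    """P&L of a one-share, edge-triggered trade on a binary contract.

    Buy YES at the ask when the model is sufficiently above it, buy NO
    (equivalently sell YES at the bid) when the model is sufficiently
    below the bid, otherwise abstain. Prices are in [0, 1] per share.
    """
    if model_p - ask > edge_threshold + fee:
        return outcome - ask - fee   # YES pays 1 if the event happens
    if bid - model_p > edge_threshold + fee:
        return bid - outcome - fee   # NO costs 1 - bid, pays 1 - outcome
    return 0.0                       # no trade


# Same model probability and outcome, different market microstructure.
print(trade_pnl(0.66, bid=0.60, ask=0.60, outcome=1))            # no spread
print(trade_pnl(0.66, bid=0.58, ask=0.62, outcome=1, fee=0.01))  # spread + fee
```

In the second call the forecast is just as good, but the edge over the ask no longer clears the threshold, so the position is never opened.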
Training Strategy
Training is feasible at 8B scale, but the decision should be gated by baselines. The right default is to build an agentic forecaster and evaluator first, then train only where the evaluator shows model-specific errors that post-training can plausibly fix.
| Option | Pros | Cons | When to use |
|---|---|---|---|
| No training: agent + ensemble | Fast, cheap, strong baseline; can use BLF-style belief states and calibration. | Depends on API models; not a proprietary asset. | First milestone. |
| Closed/API model improvement | Improves forecasts through frozen retrieval, personas, self-consistency, market-price gating, calibration, and rubric feedback without changing weights. | Harder to own as a model artifact; API providers can change model behavior. | Run in parallel with dataset construction and use outputs to create training data. |
| LoRA/SFT on 8B | Cheap and stable; teaches format, base-rate discipline, and calibration language. | Limited true forecasting gain if data are weak. | After evaluation stack exists. |
| 8B GRPO/ReMax RL | Closest to OpenForesight and outcome-RL recipes. | ~1,000 H100-hr final run; reward and leakage risks. | After baselines show clear training target. |
| Market-specific RL | Targets tradable events directly. | P&L reward is noisy and overfit-prone; market impact issues. | Only after probability calibration works. |
| 30B-A3B or larger | Potentially stronger reasoning and retrieval synthesis. | Multi-node complexity and six-figure ablation budget. | Only if 8B beats market/ensemble baselines cleanly. |
What We Should Do, Step By Step
1. Target tradable binary and multi-outcome events on Kalshi and Polymarket, plus open-ended event questions for broader generalization.
2. Store market prices, bid/ask, CLOB depth, volume, fees, resolution text, close/resolution dates, and snapshots. Do not rely on final market pages.
3. Use static snapshots of news, official releases, SEC/EDGAR, FRED/BLS/BEA, sports APIs, and market pages. Hash documents and retain publication/retrieval timestamps.
4. Report Brier, log score, calibration curves, ECE, market-price Brier Skill Score, P&L with fees/spreads/slippage, drawdown, and coverage-risk curves.
5. Benchmark against market price, market price plus calibration transforms, OpenForecaster-8B, frontier single models, frontier ensembles, and BLF-style agentic search with logit aggregation.
6. Start the leakage audit with source cutoff checks, answer alias filtering, and post-cutoff phrase detection. Add a claim-level Shapley-DCLR-style audit for final experiments.
7. Reproduce OpenForesight/OpenForecaster-style training on Qwen3-8B with structured probability outputs, random retrieved chunks, Accuracy+Brier reward, no std-normalized advantages, and strict formatting guardrails.
8. Use Brier/log rewards and calibration losses on resolved tradable events. Treat P&L as a holdout selection metric, not the first reward.
9. Scale to 30B-A3B only if 8B results show robust leakage-clean alpha or if the scientific contribution requires open weights at larger scale.
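Two details of the 8B recipe named above, an Accuracy+Brier reward and group advantages without std normalization, can be illustrated as follows. The exact OpenForesight reward is not reproduced in this document, so the binary form "accuracy minus Brier loss" is an assumption. The advantage rule simply centers rewards on the group mean and skips the usual division by the group standard deviation.

```python
def accuracy_brier_reward(prob_yes, outcome):
    """Assumed binary form: 1 if the forecast is on the right side of 0.5,
    minus the Brier loss of the stated probability."""
    on_right_side = (prob_yes > 0.5 and outcome == 1) or \
                    (prob_yes < 0.5 and outcome == 0)
    correct = 1.0 if on_right_side else 0.0
    return correct - (prob_yes - outcome) ** 2


def group_advantages(rewards):
    """Mean-centered advantages for one group of rollouts.

    No division by the group standard deviation, so groups with nearly
    identical rewards are not amplified into large advantages.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Four hypothetical rollouts for one question that resolved YES.
rollout_probs = [0.9, 0.6, 0.4, 0.2]
rewards = [accuracy_brier_reward(p, 1) for p in rollout_probs]
print(rewards)
print(group_advantages(rewards))
```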
Research Questions Worth Writing A Paper Around
- Calibration to profit: characterize when a calibrated probability forecaster produces positive expected return under CLOB fees, spreads, and depth.
- Selective deference: learn a gate that chooses market price, model, search-agent, or abstain by horizon, liquidity, category, and uncertainty.
- Leakage-robust backtesting: adapt Shapley-DCLR to prediction-market rationales and open-ended forecasts.
- Cross-venue price discovery: measure Kalshi vs Polymarket lead-lag and whether LLMs exploit or merely follow it.
- Reflexivity: test whether public AI forecasts alter prices and degrade future training labels.
- Aggregation vs training: compare OpenForesight-style training to a cost-matched ensemble of frontier models.
Chronological Stream Matrix and Hosted PDF Library
The table is ordered chronologically by first public paper date. It classifies each paper into a stream and states what the paper solved on top of the previous literature. I removed the older pre-LLM market-theory papers from this visible ranked library so the matrix focuses on modern LLM forecasting, datasets, rewards, social forecasting, and API/system improvement.
| Stream | Date | Venue / publication | Paper | What it solves on top of previous work | |
|---|---|---|---|---|---|
| Dataset Construction | Jun 2022 | NeurIPS 2022 Datasets and Benchmarks | Autocast | Introduces dated event-forecasting questions and a time-indexed news corpus, making backcasting possible instead of ordinary QA on resolved facts. | |
| API/System Improvement | Jul 2022 | arXiv preprint | Language Models (Mostly) Know What They Know | Adds the idea that models may expose usable uncertainty, which later supports API-level calibration and confidence elicitation. | |
| API/System Improvement | Oct 2022 | ICLR 2023 | ReAct | Moves from static prompting to interleaved reasoning and action, the basic pattern for search-based forecasting agents. | |
| Process-Based Rewards | Nov 2022 | arXiv preprint | Solving Math Word Problems with Process- and Outcome-Based Feedback | Separates process supervision from outcome supervision and frames why final-answer reward is weak credit assignment. | |
| API/System Improvement | Feb 2023 | NeurIPS 2023 | Toolformer | Shows models can learn tool use, extending ReAct from prompting into trainable tool-augmented behavior. | |
| API/System Improvement | May 2023 | arXiv preprint | Just Ask for Calibration | Shows verbalized probabilities can improve calibration, a direct method for closed API forecasting systems. | |
| Process-Based Rewards | May 2023 | arXiv preprint | Let's Verify Step by Step | Demonstrates that step-level reward models can outperform outcome-only reward models, strengthening the case for reasoning rewards. | |
| Social Forecasting | Oct 2023 | arXiv preprint | Large Language Model Prediction Capabilities | Tests GPT-4 against a live human forecasting tournament, finding that a raw frontier API model underperformed the human crowd. | |
| Process-Based Rewards | Dec 2023 | arXiv preprint | Math-Shepherd | Automates process labels with rollouts, reducing dependence on expensive human step annotations. | |
| Process-Based Rewards | Feb 2024 | arXiv technical report | DeepSeekMath / GRPO | Introduces GRPO-style critic-free RL, later reused by forecasting post-training recipes. | |
| Social Forecasting | Feb 2024 | ACM TiiS; final Feb 2025 | AI-Augmented Predictions | Shows LLM assistance can improve human forecasting, moving beyond model-only evaluation. | |
| Social Forecasting | Feb 2024 | NeurIPS 2024 | Approaching Human-Level Forecasting with Language Models | Combines retrieval, reasoning, and aggregation to approach human crowd performance, showing system design beats raw prompting. | |
| Social Forecasting | Feb 2024 | Science Advances; final Nov 2024 | Wisdom of the Silicon Crowd | Shows model ensembles can rival human crowds and that human/model aggregation is a powerful low-compute baseline. | |
| Dataset Construction | May 2024 | arXiv preprint | Freshbench / Is Your LLM Outdated? | Pushes evaluation toward fresh, post-cutoff questions so benchmark results are not stale memorization tests. | |
| API/System Improvement | Jun 2024 | arXiv preprint | Can Language Models Use Forecasting Strategies? | Tests whether prompting can elicit superforecasting heuristics, clarifying limits of API-only improvement. | |
| Dataset Construction | Jul 2024 | arXiv preprint | MIRAI | Adds agentic event forecasting with structured historical events and news, broadening datasets beyond static text questions. | |
| Dataset Construction | Sep 2024 | ICLR 2025 | ForecastBench | Introduces a dynamic unresolved-question benchmark, solving the core leakage problem by waiting for future resolutions. | |
| Dataset Construction | Jan 2025 | arXiv preprint | Navigating Tomorrow | Studies pre/post-cutoff behavior, making model cutoff itself an explicit evaluation variable. | |
| Process-Based Rewards | Jan 2025 | arXiv technical report | DeepSeek-R1 | Shows large-scale RLVR can improve reasoning while warning that reward models and verifiers can be gamed. | |
| Dataset Construction | Jan 2025 | COLING 2025 | OpenForecast | Expands forecasting from binary questions to large-scale open-ended event decomposition and semantic resolution. | |
| Process-Based Rewards | Feb 2025 | arXiv preprint | LLMs Can Teach Themselves to Better Predict the Future | Uses self-play reasoning traces ranked by eventual outcomes, showing post-training can improve forecasting without human-written rationales. | |
| API/System Improvement | Mar 2025 | COLM 2025 | Search-R1 | Trains search as part of reasoning and masks retrieved tokens, separating model behavior from copied evidence. | |
| Process-Based Rewards | Mar 2025 | arXiv open RL report | DAPO | Improves GRPO-style RL infrastructure with stability tricks later applicable to forecasting RL. | |
| Process-Based Rewards | May 2025 | TMLR; published Nov 2025 | Outcome-Based RL to Predict the Future | Adapts RL to noisy delayed binary outcomes and shows calibration/profit can improve, but still depends on outcome rewards. | |
| Dataset Construction | May 2025 | arXiv preprint | ExAnte | Formalizes ex-ante inference and evaluates whether models recall post-cutoff outcomes rather than forecasting. | |
| Dataset Construction | Jun 2025 | arXiv preprint | Pitfalls in Evaluating LM Forecasters | Audits evaluation methodology and explains why static benchmark gains can overstate real forecasting skill. | |
| API/System Improvement | Jun 2025 | arXiv preprint | Prompt Engineering LLM Forecasting Capabilities | Systematically tests prompt variants, showing API-level gains are modest and must be measured rather than assumed. | |
| Dataset Construction | Jun 2025 | arXiv preprint | Bench to the Future | Builds frozen-corpus pastcasting so researchers can iterate faster than prospective benchmarks allow. | |
| Process-Based Rewards | Jul 2025 | ICLR 2026 | Beyond Binary Rewards / RLCR | Adds calibrated Brier-style reward to correctness, addressing overconfident outcome-only RL. | |
| Process-Based Rewards | Jul 2025 | arXiv preprint | Rubrics as Rewards | Uses structured rubrics as RL rewards for open-ended tasks, directly motivating Echo-style process scoring. | |
| Process-Based Rewards | Jul 2025 | arXiv preprint | Checklists Are Better Than Reward Models | Shows instruction-specific checklists can outperform fixed scalar reward models, supporting task-specific forecasting rubrics. | |
| Process-Based Rewards | Jul 2025 | arXiv preprint | Advancing Event Forecasting through Massive Training of LLMs | Scales event-forecasting post-training, shifting the literature from evaluation to direct model improvement. | |
| Dataset Construction | Aug 2025 | arXiv preprint | FutureX | Operationalizes live, continuously updated agent evaluation over many models and event sources. | |
| Process-Based Rewards | Aug 2025 | arXiv technical report | Reinforcement Learning with Rubric Anchors | Shows rubric-based RL can train a Qwen-30B-A3B open-ended reasoning model, relevant to UniScientist-style training. | |
| API/System Improvement | Aug 2025 | NeurIPS 2025 | ConfTuner | Turns verbalized confidence into a trainable Brier-style objective, bridging API confidence and post-training. | |
| Process-Based Rewards | Oct 2025 | arXiv preprint | OpenRubrics | Scales synthetic rubric generation, solving the manual-rubric bottleneck for process-reward training. | |
| Process-Based Rewards | Oct 2025 | arXiv preprint | Curing Miracle Steps | Targets false positives where correct answers arise from invalid reasoning, the exact analogue of lucky forecasts. | |
| Dataset Construction | Oct 2025 | arXiv preprint | Prophet Arena | Adds common-context prediction-market evaluation with market prices, linking forecasting benchmarks to tradable events. | |
| Process-Based Rewards | Nov 2025 | arXiv preprint | Reward and Guidance through Rubrics / RGR-GRPO | Combines rubric reward with offline guidance, improving exploration beyond ordinary verifiable-reward RL. | |
| Dataset Construction | Dec 2025 | arXiv preprint | KalshiBench | Uses resolved Kalshi questions after model cutoffs to measure overconfidence on regulated prediction-market events. | |
| Dataset Construction | Dec 2025 | arXiv preprint | OpenForesight / OpenForecaster | Combines synthetic open-ended forecast data, frozen retrieval, Accuracy+Brier reward, and a public 8B trained model. | |
| Process-Based Rewards | Feb 2026 | arXiv preprint | All Leaks Count / Shapley-DCLR | Moves leakage analysis from dataset-level hygiene to claim-level decision impact, enabling leakage-weighted rewards/evaluation. | |
| Social Forecasting | Apr 2026 | arXiv preprint | PolySwarm | Tests multi-agent LLM trading behavior, expanding social forecasting into interacting market agents. | |
| Dataset Construction | Apr 2026 | 2nd ICLR Workshop on Advances in Financial AI | TimeSeek | Evaluates Kalshi markets at multiple lifecycle checkpoints and shows when search/API models help or hurt. | |
| Dataset Construction | Apr 2026 | arXiv preprint | Prediction Arena | Uses real capital on live Kalshi/Polymarket agents, making execution and realized returns part of evaluation. | |
| Dataset Construction | Apr 2026 | arXiv preprint | PolyBench | Adds CLOB snapshots, news, and simulated returns, closing the gap between Brier evaluation and tradable performance. | |
| Social Forecasting | Apr 2026 | arXiv preprint | Agentic Forecasting using Sequential Bayesian Updating | Introduces belief-state updates and aggregation as a principled alternative to one-shot model forecasts. | |
| Dataset Construction | Apr 2026 | arXiv preprint | BTF-2 | Extends frozen-corpus pastcasting with a larger corpus and reasoning traces, improving reproducibility. | |
| Process-Based Rewards | Apr 2026 | arXiv preprint | FutureWorld | Closes the loop from live questions to delayed real-world rewards, contrasting outcome backfill with process-only rewards. | |
| Process-Based Rewards | May 2026 | arXiv preprint | Teaching LLMs When Not to Know | Uses training and prompt-based critique to make models reject post-cutoff knowledge, supporting temporal admissibility in reasoning rewards. | |
| Process-Based Rewards | May 2026 | arXiv preprint | Step-wise Rubric Rewards | Attributes rubric items to individual reasoning steps, fixing the problem of applying one scalar rubric reward to all tokens. | |
| Social Forecasting | May 2026 | arXiv preprint | Multi-Agent AI Oracles for Prediction Market Resolution | Uses multiple agents for market resolution, extending social forecasting to adjudication and settlement. | |
| API/System Improvement | Jun 2026 | arXiv preprint | LLMs Are Overconfident | Consolidates evidence that reasoning/API models remain overconfident, reinforcing calibration as a separate improvement target. | |
| Social Forecasting | Jun 2026 | arXiv preprint | Evidence Markets | Analyzes limits of markets as evidence aggregators, warning that social forecasting systems also need structure and governance. | |
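
Several Process-Based Rewards rows above turn on two mechanics: a Brier-style term added to a correctness reward, and GRPO-style critic-free advantages computed within a group of sampled answers. The sketch below is one minimal way those pieces could fit together for a binary question. It is an illustration under stated assumptions, not the recipe of any listed paper: the function names, the 0.5 decision threshold, and the exact reward form (correctness minus squared error) are hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def brier_score(prob: float, outcome: int) -> float:
    """Squared error between a forecast probability and a 0/1 outcome (lower is better)."""
    return (prob - outcome) ** 2


def correctness_plus_brier_reward(prob: float, outcome: int) -> float:
    """Hypothetical reward: correctness indicator minus the Brier penalty.

    Assumption: a forecast counts as correct when it falls on the resolved
    side of 0.5; a forecast of exactly 0.5 earns no correctness credit.
    """
    if prob == 0.5:
        correct = 0.0
    else:
        correct = 1.0 if (prob > 0.5) == (outcome == 1) else 0.0
    return correct - brier_score(prob, outcome)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own sampling group (no learned critic)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every sample earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled forecasts for one question that resolved YES (outcome = 1).
probs = [0.9, 0.6, 0.4, 0.1]
rewards = [correctness_plus_brier_reward(p, 1) for p in probs]
advantages = group_relative_advantages(rewards)

for p, r, a in zip(probs, rewards, advantages):
    print(f"p={p:.1f}  reward={r:+.2f}  advantage={a:+.2f}")
```

In this toy group the confident correct forecast earns the highest advantage and the confident wrong forecast the lowest. The reward depends only on the resolved outcome, so a forecast that was right for bad reasons is still rewarded, which is the credit-assignment weakness the rubric and process-reward rows respond to.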