# Dataset Repository and Evaluation Pipeline

Updated: July 1, 2026

## Local Dataset Status

The local downloader has pulled the public event-forecasting and prediction-market datasets that are small enough to fetch automatically into `data/raw/`.

- Downloaded or cloned: 33 datasets/code repositories.
- Local footprint: 8.31 GiB.
- Failed downloads: 0.
- Skipped by policy: 5.
- Download command: `python3 scripts/download_datasets.py --max-size-gb 10`.
- Full manifest: `manifests/dataset_inventory.md`.
- Schema profile: `manifests/dataset_profile.md`.

The five sources skipped by policy are cataloged but not downloaded by default: four large Polymarket mirrors and one manual download.

- `SII-WANGZJ/Polymarket_data`: 159.11 GiB.
- `trentmkelly/polymarket_crypto_derivatives`: 17.90 GiB and 53k files.
- `aliplayer1/polymarket-crypto-updown`: 27.08 GiB.
- `moose-code/polymarket-onchain-v1`: 118.45 GiB.
- OpenForecast is listed as manual because the dataset is on Google Drive.
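
As a rough illustration of the size policy, the sketch below checks the Polymarket mirror sizes listed above against a 10 GiB cap. It assumes that `--max-size-gb` applies per dataset and is compared in GiB; `CATALOG_GIB` and `should_skip` are hypothetical names and are not taken from `scripts/download_datasets.py`.

```python
# Sizes in GiB, copied from the skip list above.
CATALOG_GIB = {
    "SII-WANGZJ/Polymarket_data": 159.11,
    "trentmkelly/polymarket_crypto_derivatives": 17.90,
    "aliplayer1/polymarket-crypto-updown": 27.08,
    "moose-code/polymarket-onchain-v1": 118.45,
}


def should_skip(size_gib: float, max_size_gb: float = 10.0) -> bool:
    """Hypothetical policy: skip any dataset larger than the cap."""
    return size_gib > max_size_gb


skipped = sorted(name for name, size in CATALOG_GIB.items() if should_skip(size))
print(len(skipped), "mirrors exceed the cap")
```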

## High-Value Downloaded Sources

Dataset construction:

- OpenForesight: questions, answers, resolution criteria, article timestamps, generated forecasts.
- ForecastBench: dynamic benchmark datasets and human/LLM forecast sets.
- FutureX Past/Online: resolved and live dynamic future-prediction questions.
- KalshiBench v1/v2: Kalshi questions with `ground_truth` and `market_probability`.
- Prophet Arena subsets: market events, market data, submissions, outcomes, and sources.
- Autocast, MIRAI, PolyBench, PROPHET, and Halawi et al. repositories.

Market data:

- Kalshi markets and trade mirrors.
- Kalshi close snapshots and forecast snapshots.
- Polymarket small/medium mirrors, closed-market samples, minute parquet, 5-minute crypto up/down markets, and cross-venue Polymarket/Kalshi samples.

Social forecasting:

- Metaculus binary datasets and timestamped forecast snapshots.
- Halawi et al. forecasting agent code/data repository (also listed under dataset construction).

## First Evaluation Pipeline

The first local evaluation package is in `forecast_eval/`.

Supported now:

- Binary forecast scoring from CSV/JSON/JSONL/parquet.
- Brier score.
- Log score.
- Accuracy at a 0.5 probability threshold.
- Calibration bins and expected calibration error.
- Market-price Brier skill score.
- Edge-triggered one-share market backtest against a market probability baseline.
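
The binary metrics listed above have standard textbook definitions, sketched here in plain Python. This is not the `forecast_eval` implementation: the function names, the equal-width binning, the `eps` clipping, and the log score sign convention (mean negative log-likelihood, lower is better) are all assumptions.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_score(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)


def accuracy_at_half(probs, outcomes):
    """Share of forecasts on the correct side of 0.5 (0.5 counts as 'yes')."""
    return sum((p >= 0.5) == bool(y) for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(probs, outcomes, bins=5):
    """Count-weighted mean gap between average forecast and hit rate per bin."""
    stats = [[0.0, 0.0, 0] for _ in range(bins)]  # [sum of p, sum of y, count]
    for p, y in zip(probs, outcomes):
        i = min(int(p * bins), bins - 1)  # p == 1.0 falls in the last bin
        stats[i][0] += p
        stats[i][1] += y
        stats[i][2] += 1
    n = len(probs)
    return sum(abs(sp - sy) / n for sp, sy, count in stats if count)


probs = [0.9, 0.25, 0.65, 0.45]  # made-up forecasts for illustration
outcomes = [1, 0, 0, 1]
print(brier_score(probs, outcomes), expected_calibration_error(probs, outcomes))
```

Brier and log score are both proper scoring rules, so they reward honest probabilities, while accuracy at 0.5 ignores confidence entirely and is best read alongside the calibration bins.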

Sample command:

```bash
python3 -m forecast_eval.cli \
  --predictions examples/predictions_sample.csv \
  --out data/eval/sample_metrics.json \
  --calibration-out data/eval/sample_calibration.csv \
  --bins 5 \
  --edge-threshold 0.05
```
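
One plausible reading of the `--edge-threshold` flag is sketched below: trade a single share whenever the model disagrees with the market by at least the threshold. The trading rule, the zero-fee and zero-spread settlement, and the name `edge_backtest` are assumptions for illustration, not a description of what `forecast_eval` does internally.

```python
def edge_backtest(rows, edge_threshold=0.05):
    """rows: iterable of (model_prob, market_prob, outcome) tuples, outcome in {0, 1}."""
    trades = 0
    pnl = 0.0
    for model_prob, market_prob, outcome in rows:
        edge = model_prob - market_prob
        if edge >= edge_threshold:
            # Buy one YES share at the market price; it pays 1 if the outcome is 1.
            pnl += outcome - market_prob
            trades += 1
        elif -edge >= edge_threshold:
            # Buy one NO share at 1 - market price; it pays 1 if the outcome is 0.
            pnl += (1 - outcome) - (1 - market_prob)
            trades += 1
    return {"trades": trades, "pnl": pnl}


# Made-up rows: two trades trigger, the third edge (0.02) is below the threshold.
result = edge_backtest([(0.8, 0.6, 1), (0.3, 0.5, 0), (0.52, 0.5, 1)])
print(result)
```

A realistic backtest would also need the bid/ask, depth, and fee data that the mid-price baseline ignores.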

Validation run:

- `python3 -m unittest tests/test_metrics.py` passed.
- Sample output on the toy file: model Brier 0.12105, market Brier 0.16765, Brier skill score against the market 0.2780.
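
The reported skill figure is consistent with the usual Brier skill score, one minus the ratio of model Brier to reference Brier, with the market as the reference. The check below assumes that definition; `brier_skill_score` is a hypothetical helper name.

```python
def brier_skill_score(model_brier: float, reference_brier: float) -> float:
    """1 - model/reference; positive values mean the model beats the reference."""
    return 1.0 - model_brier / reference_brier


skill = brier_skill_score(0.12105, 0.16765)
print(f"{skill:.4f}")  # prints 0.2780
```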

## Next Engineering Step

The evaluation pipeline should now be extended with dataset adapters:

1. Normalize KalshiBench into `question_id`, `forecast_time`, `model_prob`, `market_prob`, `outcome`, `category`, `source`.
2. Normalize OpenForesight into ex-ante question/answer records and keep article publication dates for leakage checks.
3. Normalize Prophet Arena and ForecastBench into comparable probability records.
4. Build market snapshots from the Kalshi/Polymarket mirrors that include bid, ask, depth, volume, and fees.
5. Add multiclass proper scoring rules, because Echo and Prophet Arena questions are often not binary.
6. Add point-in-time retrieval cutoffs before any modeling run.
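
A minimal sketch of how steps 1, 2, and 5 might fit together follows. The field names come from step 1; everything else is an assumption, including the field types, the optional `market_prob`, and the hypothetical names `ForecastRecord`, `leaked_articles`, and `multiclass_brier`. The sample values are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class ForecastRecord:
    """One normalized binary forecast; field names follow step 1."""
    question_id: str
    forecast_time: datetime
    model_prob: float
    market_prob: Optional[float]  # assumed optional: some sources have no market price
    outcome: int
    category: str
    source: str

    def __post_init__(self) -> None:
        if not 0.0 <= self.model_prob <= 1.0:
            raise ValueError("model_prob must be in [0, 1]")
        if self.outcome not in (0, 1):
            raise ValueError("outcome must be 0 or 1")


def leaked_articles(forecast_time: datetime, published: Sequence[datetime]) -> List[datetime]:
    """Return publication dates at or after the forecast time (possible leakage)."""
    return [t for t in published if t >= forecast_time]


def multiclass_brier(prob_rows: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean over questions of the squared error summed across classes."""
    total = 0.0
    for probs, label in zip(prob_rows, labels):
        total += sum((p - (1.0 if k == label else 0.0)) ** 2 for k, p in enumerate(probs))
    return total / len(labels)


# Illustrative values only; timestamps are timezone-aware so comparisons are safe.
cutoff = datetime(2026, 6, 1, tzinfo=timezone.utc)
record = ForecastRecord("q-001", cutoff, 0.7, 0.6, 1, "economics", "kalshibench")
articles = [
    datetime(2026, 5, 30, tzinfo=timezone.utc),
    datetime(2026, 6, 2, tzinfo=timezone.utc),
]
print(leaked_articles(record.forecast_time, articles))
print(multiclass_brier([[0.7, 0.2, 0.1]], [0]))
```

Validating probabilities and outcomes at construction time keeps malformed rows out of the scorer, and the same cutoff comparison can serve the point-in-time retrieval check in step 6.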
