# GLM-5.2 OpenCode Evaluation: Kalshi Filtered Pilot

Updated: July 1, 2026

## Summary

We ran `zai-coding-plan/glm-5.2` through the OpenCode CLI on a partial set of resolved Kalshi markets. We stopped after 480 accepted forecasts because the goal was an operational pilot, not a full benchmark run.

This is a useful first check of the evaluation stack, but it is not a leakage-clean forecasting result. GLM-5.2 may already know some 2025 outcomes through parametric memory, and the sample is the first sequential 480 rows of `kalshi_filtered`, not a randomized holdout.

## Setup

| Item | Value |
| --- | --- |
| Model | `zai-coding-plan/glm-5.2` |
| Runner | OpenCode CLI 1.17.9 |
| Dataset | `data/raw/kalshi_filtered/data/train-00000-of-00001.parquet` |
| Completed rows | 480 of 1,433 |
| Sample | Sequential rows `kf_0000` through `kf_0479` |
| Market baseline | `price_at_cutoff / 100` |
| Prompting | Closed-book; market price withheld; instructed not to use tools/web/post-cutoff information |
| Scoring | Brier, log score, accuracy, ECE, market-skill score, simple edge backtest |

OpenCode command pattern:

```bash
opencode run -m zai-coding-plan/glm-5.2 --format json "<batch forecast prompt>"
```

The model was required to output:

```json
{
  "forecasts": [
    {
      "row_id": "kf_0000",
      "prob_yes": 0.42,
      "confidence": "low|medium|high",
      "rationale": "short"
    }
  ]
}
```

A batch was accepted only if every expected `row_id` appeared exactly once with a probability in `[0, 1]`; otherwise the whole batch was rejected and retried.
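The acceptance rule can be sketched as a small validator. This is a minimal illustration, not the runner's actual code: the function name `validate_batch` and its error messages are hypothetical, and it assumes the raw model output is a single JSON object in the schema shown above.

```python
import json
import math
from collections import Counter


def validate_batch(raw_text, expected_ids):
    """Return the forecasts list if the batch is acceptable, else raise ValueError.

    Hypothetical helper: mirrors the stated rule that every expected row_id
    appears exactly once with a probability in [0, 1].
    """
    payload = json.loads(raw_text)  # JSONDecodeError is a ValueError subclass
    if not isinstance(payload, dict) or not isinstance(payload.get("forecasts"), list):
        raise ValueError("missing 'forecasts' list")
    forecasts = payload["forecasts"]
    if not all(isinstance(f, dict) for f in forecasts):
        raise ValueError("forecast entries must be objects")

    # Compare multisets so that missing, extra, and duplicated ids all fail.
    seen = Counter(f.get("row_id") for f in forecasts)
    if seen != Counter(expected_ids):
        raise ValueError("row_id set does not match the expected batch")

    for f in forecasts:
        p = f.get("prob_yes")
        is_number = isinstance(p, (int, float)) and not isinstance(p, bool)
        if not is_number or math.isnan(p) or not 0.0 <= p <= 1.0:
            raise ValueError(f"invalid prob_yes for {f.get('row_id')}")
    return forecasts
```

Rejecting the whole batch on any failure keeps the checkpoint file free of partially valid batches, at the cost of re-running rows that were individually fine.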

## Aggregate Results

| Metric | GLM-5.2 | Kalshi cutoff market | Interpretation |
| --- | ---: | ---: | --- |
| N | 480 | 480 | Partial pilot subset |
| Brier score | 0.2475 | 0.1931 | GLM worse; lower is better |
| Log score (log loss) | 0.7214 | 0.5725 | GLM worse; lower is better |
| Accuracy at 0.5 | 61.46% | 69.79% | GLM worse |
| Expected calibration error | 0.1857 | 0.1656 | GLM worse |
| Brier skill vs market | -28.15% | baseline | Negative means worse than market |

Main result: as a text-only forecaster with no market price, GLM-5.2 underperformed the Kalshi cutoff market price on every aggregate metric: both proper scoring rules (Brier and log score), accuracy at the 0.5 threshold, and expected calibration error.
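For reference, the headline metrics can be computed along the following lines. This is a sketch under stated assumptions rather than the pipeline's scoring code: the function names are illustrative, the clipping constant `eps` in the log loss is an assumed choice, and treating a probability of exactly 0.5 as a YES call is an assumed tie rule.

```python
import math


def brier(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_loss(probs, outcomes, eps=1e-6):
    """Mean negative log-likelihood; eps clipping is an assumption, not from the report."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # avoid log(0) on extreme forecasts
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)


def accuracy(probs, outcomes, threshold=0.5):
    """Share of rows where the thresholded forecast matches the outcome."""
    return sum((p >= threshold) == bool(y) for p, y in zip(probs, outcomes)) / len(probs)


def brier_skill(model_brier, market_brier):
    """Relative improvement over the market baseline; negative means worse."""
    return 1 - model_brier / market_brier


# Using the rounded table values: 1 - 0.2475 / 0.1931 is about -0.2817.
print(round(brier_skill(0.2475, 0.1931), 4))
```

The rounded table values give roughly -28.17%, close to the reported -28.15%; the small difference is presumably because the report computed skill from unrounded Brier scores.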

## Naive Edge Backtest

The simple one-share backtest trades YES when `model_prob - market_prob >= 0.05` and NO when `market_prob - model_prob >= 0.05`.

| Metric | Value |
| --- | ---: |
| Trades | 406 |
| YES trades | 131 |
| NO trades | 275 |
| Gross P&L | +17.57 units |
| Net P&L before fees/slippage | +17.57 units |
| Mean P&L / trade | +0.0433 |
| Hit rate | 48.03% |
| Average absolute edge | 0.2589 |

This positive P&L should be treated cautiously. It ignores fees, spreads, order-book depth, partial fills, liquidity, and sample selection. The hit rate is below 50% (48.03%), so the profit comes from winning trades paying more on average than losing trades cost, not from the model being right more often than wrong. The result also coexists with worse Brier and log scores, so the model's probabilities are not more accurate than the market's overall. The likely interpretation is that this partial sample contains some large model-market disagreements that happened to pay off, not that GLM-5.2 is ready as a trading system.
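The trading rule can be sketched as follows. The payoff convention here is an assumption, since the report does not spell it out: a YES share is bought at `market_prob` and pays 1 if the market resolves YES, a NO share is bought at `1 - market_prob` and pays 1 if it resolves NO, and a "hit" is a trade with positive P&L. The function name `edge_backtest` is illustrative.

```python
def edge_backtest(model_probs, market_probs, outcomes, threshold=0.05):
    """One-share backtest: trade toward the model when it disagrees with the market.

    Assumed payoffs (not stated in the report): YES costs market_prob,
    NO costs 1 - market_prob, each pays 1 unit if correct. No fees or slippage.
    """
    trades = []
    for m, q, y in zip(model_probs, market_probs, outcomes):
        if m - q >= threshold:
            trades.append(("YES", y - q))  # payout y minus cost q
        elif q - m >= threshold:
            trades.append(("NO", q - y))   # payout (1 - y) minus cost (1 - q)

    n = len(trades)
    pnl = sum(p for _, p in trades)
    return {
        "trades": n,
        "yes_trades": sum(side == "YES" for side, _ in trades),
        "no_trades": sum(side == "NO" for side, _ in trades),
        "pnl": pnl,
        "mean_pnl": pnl / n if n else 0.0,
        "hit_rate": sum(p > 0 for _, p in trades) / n if n else 0.0,
    }
```

Because each trade's payoff depends on the entry price, a few wins at long odds can outweigh many small losses, which is how a positive total can sit alongside a hit rate under 50%.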

## Calibration

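GLM-5.2 was overconfident at both ends of the probability range on this subset: YES resolved far less often than predicted in the high-probability bins, and more often than predicted in the lowest bin. Selected bins are shown below; Gap is the absolute difference between average predicted and empirical YES rates.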

| Probability bin | Count | Avg predicted YES | Empirical YES | Gap |
| --- | ---: | ---: | ---: | ---: |
| 0.0-0.1 | 88 | 0.045 | 0.273 | 0.228 |
| 0.4-0.5 | 92 | 0.434 | 0.261 | 0.173 |
| 0.5-0.6 | 74 | 0.541 | 0.297 | 0.244 |
| 0.7-0.8 | 15 | 0.751 | 0.333 | 0.418 |
| 0.9-1.0 | 7 | 0.930 | 0.429 | 0.501 |

The high-confidence bins are small (15 and 7 forecasts), so their gaps are noisy, but they show the core issue: the model's confident probabilities were not matched by outcome frequencies.
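A calibration table of this shape, and an ECE derived from it, can be produced with a short binning routine. This is a sketch assuming ten equal-width bins, which matches the bin labels above; whether the reported ECE used exactly this binning is not stated, and the function names are illustrative.

```python
def calibration_table(probs, outcomes, n_bins=10):
    """Group forecasts into equal-width probability bins and compare to outcomes."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue  # skip empty bins
        avg_pred = sum(p for p, _ in members) / len(members)
        empirical = sum(y for _, y in members) / len(members)
        rows.append({
            "bin": (i / n_bins, (i + 1) / n_bins),
            "count": len(members),
            "avg_pred": avg_pred,
            "empirical": empirical,
            "gap": abs(avg_pred - empirical),
        })
    return rows


def expected_calibration_error(rows):
    """Count-weighted average of the per-bin absolute gaps."""
    n = sum(r["count"] for r in rows)
    return sum(r["count"] / n * r["gap"] for r in rows)
```

Weighting by bin count means the small high-confidence bins contribute little to ECE even when their gaps are large, so the per-bin table is worth reading alongside the single number.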

## Horizon Breakdown

| Days until resolution bucket | N | Avg days | GLM Brier | Market Brier | Skill vs market |
| --- | ---: | ---: | ---: | ---: | ---: |
| 1.04-2.25 | 124 | 1.88 | 0.2579 | 0.1738 | -48.36% |
| 2.25-3.42 | 117 | 2.79 | 0.2738 | 0.2132 | -28.40% |
| 3.42-5.92 | 166 | 4.66 | 0.2177 | 0.1801 | -20.85% |
| 5.92-6.96 | 73 | 6.76 | 0.2553 | 0.2231 | -14.43% |

The model underperformed the market in every horizon bucket. The gap was largest at the shortest horizons, where market prices are usually most information-rich.
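The per-bucket comparison can be sketched as below. The bucket edges are passed in explicitly (for example the boundaries from the table above); how the report chose those edges is not stated, and the function name `skill_by_horizon` is illustrative.

```python
from bisect import bisect_right


def skill_by_horizon(days, model_probs, market_probs, outcomes, edges):
    """Brier scores and skill vs. market per days-until-resolution bucket.

    `edges` is a sorted list of bucket boundaries; values outside the range
    are clamped into the first or last bucket (an assumed convention).
    """
    buckets = {}
    for d, m, q, y in zip(days, model_probs, market_probs, outcomes):
        i = min(max(bisect_right(edges, d) - 1, 0), len(edges) - 2)
        b = buckets.setdefault(i, {"n": 0, "model_se": 0.0, "market_se": 0.0})
        b["n"] += 1
        b["model_se"] += (m - y) ** 2
        b["market_se"] += (q - y) ** 2

    rows = []
    for i in sorted(buckets):
        b = buckets[i]
        model_brier = b["model_se"] / b["n"]
        market_brier = b["market_se"] / b["n"]
        # Skill is undefined when the market is perfect in a bucket.
        skill = 1 - model_brier / market_brier if market_brier > 0 else float("nan")
        rows.append({
            "bucket": (edges[i], edges[i + 1]),
            "n": b["n"],
            "model_brier": model_brier,
            "market_brier": market_brier,
            "skill": skill,
        })
    return rows
```

A call such as `skill_by_horizon(days, model, market, outcomes, [1.04, 2.25, 3.42, 5.92, 6.96])` would return one row per non-empty bucket.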

## Operational Notes

The OpenCode CLI worked, but it is not ideal as a high-throughput evaluator:

- 20-row and 50-row batches produced parseable JSON.
- One 50-row batch omitted a `row_id` and was correctly rejected and retried.
- One OpenCode call hung and had to be interrupted.
- We resumed safely from `predictions.csv` because the runner checkpoints after every accepted batch.

For larger runs, direct provider API access would be cleaner and faster. OpenCode is still useful for a quick agent/CLI parity check.
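The checkpoint-and-resume pattern, plus a timeout guard against hung calls, might look like the sketch below. The CSV column names, function names, and the 600-second timeout are assumptions for illustration; only the `opencode run -m ... --format json` command pattern comes from the setup above.

```python
import csv
import os
import subprocess

FIELDS = ["row_id", "prob_yes"]  # assumed checkpoint columns


def completed_ids(path):
    """Row ids already present in the checkpoint file (empty set if none yet)."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="") as f:
        return {row["row_id"] for row in csv.DictReader(f)}


def pending_batches(all_ids, done, batch_size):
    """Split the not-yet-completed ids into batches, preserving order."""
    todo = [r for r in all_ids if r not in done]
    return [todo[i:i + batch_size] for i in range(0, len(todo), batch_size)]


def append_batch(path, forecasts):
    """Append one accepted batch; write the header only when creating the file."""
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        for fc in forecasts:
            writer.writerow({k: fc[k] for k in FIELDS})


def run_opencode(prompt, timeout_s=600):
    """Run one batch; raises subprocess.TimeoutExpired instead of hanging forever."""
    cmd = ["opencode", "run", "-m", "zai-coding-plan/glm-5.2", "--format", "json", prompt]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    return result.stdout
```

On restart, `pending_batches(all_ids, completed_ids("predictions.csv"), 50)` yields only the remaining work, so an interrupted call costs at most one batch.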

## Artifacts

- Metrics JSON: [metrics.json](evals/glm52_kalshi_filtered/metrics.json)
- Predictions CSV: [predictions.csv](evals/glm52_kalshi_filtered/predictions.csv)
- Enriched predictions CSV: [predictions_enriched.csv](evals/glm52_kalshi_filtered/predictions_enriched.csv)
- Calibration CSV: [calibration.csv](evals/glm52_kalshi_filtered/calibration.csv)
- Horizon metrics CSV: [horizon_metrics.csv](evals/glm52_kalshi_filtered/horizon_metrics.csv)
- Category metrics CSV: [category_metrics.csv](evals/glm52_kalshi_filtered/category_metrics.csv)
- Raw accepted OpenCode JSONL/text outputs: [raw_opencode/](evals/glm52_kalshi_filtered/raw_opencode/)

## Takeaways

1. The evaluation pipeline is working end to end: OpenCode call, JSON validation, scoring, calibration, and website reporting.
2. GLM-5.2 text-only forecasting does not beat Kalshi market prices on proper scoring in this pilot.
3. Market price remains the first baseline to beat.
4. The next serious evaluation should use a randomized and leakage-audited sample, direct API calls, multiclass scoring, and a point-in-time retrieval layer.
