Open benchmark
Frozen pairs. Source-disjoint folds. Calibration on the scoreboard.
Most published comparison algorithms report discrimination (AUC, error rates) on splits of their own choosing. This benchmark freezes the comparison pairs and the source-disjoint evaluation folds — committed to by a cryptographic split hash — and scores what the field does not otherwise measure: how honest the reported likelihood ratios are (calibration loss, Cllr − Cllr_min), alongside total Cllr and AUC.
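As a rough illustration of the three reported quantities, the sketch below computes Cllr from submitted LRs, approximates Cllr_min by recalibrating the same LRs with pool-adjacent-violators (PAV), and takes the difference as the calibration loss. The function names are hypothetical, and the PAV step is a simplified variant that assumes both classes are present and uses the empirical class proportions as the prior. The kit's evaluate.py remains the authoritative scorer and may differ in detail.

```python
import math

def cllr(lrs, labels):
    """Log-likelihood-ratio cost; labels are 1 (same-source) or 0 (different-source)."""
    ss = [lr for lr, y in zip(lrs, labels) if y == 1]
    ds = [lr for lr, y in zip(lrs, labels) if y == 0]
    # Same-source pairs are penalised for small LRs, different-source pairs for large ones.
    pen_ss = sum(math.log2(1 + 1 / lr) for lr in ss) / len(ss)
    pen_ds = sum(math.log2(1 + lr) for lr in ds) / len(ds)
    return 0.5 * (pen_ss + pen_ds)

def _pav(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to `values`."""
    blocks = []  # each block is [sum, count]
    for v in values:
        blocks.append([v, 1])
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def cllr_min(lrs, labels):
    """Cllr after the best monotone recalibration of the same LRs (simplified PAV)."""
    # Sort by LR; within ties put same-source first so PAV pools the tie into one block.
    order = sorted(range(len(lrs)), key=lambda i: (lrs[i], -labels[i]))
    post = _pav([labels[i] for i in order])  # calibrated P(same-source)
    n_ss = sum(labels)
    n_ds = len(labels) - n_ss
    pen_ss = pen_ds = 0.0
    for i, p in zip(order, post):
        if labels[i] == 1:
            # 1/LR = (1 - p)/p * n_ss/n_ds; p > 0 for any block holding a same-source pair.
            pen_ss += math.log2(1 + (1 - p) / p * n_ss / n_ds)
        else:
            # LR = p/(1 - p) * n_ds/n_ss; p < 1 for any block holding a different-source pair.
            pen_ds += math.log2(1 + p / (1 - p) * n_ds / n_ss)
    return 0.5 * (pen_ss / n_ss + pen_ds / n_ds)

def calibration_loss(lrs, labels):
    return cllr(lrs, labels) - cllr_min(lrs, labels)
```

A system that always reports LR = 1 scores Cllr = 1 with zero calibration loss: it is uninformative but honest. A system that separates the classes perfectly but reports timid LRs has Cllr_min = 0, so its whole Cllr is calibration loss.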
Frozen + content-addressed
Every pair is identified by the SHA-256 hashes of its marks; the split hash commits to pairs, labels, and folds. Hash equality means same benchmark — no silent re-splitting.
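The idea of committing to a split with a hash can be sketched as follows. The benchmark's real serialization is not described here, so the `pair_id` and `split_hash` functions, the `:` separator, and the JSON canonical form are all assumptions made for illustration; only `hashlib.sha256` is the actual primitive named above.

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def pair_id(mark_hash_a: str, mark_hash_b: str) -> str:
    """Hypothetical pair identifier derived from the two marks' SHA-256 hashes."""
    # Sort first so (a, b) and (b, a) name the same pair.
    lo, hi = sorted((mark_hash_a, mark_hash_b))
    return sha256_hex(f"{lo}:{hi}".encode())

def split_hash(rows) -> str:
    """Hypothetical commitment over (pair_id, label, fold) rows."""
    # Sorting makes the hash independent of row order; any change to a pair,
    # a label, or a fold assignment changes the serialized bytes and the digest.
    canonical = json.dumps(sorted(rows), separators=(",", ":"))
    return sha256_hex(canonical.encode())

# Usage with made-up mark contents.
mark_a = sha256_hex(b"scan A")
mark_b = sha256_hex(b"scan B")
rows = [(pair_id(mark_a, mark_b), 1, 0)]
commitment = split_hash(rows)
```

Under this scheme, two parties holding the same digest hold the same pairs, labels, and folds, which is what rules out silent re-splitting.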
Source-disjoint by construction
A fold's test pairs involve only held-out barrels, slides, or tool edges. The contract: calibrate each pair's LR without using labels from either of its sources.
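A minimal sketch of what this contract implies in code, assuming each pair is represented as a hypothetical `(source_a, source_b, label)` tuple (the kit's actual record layout may differ): one helper lists training pairs that would break a fold's disjointness, and another selects the labelled pairs that may be used to calibrate a given pair.

```python
def fold_violations(train_pairs, test_pairs):
    """Training pairs that touch a source appearing in the fold's test pairs."""
    test_sources = {s for a, b, _ in test_pairs for s in (a, b)}
    return [p for p in train_pairs if p[0] in test_sources or p[1] in test_sources]

def calibration_pool(target, labelled_pairs):
    """Labelled pairs sharing no source with `target`; only these may calibrate it."""
    excluded = {target[0], target[1]}
    return [p for p in labelled_pairs if p[0] not in excluded and p[1] not in excluded]

# Usage with made-up barrel names.
pairs = [
    ("barrel-1", "barrel-1", 1),
    ("barrel-1", "barrel-2", 0),
    ("barrel-2", "barrel-3", 0),
    ("barrel-3", "barrel-3", 1),
]
target = ("barrel-1", "barrel-2", 0)
pool = calibration_pool(target, pairs)  # keeps only the barrel-3 same-source pair
```

An empty result from `fold_violations` means the fold is source-disjoint in the sense described above.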
Replicable offline
The kit ships the frozen pairs, folds, provenance, the scorer, and a standalone evaluate.py whose output equals the leaderboard scoring exactly.
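Before scoring offline or posting, it can help to check a submission against the stated contract of one finite, positive LR per frozen pair. The sketch below is an assumption-laden helper, not part of the kit: `build_submission_csv` and `build_payload` are hypothetical names, and the only details taken from this page are the `pair_id,lr` header and the JSON field names used in the submission requests.

```python
import json
import math

def build_submission_csv(lr_by_pair, frozen_pair_ids):
    """Return CSV text with one finite, positive LR per frozen pair, in frozen order."""
    frozen = list(frozen_pair_ids)
    frozen_set = set(frozen)
    missing = [p for p in frozen if p not in lr_by_pair]
    extra = [p for p in lr_by_pair if p not in frozen_set]
    if missing or extra:
        raise ValueError(f"{len(missing)} pairs missing, {len(extra)} unknown pairs")
    lines = ["pair_id,lr"]
    for pid in frozen:
        lr = float(lr_by_pair[pid])
        # Reject NaN, infinities, zero, and negative values.
        if not (math.isfinite(lr) and lr > 0):
            raise ValueError(f"LR for {pid} must be finite and positive, got {lr!r}")
        lines.append(f"{pid},{lr!r}")
    return "\n".join(lines)

def build_payload(submitter, method, url, csv_text):
    """JSON body using the field names shown in the submission requests."""
    return json.dumps(
        {"submitter": submitter, "method": method, "url": url, "csv": csv_text}
    )

# Usage with made-up pair identifiers.
csv_text = build_submission_csv({"p1": 2.0, "p2": 0.5}, ["p1", "p2"])
payload = build_payload("you", "your-method", "https://example.org", csv_text)
```

Failing early on a missing pair or a non-finite LR avoids discovering the problem only after the scorer rejects the file.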
Bullet lands, pooled (Hamby-252 & 173, PGPD Beretta, Phoenix)
Bullet lands · striated
- Pairs: 1,901
- Same-source: 146
- Sources: 38
- Frozen folds: 10
- Split hash: 7aa24c56cf76…
| # | Method | Cllr (rank) | Cllr_min | Calibration loss | AUC | Date |
|---|---|---|---|---|---|---|
| 1 | bullet-contrast (reference, Verity) | 0.205 ±0.13 | 0.136 | +0.069 | 0.979 | 2026-06-11 |
Submit your method
curl -X POST https://data.verity.codes/benchmark/splits/bullets-v1/submissions \
  -H 'Content-Type: application/json' \
  -d '{"submitter": "you", "method": "your-method", "url": "https://…",
       "csv": "pair_id,lr\n<one LR per frozen pair>"}'

One finite, positive likelihood ratio per frozen pair. The kit's evaluate.py scores your submission offline with the identical code: what you see locally is what the leaderboard records.
Cartridge breech faces (Fadul, 10 consecutively-manufactured slides)
Cartridge breech faces · impressed
- Pairs: 190
- Same-source: 10
- Sources: 10
- Frozen folds: 10
- Split hash: 31aea8c30185…
| # | Method | Cllr (rank) | Cllr_min | Calibration loss | AUC | Date |
|---|---|---|---|---|---|---|
| 1 | cmr-2d (reference, Verity) | 0.398 ±0.20 | 0.282 | +0.116 | 0.922 | 2026-06-11 |
Submit your method
curl -X POST https://data.verity.codes/benchmark/splits/cartridge-v1/submissions \
  -H 'Content-Type: application/json' \
  -d '{"submitter": "you", "method": "your-method", "url": "https://…",
       "csv": "pair_id,lr\n<one LR per frozen pair>"}'

One finite, positive likelihood ratio per frozen pair. The kit's evaluate.py scores your submission offline with the identical code: what you see locally is what the leaderboard records.
Screwdriver toolmarks (tmaRks, consecutively manufactured; tool edges)
Screwdriver toolmarks · striated
- Pairs: 167,332
- Same-source: 3,530
- Sources: 56
- Frozen folds: 10
- Split hash: 4ead8b698bee…
| # | Method | Cllr (rank) | Cllr_min | Calibration loss | AUC | Date |
|---|---|---|---|---|---|---|
| 1 | cmr-1d (reference, Verity) | 0.330 ±0.05 | 0.309 | +0.021 | 0.943 | 2026-06-11 |
Submit your method
curl -X POST https://data.verity.codes/benchmark/splits/toolmark-v1/submissions \
  -H 'Content-Type: application/json' \
  -d '{"submitter": "you", "method": "your-method", "url": "https://…",
       "csv": "pair_id,lr\n<one LR per frozen pair>"}'

One finite, positive likelihood ratio per frozen pair. The kit's evaluate.py scores your submission offline with the identical code: what you see locally is what the leaderboard records.
The ground-truth labels are public (the underlying scans are open data), so this is a replication benchmark, not a blind contest. The leaderboard ranks by total Cllr, a proper scoring rule, and the submission contract asks for source-disjoint calibration, the same discipline Verity's own reference rows follow. Verity's baselines are calibrated with each pair's two sources left out, and they are reproducible from the public catalog with verity-build-benchmark.