Frozen pairs. Source-disjoint folds. Calibration on the scoreboard.

Most published comparison algorithms are evaluated on discrimination alone (AUC, error rates), on splits chosen by their authors. This benchmark freezes the comparison pairs and the source-disjoint evaluation folds, commits to both with a cryptographic split hash, and scores what the field does not otherwise measure: how honest the reported likelihood ratios are (calibration loss, Cllr − Cllr_min), alongside total Cllr (the log-likelihood-ratio cost) and AUC.
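To make the three scores concrete, here is a minimal sketch of how Cllr, Cllr_min, and calibration loss are commonly computed. It uses the standard log-likelihood-ratio cost and the usual pool-adjacent-violators (PAV) route to Cllr_min. Whether the kit's evaluate.py does exactly this is an assumption, and the function names are illustrative rather than taken from the kit.

```python
import math

def cllr(lrs, labels):
    """Log-likelihood-ratio cost. labels: 1 = same source, 0 = different source."""
    ss = [math.log2(1 + 1 / lr) for lr, y in zip(lrs, labels) if y == 1]
    ds = [math.log2(1 + lr) for lr, y in zip(lrs, labels) if y == 0]
    return 0.5 * (sum(ss) / len(ss) + sum(ds) / len(ds))

def pav(ys):
    """Pool-adjacent-violators: least-squares nondecreasing fit to ys."""
    blocks = []  # each block is [sum, count]
    for y in ys:
        blocks.append([y, 1])
        # Merge while the previous block's mean is >= the last block's mean.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] >= blocks[-1][0] * blocks[-2][1]:
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    return [s / n for s, n in blocks for _ in range(n)]

def cllr_min(lrs, labels):
    """Cllr after the best monotone recalibration of the LRs (via PAV)."""
    # Sort by LR; on ties put same-source first so PAV pools tied scores.
    order = sorted(range(len(lrs)), key=lambda i: (lrs[i], -labels[i]))
    y = [labels[i] for i in order]
    p = pav(y)  # calibrated posterior P(same source) at the empirical prior
    n_ss = sum(y)
    n_ds = len(y) - n_ss
    prior_odds = n_ss / n_ds
    # Convert posteriors back to LRs by dividing out the prior odds.
    # PAV guarantees p > 0 for same-source and p < 1 for different-source items.
    ss = [math.log2(1 + prior_odds * (1 - pi) / pi) for pi, yi in zip(p, y) if yi == 1]
    ds = [math.log2(1 + pi / ((1 - pi) * prior_odds)) for pi, yi in zip(p, y) if yi == 0]
    return 0.5 * (sum(ss) / n_ss + sum(ds) / n_ds)

# Toy data, not benchmark results.
lrs = [8.0, 0.5, 2.0, 0.25]
labels = [1, 1, 0, 0]
total = cllr(lrs, labels)
best = cllr_min(lrs, labels)
print(f"Cllr={total:.3f}  Cllr_min={best:.3f}  calibration loss={total - best:+.3f}")
```

A system that always reports LR = 1 scores Cllr = 1, so values below 1 indicate that the reported LRs carry usable information. Calibration loss is the part of Cllr that a monotone recalibration of the same scores could have removed.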

Frozen + content-addressed

Every pair is identified by the SHA-256 hashes of its marks; the split hash commits to pairs, labels, and folds. Hash equality means same benchmark — no silent re-splitting.
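The idea of a content-addressed split can be sketched as follows. The text above does not specify how the benchmark serializes pairs, labels, and folds before hashing, so the canonical JSON encoding, the pair-id format, and the field names here are all assumptions made for illustration. Only the use of SHA-256 comes from the description.

```python
import hashlib
import json

def mark_hash(scan_bytes: bytes) -> str:
    """SHA-256 of a mark's raw bytes, as a hex string."""
    return hashlib.sha256(scan_bytes).hexdigest()

def pair_id(hash_a: str, hash_b: str) -> str:
    """Order-independent pair identifier (hypothetical format)."""
    lo, hi = sorted((hash_a, hash_b))
    return f"{lo}:{hi}"

def split_hash(pairs) -> str:
    """Commit to every pair's id, label, and fold with a single hash.

    pairs: list of dicts with keys "pair_id", "label", "fold" (assumed schema).
    Sorting the pairs and the keys makes the hash independent of input order.
    """
    canonical = json.dumps(
        sorted(pairs, key=lambda p: p["pair_id"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

split = [
    {"pair_id": pair_id(mark_hash(b"mark-1"), mark_hash(b"mark-2")), "label": 1, "fold": 0},
    {"pair_id": pair_id(mark_hash(b"mark-1"), mark_hash(b"mark-3")), "label": 0, "fold": 1},
]
print(split_hash(split)[:12])
```

Under this scheme, moving a single pair to a different fold or flipping one label changes the split hash, which is what rules out silent re-splitting.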

Source-disjoint by construction

A fold's test pairs involve only held-out barrels, slides, or tool edges. The contract: calibrate each pair's likelihood ratio (LR) without using labels from either of its sources.
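The contract can be read as a filter on the calibration data. The sketch below assumes each pair is a dict with hypothetical "source_a" and "source_b" keys. The kit's actual schema may differ.

```python
def sources_of(pair):
    """The set of sources (barrel, slide, or tool edge) a pair involves."""
    return {pair["source_a"], pair["source_b"]}

def calibration_pairs(target, pool):
    """Pairs from pool that may be used to calibrate target's LR.

    A pair qualifies only if it shares no source with the target pair.
    """
    held_out = sources_of(target)
    return [p for p in pool if not (sources_of(p) & held_out)]

def fold_is_source_disjoint(test_pairs, train_pairs):
    """True if no source in the test pairs also appears in the training pairs."""
    test_sources = set().union(*(sources_of(p) for p in test_pairs))
    train_sources = set().union(*(sources_of(p) for p in train_pairs))
    return test_sources.isdisjoint(train_sources)

pairs = [
    {"pair_id": "p1", "source_a": "A", "source_b": "B"},
    {"pair_id": "p2", "source_a": "A", "source_b": "C"},
    {"pair_id": "p3", "source_a": "C", "source_b": "D"},
    {"pair_id": "p4", "source_a": "D", "source_b": "D"},
]
usable = calibration_pairs(pairs[0], pairs)
print([p["pair_id"] for p in usable])
```

Here p2 is excluded when calibrating p1 because both involve source A, even though p2 is a different pair.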

Replicable offline

The kit ships the frozen pairs, folds, provenance, the scorer, and a standalone evaluate.py whose output equals the leaderboard scoring exactly.

Bullet lands, pooled (Hamby-252 & 173, PGPD Beretta, Phoenix)

Bullet lands · striated

Pairs: 1,901
Same-source: 146
Sources: 38
Frozen folds: 10
Split hash: 7aa24c56cf76…

#	Method	Cllr (rank)	Cllr_min	Calibration loss	AUC	Date
1	bullet-contrast (reference, Verity)	0.205 ±0.13	0.136	+0.069	0.979	2026-06-11

Download replication kit ↓

Submit your method

curl -X POST https://data.verity.codes/benchmark/splits/bullets-v1/submissions \
  -H 'Content-Type: application/json' \
  -d '{"submitter": "you", "method": "your-method", "url": "https://…",
       "csv": "pair_id,lr\n<one LR per frozen pair>"}'

One finite, positive likelihood ratio per frozen pair. The kit's evaluate.py scores your submission offline with the identical code — what you see locally is what the leaderboard records.
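Before posting, it can help to build and check the request body locally. The sketch below assembles the JSON shown in the curl example (fields submitter, method, url, csv; CSV columns pair_id and lr) and enforces the "one finite, positive LR per frozen pair" rule. The function name is illustrative, and the trailing newline in the CSV is an assumption about what the endpoint accepts.

```python
import csv
import io
import json
import math

def build_payload(lrs, frozen_pair_ids, submitter, method, url):
    """Return the JSON request body for a submission.

    lrs: dict mapping pair_id -> likelihood ratio.
    frozen_pair_ids: the pair ids of the frozen split, in the order to emit.
    """
    missing = set(frozen_pair_ids) - set(lrs)
    extra = set(lrs) - set(frozen_pair_ids)
    if missing or extra:
        raise ValueError(f"{len(missing)} pairs missing, {len(extra)} unknown pairs")

    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["pair_id", "lr"])
    for pid in frozen_pair_ids:
        lr = float(lrs[pid])
        # Reject 0, negatives, inf, and NaN: each LR must be finite and positive.
        if not (math.isfinite(lr) and lr > 0):
            raise ValueError(f"LR for {pid} must be finite and positive, got {lr}")
        writer.writerow([pid, repr(lr)])

    return json.dumps(
        {"submitter": submitter, "method": method, "url": url, "csv": buf.getvalue()}
    )

body = build_payload({"a": 2.5, "b": 0.1}, ["a", "b"], "you", "your-method", "https://example.org")
print(body)
```

The resulting string can be written to a file and passed to curl with -d @file in place of the inline JSON.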

Cartridge breech faces (Fadul, 10 consecutively-manufactured slides)

Cartridge breech faces · impressed

Pairs: 190
Same-source: 10
Sources: 10
Frozen folds: 10
Split hash: 31aea8c30185…

#	Method	Cllr (rank)	Cllr_min	Calibration loss	AUC	Date
1	cmr-2d (reference, Verity)	0.398 ±0.20	0.282	+0.116	0.922	2026-06-11

Download replication kit ↓

Submit your method

curl -X POST https://data.verity.codes/benchmark/splits/cartridge-v1/submissions \
  -H 'Content-Type: application/json' \
  -d '{"submitter": "you", "method": "your-method", "url": "https://…",
       "csv": "pair_id,lr\n<one LR per frozen pair>"}'

Screwdriver toolmarks (tmaRks, consecutively manufactured; tool edges)

Screwdriver toolmarks · striated

Pairs: 167,332
Same-source: 3,530
Sources: 56
Frozen folds: 10
Split hash: 4ead8b698bee…

#	Method	Cllr (rank)	Cllr_min	Calibration loss	AUC	Date
1	cmr-1d (reference, Verity)	0.330 ±0.05	0.309	+0.021	0.943	2026-06-11

Download replication kit ↓

Submit your method

curl -X POST https://data.verity.codes/benchmark/splits/toolmark-v1/submissions \
  -H 'Content-Type: application/json' \
  -d '{"submitter": "you", "method": "your-method", "url": "https://…",
       "csv": "pair_id,lr\n<one LR per frozen pair>"}'

The ground-truth labels are public (the underlying scans are open data), so this is a replication benchmark, not a blind contest. The leaderboard ranks by total Cllr, a proper scoring rule, and the submission contract asks for source-disjoint calibration, the same discipline Verity's own reference rows follow. Verity's baselines are calibrated with both of a pair's sources left out, and they are reproducible from the public catalog with verity-build-benchmark.