Open benchmark

Frozen pairs. Source-disjoint folds. Calibration on the scoreboard.

Most published comparison algorithms report discrimination (AUC, error rates) on splits of their own choosing. This benchmark freezes the comparison pairs and the source-disjoint evaluation folds, commits to both with a cryptographic split hash, and scores what the field does not otherwise measure: how honest the reported likelihood ratios are (calibration loss, Cllr − Cllr_min), alongside total Cllr and AUC.
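For readers who want the metrics spelled out, here is a minimal sketch of the standard definitions. Cllr is the log-likelihood-ratio cost, and Cllr_min is the same cost after an optimal monotone recalibration found by pool-adjacent-violators (PAV). The function names are illustrative, the PAV step omits refinements that some implementations add, and the kit's scorer remains the authority on exact leaderboard values.

```python
import math


def cllr(lrs, labels):
    """Log-likelihood-ratio cost. labels[i] is True for a same-source pair.

    Assumes every LR is finite and positive, as the submission contract requires.
    """
    ss = [lr for lr, same in zip(lrs, labels) if same]
    ds = [lr for lr, same in zip(lrs, labels) if not same]
    if not ss or not ds:
        raise ValueError("need at least one pair of each class")
    ss_cost = sum(math.log2(1 + 1 / lr) for lr in ss) / len(ss)
    ds_cost = sum(math.log2(1 + lr) for lr in ds) / len(ds)
    return 0.5 * (ss_cost + ds_cost)


def _pav(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to `values`."""
    blocks = []  # each block is [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge while the previous block's mean is >= the last block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] >= blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def cllr_min(lrs, labels):
    """Cllr after the best monotone recalibration of the submitted LRs."""
    # Sort by LR; within ties put same-source first so PAV pools tied values.
    order = sorted(range(len(lrs)), key=lambda i: (lrs[i], not labels[i]))
    y = [1.0 if labels[i] else 0.0 for i in order]
    n_ss = sum(y)
    n_ds = len(y) - n_ss
    if n_ss == 0 or n_ds == 0:
        raise ValueError("need at least one pair of each class")
    prior_odds = n_ss / n_ds
    ss_cost = ds_cost = 0.0
    for p, target in zip(_pav(y), y):
        # p is a calibrated posterior; dividing out prior_odds turns it into an LR.
        if target:
            ss_cost += math.log2(1 + prior_odds * (1 - p) / p)
        else:
            ds_cost += math.log2(1 + p / ((1 - p) * prior_odds))
    return 0.5 * (ss_cost / n_ss + ds_cost / n_ds)
```

Calibration loss is then cllr(lrs, labels) − cllr_min(lrs, labels). The bullet-lands leaderboard row, for example, reports 0.205 − 0.136 = 0.069.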

Frozen + content-addressed

Every pair is identified by the SHA-256 hashes of its marks; the split hash commits to pairs, labels, and folds. Equal split hashes mean the same benchmark, with no silent re-splitting.
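The page does not spell out the hashing scheme, so the following is only a sketch of how content addressing of this kind can work. The serialization, the separator, and the names pair_id and split_hash are assumptions for illustration, not the benchmark's actual format, and the digests it produces will not match the published split hashes.

```python
import hashlib
import json


def pair_id(mark_hash_a, mark_hash_b):
    """Order-independent identifier built from the two marks' SHA-256 hex digests.

    Hypothetical format: the two digests, sorted, joined by a hyphen.
    """
    lo, hi = sorted((mark_hash_a, mark_hash_b))
    return f"{lo}-{hi}"


def split_hash(rows):
    """Commit to pairs, labels, and folds with a single SHA-256 digest.

    rows: iterable of (pair_id, same_source, fold). Sorting first makes the
    digest independent of row order; the JSON layout is an assumption.
    """
    canonical = json.dumps(
        sorted((pid, bool(same), int(fold)) for pid, same, fold in rows),
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Under this scheme, changing any label, fold assignment, or pair changes the digest, which is the property the split hash relies on.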

Source-disjoint by construction

A fold's test pairs involve only held-out barrels, slides, or tool edges. The contract: calibrate each pair's LR without using labels from either of its sources.

Replicable offline

The kit ships the frozen pairs, folds, provenance, the scorer, and a standalone evaluate.py whose output equals the leaderboard scoring exactly.

Bullet lands, pooled (Hamby-252 & 173, PGPD Beretta, Phoenix)

Bullet lands · striated
Pairs: 1,901
Same-source: 146
Sources: 38
Frozen folds: 10
Split hash: 7aa24c56cf76…

#  Method  Cllr (rank)  Cllr_min  Calibration loss  AUC    Date
1  Verity  0.205 ±0.13  0.136     +0.069            0.979  2026-06-11

Download replication kit ↓
Submit your method
curl -X POST https://data.verity.codes/benchmark/splits/bullets-v1/submissions \
  -H 'Content-Type: application/json' \
  -d '{"submitter": "you", "method": "your-method", "url": "https://…",
       "csv": "pair_id,lr\n<one LR per frozen pair>"}'

Submit one finite, positive likelihood ratio per frozen pair. The kit's evaluate.py scores your submission offline with the identical code, so what you see locally is what the leaderboard records.
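The request body in the curl example can also be assembled programmatically. The sketch below builds the same JSON shape (submitter, method, url, csv) and rejects any LR that is not finite and positive before anything is sent. build_payload is a hypothetical helper, and the server may apply further checks that this page does not describe.

```python
import json
import math


def build_payload(lrs, submitter, method, url):
    """Return the JSON request body for a submission.

    lrs: mapping of pair_id -> likelihood ratio, one entry per frozen pair.
    """
    lines = ["pair_id,lr"]
    for pid, lr in lrs.items():
        if not (math.isfinite(lr) and lr > 0):
            raise ValueError(f"LR for {pid!r} must be finite and positive: {lr!r}")
        # repr() keeps full float precision in the CSV text.
        lines.append(f"{pid},{lr!r}")
    return json.dumps({
        "submitter": submitter,
        "method": method,
        "url": url,
        "csv": "\n".join(lines),
    })
```

The returned string can be written to a file and passed to curl with -d @file, or posted with any HTTP client.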

Cartridge breech faces (Fadul, 10 consecutively-manufactured slides)

Cartridge breech faces · impressed
Pairs: 190
Same-source: 10
Sources: 10
Frozen folds: 10
Split hash: 31aea8c30185…

#  Method                      Cllr (rank)  Cllr_min  Calibration loss  AUC    Date
1  cmr-2d (reference), Verity  0.398 ±0.20  0.282     +0.116            0.922  2026-06-11

Download replication kit ↓
Submit your method
curl -X POST https://data.verity.codes/benchmark/splits/cartridge-v1/submissions \
  -H 'Content-Type: application/json' \
  -d '{"submitter": "you", "method": "your-method", "url": "https://…",
       "csv": "pair_id,lr\n<one LR per frozen pair>"}'


Screwdriver toolmarks (tmaRks, consecutively manufactured; tool edges)

Screwdriver toolmarks · striated
Pairs: 167,332
Same-source: 3,530
Sources: 56
Frozen folds: 10
Split hash: 4ead8b698bee…

#  Method                      Cllr (rank)  Cllr_min  Calibration loss  AUC    Date
1  cmr-1d (reference), Verity  0.330 ±0.05  0.309     +0.021            0.943  2026-06-11

Download replication kit ↓
Submit your method
curl -X POST https://data.verity.codes/benchmark/splits/toolmark-v1/submissions \
  -H 'Content-Type: application/json' \
  -d '{"submitter": "you", "method": "your-method", "url": "https://…",
       "csv": "pair_id,lr\n<one LR per frozen pair>"}'


The ground-truth labels are public (the underlying scans are open data), so this is a replication benchmark, not a blind contest. The leaderboard ranks by total Cllr, a proper scoring rule, and the submission contract asks for source-disjoint calibration, the same discipline Verity's own reference rows follow. Verity's baselines are calibrated with each pair's sources left out and are reproducible from the public catalog with verity-build-benchmark.