I see notes of this sprinkled throughout the piece, but is there any consideration of how people could put more focused effort into meta-evals or exam security? (random mumblings below)
Treating the metric like published software:
Versioned specs: Write a formal doc (task, scoring rule, aggregation, allowed metadata). Cut a new major version whenever the metric or dataset changes so scores are never cross-compared accidentally.
Reproducible containers: Make the judge/grader a Docker/OCI image whose digest is pinned inside the leaderboard. Every run is signed with that image digest plus a tamper-proof log of model outputs (see the sketch after this list).
Public “metric change log”: Every time a bug-fix or leakage patch ships, publish a diff plus the reason. That both discourages quiet back-patching and helps meta-evaluators regression-test old claims.
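A minimal sketch of what the versioned-spec and pinned-container idea could look like in a Python harness; MetricSpec, its fields, and verify_judge_image are hypothetical names for illustration, not an existing leaderboard API:

```python
# Sketch: a versioned metric spec with a pinned judge image digest.
# All names here are illustrative, not part of any real leaderboard codebase.
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class MetricSpec:
    name: str
    major_version: int          # bump on any metric or dataset change
    scoring_rule: str           # e.g. "exact_match", "pairwise_judge"
    aggregation: str            # e.g. "mean", "majority_vote"
    judge_image_digest: str     # pinned OCI image digest ("sha256:...")

    def fingerprint(self) -> str:
        """Stable hash of the spec, published alongside every score."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

def verify_judge_image(spec: MetricSpec, running_digest: str) -> None:
    """Refuse to score if the judge container doesn't match the pinned digest."""
    if running_digest != spec.judge_image_digest:
        raise RuntimeError(
            f"Judge image {running_digest} does not match pinned "
            f"{spec.judge_image_digest}; scores would not be comparable."
        )

spec = MetricSpec(
    name="helpfulness-judge",
    major_version=3,
    scoring_rule="pairwise_judge",
    aggregation="majority_vote",
    judge_image_digest="sha256:" + "0" * 64,  # placeholder digest
)
print(spec.fingerprint())
```

The fingerprint doubles as the entry in the public change log: any silent edit to the task, scoring rule, or pinned image produces a new hash, so old scores can't quietly be compared against the patched metric.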
Adversarial red-team for the metric itself:
Threat-modelling the judge
Agents and humans try to break the metric
Patch and re-test found exploits
Keep a private slice of the test set to check for drift and signs of gaming (see the sketch after this list)
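One way the private-slice comparison could work, sketched in Python; the 0.05 gap threshold and the function names are illustrative assumptions:

```python
# Sketch of a drift / gaming check: compare a submission's score on the public
# slice against a held-out private slice. Thresholds and names are illustrative.
from statistics import mean

def slice_gap(public_scores: list[float], private_scores: list[float]) -> float:
    """Positive gap = the model does better on the public slice than the private one."""
    return mean(public_scores) - mean(private_scores)

def flag_for_review(public_scores, private_scores, max_gap: float = 0.05) -> bool:
    """Flag submissions whose public-slice advantage exceeds the tolerated gap.

    A persistent positive gap is a red flag for leakage or metric gaming;
    a gap that grows over time suggests the public slice has gone stale.
    """
    return slice_gap(public_scores, private_scores) > max_gap

# Example: a model that is noticeably better on the public slice gets flagged.
print(flag_for_review([0.82, 0.80, 0.85], [0.71, 0.69, 0.74]))  # True
```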
Layering:
Use 3-to-5 judge models from different labs with different prompts, majority vote, etc. Randomly subsample the panel on each evaluation run so attackers cannot tune to a fixed judge.
Spot checks: route 1-2% of model × prompt × judge combinations to human annotators (see the sketch after this list).
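A rough sketch of that layering in Python; the judge callables, panel size, and the ~1.5% human-routing rate are placeholder assumptions:

```python
# Sketch of a layered judging pass: randomly subsample a judge panel, take a
# majority vote, and route a small fraction of items to human annotators.
import random
from collections import Counter
from typing import Callable

Judge = Callable[[str, str], str]   # (prompt, model_output) -> verdict label

def layered_verdict(
    prompt: str,
    output: str,
    judge_pool: list[Judge],        # judges from different labs / prompt templates
    panel_size: int = 3,
    human_rate: float = 0.015,      # ~1.5% of items go to human spot checks
    rng: random.Random | None = None,
) -> tuple[str, bool]:
    rng = rng or random.Random()
    # Random subsample each run, so attackers can't tune to a fixed judge.
    panel = rng.sample(judge_pool, k=min(panel_size, len(judge_pool)))
    votes = Counter(judge(prompt, output) for judge in panel)
    verdict, _ = votes.most_common(1)[0]
    needs_human = rng.random() < human_rate
    return verdict, needs_human

# Toy usage with dummy judges that always answer "pass" or "fail".
judges = [lambda p, o: "pass", lambda p, o: "pass", lambda p, o: "fail",
          lambda p, o: "pass", lambda p, o: "fail"]
print(layered_verdict("Q?", "A.", judges, rng=random.Random(0)))
```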
People and incentives:
Bug bounties for demonstrations where the score inflates without improving a ground-truth safety/capability measure
Audit rotations: have different labs independently re-implement the metrics
Leaderboard freezes: scores are held for x days before release (beyond one red-team cycle) to minimise “press-release hacking”