I see notes of this sprinkled throughout the piece, but is there any consideration of how people could put more focused effort into meta-evals or exam security? (random mumblings below)
Treating the metric like published software:
Versioned specs: Write a formal doc (task, scoring rule, aggregation, allowed metadata). Cut a new major version whenever the metric or dataset changes so scores are never cross-compared accidentally.
Reproducible containers: Make the judge/grader a Docker/OCI image whose digest is pinned inside the leaderboard. Every run is signed with that image digest plus a tamper-proof log of model outputs (see the sketch after this list).
Public “metric change log”: Every time a bug-fix or leakage patch ships, publish a diff plus the reason. That both discourages quiet back-patching and helps meta-evaluators regression-test old claims.
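A minimal sketch of what the versioned-spec and pinned-container idea could look like in a Python harness; MetricSpec, its fields, and verify_judge_image are hypothetical names for illustration, not an existing leaderboard API:

```python
# Sketch: a versioned metric spec with a pinned judge image digest.
# All names here are illustrative, not part of any real leaderboard codebase.
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class MetricSpec:
    name: str
    major_version: int          # bump on any metric or dataset change
    scoring_rule: str           # e.g. "exact_match", "pairwise_judge"
    aggregation: str            # e.g. "mean", "majority_vote"
    judge_image_digest: str     # pinned OCI image digest ("sha256:...")

    def fingerprint(self) -> str:
        """Stable hash of the spec, published alongside every score."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

def verify_judge_image(spec: MetricSpec, running_digest: str) -> None:
    """Refuse to score if the judge container doesn't match the pinned digest."""
    if running_digest != spec.judge_image_digest:
        raise RuntimeError(
            f"Judge image {running_digest} does not match pinned "
            f"{spec.judge_image_digest}; scores would not be comparable."
        )

spec = MetricSpec(
    name="helpfulness-judge",
    major_version=3,
    scoring_rule="pairwise_judge",
    aggregation="majority_vote",
    judge_image_digest="sha256:" + "0" * 64,  # placeholder digest
)
print(spec.fingerprint())
```

The fingerprint doubles as the entry in the public change log: any silent edit to the task, scoring rule, or pinned image produces a new hash, so old scores can't quietly be compared against the patched metric.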
Adversarial red-team for the metric itself:
Threat-modelling the judge
Agents and humans try to break the metric
Patch and re-test found exploits
Keep a private slice of the test set to check for drift and signs of gaming (see the sketch after this list)
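One way the private-slice comparison could work, sketched in Python; the 0.05 gap threshold and the function names are illustrative assumptions:

```python
# Sketch of a drift / gaming check: compare a submission's score on the public
# slice against a held-out private slice. Thresholds and names are illustrative.
from statistics import mean

def slice_gap(public_scores: list[float], private_scores: list[float]) -> float:
    """Positive gap = the model does better on the public slice than the private one."""
    return mean(public_scores) - mean(private_scores)

def flag_for_review(public_scores, private_scores, max_gap: float = 0.05) -> bool:
    """Flag submissions whose public-slice advantage exceeds the tolerated gap.

    A persistent positive gap is a red flag for leakage or metric gaming;
    a gap that grows over time suggests the public slice has gone stale.
    """
    return slice_gap(public_scores, private_scores) > max_gap

# Example: a model that is noticeably better on the public slice gets flagged.
print(flag_for_review([0.82, 0.80, 0.85], [0.71, 0.69, 0.74]))  # True
```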
Layering:
Use 3-to-5 judge models from different labs with different prompts, majority vote, etc. Randomly subsample the panel on each evaluation run so attackers cannot tune to a fixed judge.
Spot checks: route 1-2% of model × prompt × judge combinations to human annotators (see the sketch after this list).
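A rough sketch of that layering in Python; the judge callables, panel size, and the ~1.5% human-routing rate are placeholder assumptions:

```python
# Sketch of a layered judging pass: randomly subsample a judge panel, take a
# majority vote, and route a small fraction of items to human annotators.
import random
from collections import Counter
from typing import Callable

Judge = Callable[[str, str], str]   # (prompt, model_output) -> verdict label

def layered_verdict(
    prompt: str,
    output: str,
    judge_pool: list[Judge],        # judges from different labs / prompt templates
    panel_size: int = 3,
    human_rate: float = 0.015,      # ~1.5% of items go to human spot checks
    rng: random.Random | None = None,
) -> tuple[str, bool]:
    rng = rng or random.Random()
    # Random subsample each run, so attackers can't tune to a fixed judge.
    panel = rng.sample(judge_pool, k=min(panel_size, len(judge_pool)))
    votes = Counter(judge(prompt, output) for judge in panel)
    verdict, _ = votes.most_common(1)[0]
    needs_human = rng.random() < human_rate
    return verdict, needs_human

# Toy usage with dummy judges that always answer "pass" or "fail".
judges = [lambda p, o: "pass", lambda p, o: "pass", lambda p, o: "fail",
          lambda p, o: "pass", lambda p, o: "fail"]
print(layered_verdict("Q?", "A.", judges, rng=random.Random(0)))
```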
People and incentives:
Bug bounties for demonstrations where the score inflates without improving a ground-truth safety/capability measure
Audit rotations: have different labs independently re-implement the metrics
Leaderboard freezes: scores are held for x days before release (beyond one red-team cycle) to minimise “press-release hacking”