Sam Marks comments on Towards Alignment Auditing as a Numbers-Go-Up Science

Sam Marks 5 Aug 2025 20:52 UTC
I agree that the hard/important part is the model organism construction. That said, I think having auditing agents is still a prerequisite for a bunch of the claims in this post. Auditing agents make auditing games (1) repeatable and (2) self-serve (in the sense that, e.g., a single researcher could in principle run their own agent-based auditing game evaluation multiple times over the course of a project to see if things are going well). If auditing games were inherently single-use (because of cost and the fact that you can’t rerun the game on unspoiled auditors after you’ve run it once), then I couldn’t reasonably describe auditing game performance as a progress metric.
I think I also probably take the continuous score more seriously than you do (rather than viewing 0% vs. non-0% win rate as being the important thing), though maybe I would change my mind if I thought about it more. (I’m not sure whether this is an important disagreement.)