Can you sketch out why you think that control is more amenable to property 2 than alignment is? (By “more amenable to property 2” I mean “all else equal, researchers trying to build high-quality evaluations will be more likely to succeed at building evaluations with property 2 for control than for alignment.”)
My best guess for why you think this is something like:
Control evaluations are largely tests of certain capabilities. As with other capabilities evaluations, you can just run your static evaluation set on new model generations without needing to substantially adapt the evaluation in response to non-capabilities ways that the new models differ. In other words, if new models have changes in architecture, training process, etc., those changes don't matter for your evaluations except insofar as they lead to stronger capabilities.
(Once I understand your view better I’ll try to say what I think about it.)