Buck comments on Why it’s hard to make settings for high-stakes control research

Buck 19 Jul 2025 4:20 UTC
LW: 4 AF: 2
For control evaluations, it doesn’t matter whether the models can tell they’re in an eval. We’re not trying to evaluate their propensity to do bad things; we’re evaluating their capability to do so. So we tell the AIs to try to do the side tasks, which makes it obvious to them that they’re in some kind of eval.