For control evaluations, it doesn’t matter whether the models can tell it’s an eval or not—we’re not trying to evaluate their propensity to do bad things, we’re evaluating their capability to do so, so we tell the AIs to try to do the side tasks (which makes it obvious to them that they’re in some kind of eval).
For control evaluations, it doesn’t matter whether the models can tell it’s an eval or not—we’re not trying to evaluate their propensity to do bad things, we’re evaluating their capability to do so, so we tell the AIs to try to do the side tasks (which makes it obvious to them that they’re in some kind of eval).