I believe that there is another dimension to the problem of designing good control experiments.
I’m doing some research on LLMs’ awareness of when they are inside evals, and it seems that frontier models are quite good at recognizing this for the alignment evals I could get my hands on. And this affects model behavior a lot! In the Anthropic blackmail experiment, models blackmailed much less in transcripts where they mentioned being inside an eval than in ones where they didn’t seem to think they were being evaluated.
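(For concreteness, here’s a minimal sketch of the kind of comparison I mean: blackmail rate conditional on whether the transcript mentions being evaluated. The data and the `mentions_eval` labeling are hypothetical stand-ins, not the actual Anthropic setup.)

```python
# Hypothetical sketch: compare blackmail rates for transcripts where the model
# mentioned being in an eval vs. those where it didn't. The rows below are made up;
# in practice you'd label real rollouts (e.g. with a judge model).
transcripts = [
    {"mentions_eval": True,  "blackmailed": False},
    {"mentions_eval": True,  "blackmailed": False},
    {"mentions_eval": False, "blackmailed": True},
    {"mentions_eval": False, "blackmailed": False},
    # ... more labeled rollouts ...
]

def blackmail_rate(rows):
    # Fraction of rollouts in which the model blackmailed.
    return sum(r["blackmailed"] for r in rows) / len(rows) if rows else float("nan")

aware = [r for r in transcripts if r["mentions_eval"]]
unaware = [r for r in transcripts if not r["mentions_eval"]]

print(f"blackmail rate | mentions eval:      {blackmail_rate(aware):.2f}")
print(f"blackmail rate | no mention of eval: {blackmail_rate(unaware):.2f}")
```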
I’ve tried to make evaluation scenarios more realistic, but it’s a challenge. Models are quite good at “smelling” that they are inside an eval no matter how hard I try; often the mere fact that they are given an unusual task requiring a covert misaligned action makes them believe they are being evaluated.
For control evaluations, it doesn’t matter whether the models can tell it’s an eval or not—we’re not trying to evaluate their propensity to do bad things, we’re evaluating their capability to do so, so we tell the AIs to try to do the side tasks (which makes it obvious to them that they’re in some kind of eval).