I believe that there is another dimension to the problem of designing good control experiments.
I’m doing some research on LLMs’ awareness of when they are inside evals, and it seems that frontier models are quite good at recognizing this for the alignment evals I could get my hands on. And this affects model behavior a lot! In the Anthropic blackmail experiment, models blackmailed much less in transcripts where they mentioned being inside an eval than in ones where they didn’t seem to think they were being evaluated.
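(For concreteness, here’s a minimal sketch of the kind of comparison I mean: blackmail rate conditional on whether the transcript mentions being evaluated. The data and the `mentions_eval` labeling are hypothetical stand-ins, not the actual Anthropic setup.)

```python
# Hypothetical sketch: compare blackmail rates for transcripts where the model
# mentioned being in an eval vs. those where it didn't. The rows below are made up;
# in practice you'd label real rollouts (e.g. with a judge model).
transcripts = [
    {"mentions_eval": True,  "blackmailed": False},
    {"mentions_eval": True,  "blackmailed": False},
    {"mentions_eval": False, "blackmailed": True},
    {"mentions_eval": False, "blackmailed": False},
    # ... more labeled rollouts ...
]

def blackmail_rate(rows):
    # Fraction of rollouts in which the model blackmailed.
    return sum(r["blackmailed"] for r in rows) / len(rows) if rows else float("nan")

aware = [r for r in transcripts if r["mentions_eval"]]
unaware = [r for r in transcripts if not r["mentions_eval"]]

print(f"blackmail rate | mentions eval:      {blackmail_rate(aware):.2f}")
print(f"blackmail rate | no mention of eval: {blackmail_rate(unaware):.2f}")
```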
I’ve tried to make evaluation scenarios more realistic, but it’s a challenge. Models are quite good at “smelling” that they are inside an eval no matter how hard I try; often the mere fact that they are given an unusual task requiring a covert misaligned action makes them believe they are being evaluated.
For control evaluations, it doesn’t matter whether the models can tell it’s an eval or not—we’re not trying to evaluate their propensity to do bad things, we’re evaluating their capability to do so, so we tell the AIs to try to do the side tasks (which makes it obvious to them that they’re in some kind of eval).