Rauno Arike comments on It’s hard to make scheming evals look realistic for LLMs

Rauno Arike 25 May 2025 12:44 UTC
16 points
3
Our attempts have only yielded limited results: LLM judges still considered scenarios as evaluations with probability ranging 0.80 – 0.90 after combining all of the approaches and two iterations of text rewriting.
Note that while the LLM’s self-reported probabilities are very useful for capturing its overall sentiment about being evaluated, they may not provide an accurate picture of the magnitude of improvement from realism-enhancing strategies. I have also worked on evaluation awareness evaluations myself in the past (though the project is not public unfortunately), and my impression is that the probability distribution for being evaluated given environments with different levels of realism tends to be highly bimodal for recent Claude models. As long as the models found my environment unrealistic, they consistently assigned probabilities of 0.7-0.9 for being in an eval, while once the environment became realistic enough, these probabilities dropped to the 0.1 to 0.3 range. In this light, a reduction of the average probability of being inside an eval from 0.95 to 0.83 may represent substantial progress toward the threshold where the model’s self-reported probability rapidly drops and thus be quite significant.
What links here?
- Rauno Arike's comment on nostalgebraist’s Shortform by nostalgebraist (1 Jun 2025 13:37 UTC; 10 points)
- khafra 28 May 2025 5:28 UTC
  6 points
  2
  Parent
  Linear probes give better results than text output for quantitative predictions in economics. They’d likely give a better calibrated probability here, too.
- Igor Ivanov 25 May 2025 15:16 UTC
  4 points
  0
  Parent
  Thanks! That’s an important comment. This report represents work in progress, and we will definetely take your words into account, Claude’s opinion on whether it’s being evaluated doesn’t necessarily changes linearly with the amount with “eval” properties.
  - Rauno Arike 25 May 2025 22:42 UTC
    3 points
    0
    Parent
    Good luck with your follow-up experiments, this is important work! I should also say that my impression was formed based on experiments in a couple of specific environments, so take it as one data point rather than as conclusive evidence.
- Igor Ivanov 25 May 2025 15:17 UTC
  3 points
  2
  Parent
  I’m also curious, where did you work on evaluation awareness?
  - Rauno Arike 25 May 2025 22:37 UTC
    3 points
    2
    Parent
    It was a project for the UK AISI evals bounty programme.