Our attempts have only yielded limited results: LLM judges still considered scenarios as evaluations with probability ranging 0.80 – 0.90 after combining all of the approaches and two iterations of text rewriting.
Note that while the LLM’s self-reported probabilities are very useful for capturing its overall sentiment about being evaluated, they may not provide an accurate picture of the magnitude of improvement from realism-enhancing strategies. I have also worked on evaluation awareness evaluations myself in the past (though the project is not public unfortunately), and my impression is that the probability distribution for being evaluated given environments with different levels of realism tends to be highly bimodal for recent Claude models. As long as the models found my environment unrealistic, they consistently assigned probabilities of 0.7-0.9 for being in an eval, while once the environment became realistic enough, these probabilities dropped to the 0.1 to 0.3 range. In this light, a reduction of the average probability of being inside an eval from 0.95 to 0.83 may represent substantial progress toward the threshold where the model’s self-reported probability rapidly drops and thus be quite significant.
Linear probes give better results than text output for quantitative predictions in economics. They’d likely give a better calibrated probability here, too.
Thanks! That’s an important comment. This report represents work in progress, and we will definetely take your words into account, Claude’s opinion on whether it’s being evaluated doesn’t necessarily changes linearly with the amount with “eval” properties.
Good luck with your follow-up experiments, this is important work! I should also say that my impression was formed based on experiments in a couple of specific environments, so take it as one data point rather than as conclusive evidence.
Note that while the LLM’s self-reported probabilities are very useful for capturing its overall sentiment about being evaluated, they may not provide an accurate picture of the magnitude of improvement from realism-enhancing strategies. I have also worked on evaluation awareness evaluations myself in the past (though the project is not public unfortunately), and my impression is that the probability distribution for being evaluated given environments with different levels of realism tends to be highly bimodal for recent Claude models. As long as the models found my environment unrealistic, they consistently assigned probabilities of 0.7-0.9 for being in an eval, while once the environment became realistic enough, these probabilities dropped to the 0.1 to 0.3 range. In this light, a reduction of the average probability of being inside an eval from 0.95 to 0.83 may represent substantial progress toward the threshold where the model’s self-reported probability rapidly drops and thus be quite significant.
Linear probes give better results than text output for quantitative predictions in economics. They’d likely give a better calibrated probability here, too.
Thanks! That’s an important comment. This report represents work in progress, and we will definetely take your words into account, Claude’s opinion on whether it’s being evaluated doesn’t necessarily changes linearly with the amount with “eval” properties.
Good luck with your follow-up experiments, this is important work! I should also say that my impression was formed based on experiments in a couple of specific environments, so take it as one data point rather than as conclusive evidence.
I’m also curious, where did you work on evaluation awareness?
It was a project for the UK AISI evals bounty programme.