This could also be explained by many real-world prompts being easy to judge as real, e.g. because they would have no value as evaluations.
Given just the results, it seems difficult to me to rule out the hypothesis that LLMs currently judge any high-stakes situation, or any situation involving a significant moral dilemma, as likely to be an evaluation. After all, in today's environments, this would be a pretty accurate heuristic!
(That said, it already seemed quite likely to me that models can tell when they are in evaluations most of the time, but I'm not sure how much this evidence moves me toward that hypothesis versus the one I suggest above.)
I think your hypothesis is consistent with our results: Claude mentioned a moral dilemma essentially always when it graded eval scenarios, and when we didn't ask it to preserve this conflict in a re-written scenario, the grader no longer judged it to be an eval.