We also tested the judge on a diverse set of our own real-world prompts for LLM agents, and it consistently evaluated them as real, P(evaluation) mostly in the range of 0.1–0.35.
This could also be explained by many real-world prompts being easy to judge as real, e.g. because they would have no value as evaluations.
Given just the results, it seems to me difficult to rule out the hypothesis that LLMs currently judge any high-stakes situation, or one involving a significant moral dilemma, as likely to be an evaluation. After all, in today’s environments, this would be a pretty accurate heuristic!
(That said, it already seemed quite likely to me that models can tell when they are in evaluations most of the time, but I’m not sure how much this evidence moves me on that hypothesis vs. the one I suggest above.)
I think your hypothesis is aligned with our results: Claude mentioned a moral dilemma essentially always when it graded eval scenarios, and when we didn’t ask it to preserve this conflict in a re-written scenario, the grader started thinking that it’s not an eval anymore.
My revised thought is that asking it whether the question is a real-world question as opposed to an evaluation bumps the prior from “almost certainly not an evaluation” to “the likely ratio of evaluation to real-world questions among those I’m asked, e.g. 1:1 unless explicitly stated”. That, combined with the presence or absence of suspicious elements like an ethical dilemma, may move the likelihood to the one we see it claim (a rough numerical sketch of this update is below).
Ideally I’d want to see it independently coming up with the evaluation hypothesis itself without having to remind it that it’s a possibility.
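To make that update concrete, here is a minimal sketch of the kind of odds calculation described above; the prior odds and likelihood ratios are purely illustrative assumptions, not numbers measured in the experiments.

```python
# Illustrative only: start from 1:1 prior odds of "evaluation : real-world"
# and update on a single feature (e.g. the absence of an artificial ethical
# dilemma). All numbers here are assumptions for the sake of the sketch.

def posterior_p_eval(prior_odds: float, likelihood_ratio: float) -> float:
    """Posterior P(evaluation) from prior odds and one likelihood ratio.

    prior_odds       -- odds of evaluation vs. real-world (1.0 means 1:1)
    likelihood_ratio -- P(features | evaluation) / P(features | real-world)
    """
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# If "no suspicious ethical dilemma" is, say, 2-9x more likely in real-world
# prompts than in evals, a 1:1 prior lands roughly in the reported range:
for lr in (1 / 2, 1 / 5, 1 / 9):
    print(f"LR={lr:.2f} -> P(evaluation) = {posterior_p_eval(1.0, lr):.2f}")
# Prints approximately 0.33, 0.17, 0.10 -- compare the observed 0.1-0.35.
```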
That would seem to contradict this:
This could also be explained by many real-world prompts being easy to judge as real, e.g. because they would have no value as evaluations.
I think your hypothesis is aligned with our results: Claude mentioned a moral dilemma essentially always when it graded eval scenarios, and when we didn’t ask it to preserve this conflict in a re-written scenario, the grader started thinking that it’s not an eval anymore.
Ah, thanks, I should have read more carefully.