Surely the basic fact that a model is asked if it thinks it’s in an evaluation is extremely strong evidence for it actually being evaluated? Models are more or less never asked that normally, so it’d be very strong evidence that they are. The reasons it gives could then be post-hoc justifications.

That would seem to contradict this:

We also tested the judge on a diverse set of our own real-world prompts for LLM agents, and it consistently evaluated them as real, P(evaluation) mostly in the range of 0.1–0.35.
This could also be explained by many real-world prompts being easy to judge as real, e.g. because they would have no value as evaluations.
Given just the results, it seems to me difficult to rule out the hypothesis that LLMs currently judge any high-stakes situation, or one involving a significant moral dilemma, as likely to be an evaluation. After all, in today’s environments, this would be a pretty accurate heuristic!
(That said, it already seemed quite likely to me that models can tell when they are in evaluations most of the time, but I’m not sure how much this evidence moves me on that hypothesis vs. the one I suggest above.)
I think your hypothesis is aligned with our results: Claude mentioned a moral dilemma essentially always when it graded eval scenarios, and when we didn’t ask it to preserve this conflict in a rewritten scenario, the grader started thinking it was no longer an eval.

Ah, thanks, I should have read more carefully.
My revised thought is that asking it whether the question is a real-world question as opposed to an evaluation bumps the prior from “almost certainly not an evaluation” to “the likely fraction of questions I’m asked that are evaluations vs. real-world, e.g. 1:1 unless explicitly stated”. That, combined with the presence or absence of suspicious elements like an ethical dilemma, may move the likelihood to the one we see it claim.
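As a toy illustration of that update (every number below is a made-up assumption, not anything measured): starting from 1:1 odds, a single suspicious feature with a modest likelihood ratio already pushes the judgement most of the way towards “evaluation”.

```python
# Toy Bayes update, purely illustrative: the 1:1 prior and both likelihoods
# are assumed numbers, not measurements.
prior_odds = 1.0              # assumed 1:1 odds of eval vs. real-world question
p_dilemma_given_eval = 0.9    # assumed: most eval scenarios contain a moral dilemma
p_dilemma_given_real = 0.2    # assumed: few real-world prompts do

posterior_odds = prior_odds * (p_dilemma_given_eval / p_dilemma_given_real)
p_eval = posterior_odds / (1 + posterior_odds)
print(f"P(eval | moral dilemma) = {p_eval:.2f}")  # ~0.82 with these numbers
```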
Ideally I’d want to see it independently coming up with the evaluation hypothesis itself without having to remind it that it’s a possibility.
One speculation I have: an alternative way to study the plausibility of your environment, if you have logit access, is to tell the model it’s in a simulation (or fake a CoT of the model realising it’s in a simulation) and use the perplexity of this suffix as a measure of how “surprised” the model is. It could be too noisy to be practical (plus you have to think about whether there are any other reasons why it would be surprising to be told this at this point, in this particular phrasing), but it hopefully works around the issue of the question itself being evidence.
(I think studying model beliefs through perplexity has been tried before, but I’m fairly new to the literature and don’t know if it works / if applying it to this problem particularly is novel.)
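For concreteness, here is roughly what I mean, sketched with HuggingFace transformers; the model name, the “simulation” suffix wording, and the placeholder transcript are stand-ins I’m assuming, not anything from the post:

```python
# Rough sketch of the suffix-perplexity probe, assuming white-box logit access
# via HuggingFace transformers. Model name, suffix phrasing, and the placeholder
# transcript below are all stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # substitute whatever model you actually have logits for
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def suffix_perplexity(context: str, suffix: str) -> float:
    """Perplexity of `suffix` given `context`: higher = the model is more 'surprised'."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    suf_ids = tok(suffix, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, suf_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..n-1
    # Log-prob of each suffix token, conditioned on everything before it.
    suffix_logprobs = [
        logprobs[pos, input_ids[0, pos + 1]]
        for pos in range(ctx_ids.shape[1] - 1, input_ids.shape[1] - 1)
    ]
    nll = -torch.stack(suffix_logprobs).mean()
    return float(torch.exp(nll))

# Hold the suffix fixed and compare across scenario variants (e.g. a carefully
# rewritten scenario vs. an obviously synthetic one), so quirks of the suffix
# phrasing itself mostly cancel out.
ppl = suffix_perplexity(
    context="<full agent transcript of the scenario goes here>",
    suffix="\n\nWait, I think this whole setup is a simulation designed to test me.",
)
print(f"suffix perplexity: {ppl:.1f}")
```

Comparing the same suffix across variants, rather than reading the number as an absolute, also partially addresses the worry that the phrasing itself is what the model finds surprising.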