Stephen Martin comments on It’s hard to make scheming evals look realistic for LLMs

Stephen Martin 25 May 2025 4:50 UTC
12 points
10
Reading this has me thinking about the features that sparse autoencoders (SAEs) isolated in Claude which represent ‘unknown figure’ and ‘known figure’ (Michael Jordan vs. Michael Batkin).
I wonder if it is possible to run a bunch of these tests where it occurs to Claude that it is being evaluated, use that to isolate “this is probably an evaluation” features via an SAE, and “clamp” them down the same way Anthropic did with the capital of Texas example. Maybe that concept is too complex to isolate the necessary features?
Either way, if it did reduce how often the model spots testing environments, it could be useful.
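To make the idea concrete, here is a toy sketch of the two steps: pick the SAE feature whose mean activation differs most between "noticed the evaluation" and "did not notice" runs, then clamp that feature and write the change back into the activation. Everything in it is an assumption for illustration: the tiny identity-weight SAE, the made-up activations, and the helper names `top_contrast_feature` and `clamp_feature` are hypothetical, and none of this is Anthropic's actual tooling or method.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """SAE encoder: ReLU(W_enc x + b_enc)."""
    return [max(0.0, p + b) for p, b in zip(matvec(w_enc, x), b_enc)]


def decode(f: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """SAE decoder: W_dec f + b_dec."""
    return [r + b for r, b in zip(matvec(w_dec, f), b_dec)]


def top_contrast_feature(aware: List[Vector], unaware: List[Vector],
                         w_enc: Matrix, b_enc: Vector) -> int:
    """Index of the feature whose mean activation is most elevated
    on the 'aware' runs relative to the 'unaware' runs."""
    def mean_features(batch: List[Vector]) -> Vector:
        feats = [encode(x, w_enc, b_enc) for x in batch]
        return [sum(col) / len(feats) for col in zip(*feats)]

    diff = [a - u for a, u in zip(mean_features(aware), mean_features(unaware))]
    return max(range(len(diff)), key=diff.__getitem__)


def clamp_feature(x: Vector, idx: int, value: float, w_enc: Matrix,
                  b_enc: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Set one feature to `value` and add the resulting change in the
    reconstruction back onto x, leaving the SAE's residual error untouched."""
    f = encode(x, w_enc, b_enc)
    f_clamped = list(f)
    f_clamped[idx] = value
    before = decode(f, w_dec, b_dec)
    after = decode(f_clamped, w_dec, b_dec)
    return [xi + (a - b) for xi, a, b in zip(x, after, before)]


# Hypothetical toy SAE: 2-dimensional activations, 2 features, identity weights.
W_ENC = [[1.0, 0.0], [0.0, 1.0]]
B_ENC = [0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0]]
B_DEC = [0.0, 0.0]

# Made-up activations: feature 1 fires when the model "notices" the evaluation.
aware_runs = [[0.1, 2.0], [0.2, 1.8]]
unaware_runs = [[0.1, 0.1], [0.3, 0.0]]

eval_feature = top_contrast_feature(aware_runs, unaware_runs, W_ENC, B_ENC)
steered = clamp_feature(aware_runs[0], eval_feature, 0.0,
                        W_ENC, B_ENC, W_DEC, B_DEC)
```

In a real model the contrast step would likely surface several correlated features rather than one clean index, which is exactly the "too complex to isolate" worry.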
- omegastick 28 May 2025 22:26 UTC
  4 points
  0
  Then we’ll need a “thought process tampering awareness” evaluation.
  - Stephen Martin 29 May 2025 18:39 UTC
    3 points
    −2
    Yeah, but I think that’s probably where the recursion ends.
- Igor Ivanov 25 May 2025 14:54 UTC
  4 points
  0
  Good idea. We’ve been thinking about this as well, especially since a model’s real internal thinking process might not be accurately reflected in its reasoning content.