Reading this has me thinking about the features that SAEs isolated in Claude for ‘known figure’ and ‘unknown figure’ (Michael Jordan vs. Michael Batkin).
I wonder if it is possible to run a bunch of these tests where it occurs to Claude that it is being evaluated, use those runs to isolate “this is probably an evaluation” features via an SAE, and “clamp” them down the same way Anthropic did with the capital of Texas example. Or maybe that concept is too complex to isolate cleanly as features?
Either way, if it did lead to a reduction in spotting testing environments, it could be useful.
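To make the "clamp" idea concrete, here is a rough toy sketch of what forcing one SAE feature to a fixed value could look like. Everything in it is an assumption for illustration: the weights, the three-dimensional activation, the helper names `encode` and `clamp_feature`, and the index standing in for a "this is probably an evaluation" feature are all hypothetical, and this is not Anthropic's actual method or code.

```python
# Toy sketch only: weights, sizes, and EVAL_FEATURE are hypothetical.

def encode(x, w_enc, b_enc):
    """SAE encoder: ReLU(W_enc @ x + b_enc), one activation per feature."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w_enc, b_enc)]

def clamp_feature(x, w_enc, b_enc, w_dec, feature_idx, value=0.0):
    """Return activation x with one SAE feature forced to `value`.

    Only that feature's decoder direction is added or subtracted, so the
    part of x the SAE does not reconstruct is left untouched.
    """
    f = encode(x, w_enc, b_enc)
    delta = value - f[feature_idx]
    return [xi + delta * d for xi, d in zip(x, w_dec[feature_idx])]

# Two made-up features over a 3-dimensional activation.
W_ENC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B_ENC = [0.0, 0.0]
W_DEC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
EVAL_FEATURE = 0  # hypothetical "this is probably an evaluation" feature

x = [2.0, 0.5, -1.0]
steered = clamp_feature(x, W_ENC, B_ENC, W_DEC, EVAL_FEATURE, value=0.0)
print(encode(x, W_ENC, B_ENC), encode(steered, W_ENC, B_ENC))
```

In a real model this would run inside a forward hook on the layer the SAE was trained on, and whether a single feature captures something as broad as "I am being evaluated" is exactly the open question.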
Good idea, we’ve been thinking about this as well, especially since a model’s real internal thinking process might not be accurately reflected in its reasoning content.
Then we’ll need a “thought process tampering awareness” evaluation.
Yeah, but I think that’s probably where the recursion ends.