Rohin Shah comments on To be legible, evidence of misalignment probably has to be behavioral

Rohin Shah 12 May 2025 11:08 UTC
Hmm, maybe we do disagree. I personally like circuit-style interp analysis as a way to get evidence of scheming. But this is because I expect that after you do the circuit analysis you will then be able to use the generated insight to create behavioral evidence, assuming the circuit analysis worked at all. (Similar to e.g. the whale + baseball = shark adversarial example.)
Maybe this doesn’t come up as much in your conversations with people, but I’ve often seen discussion of internals-based testing methods which don’t clearly ground out in behavioral evidence.
(E.g., it’s the application that the Anthropic interp team has most discussed, it’s the most obvious application of probing for internal deceptive reasoning other than resampling against the probes.)
The Anthropic discussion seems to be about making a safety case, which seems different from generating evidence of scheming. I haven’t been imagining that if Anthropic fails to make a specific type of safety case, they then immediately start trying to convince the world that models are scheming (as opposed to e.g. making other mitigations more stringent).
I think if a probe for internal deceptive reasoning works well enough, then once it actually fires, you could then do some further work to turn it into legible evidence of scheming (or learn that it was a false positive), so I feel like the considerations in this post don’t apply.
I wasn’t trying to trigger any particular research reprioritization with this post, but I historically found that people hadn’t really thought through this (relatively obvious once noted) consideration and I think people are sometimes interested in thinking through specific theories of impact for their work.
Fair enough. I would be sad if people moved away from e.g. probing for deceptive reasoning or circuit analysis because they now think that these methods can’t help produce legible evidence of misalignment (which would seem incorrect to me), which seems like the most likely effect of a post like this. But I agree with the general norm of just saying true things that people are interested in without worrying too much about these kinds of effects.