For auditing, mechanistic interpretability also needs to generate behavioral evidence (and I don’t think it can do so better than pragmatic techniques):
Because current mechanistic interpretability is not great (e.g. SAE neurons are labeled by looking at what they trigger on in the pretraining data, the best behavioral explanations we have are very lossy, etc.)
Because behavioral evidence is much easier to make legible
For non-auditing applications, such as blocking or resampling the riskiest actions using probes, or making the model less likely to be misaligned via weight steering, neither mechanistic interpretability nor behavioral evaluation is needed
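To make the probe-based blocking idea concrete, here is a minimal sketch of gating actions with a linear probe trained on model activations. Everything here is an assumption for illustration: the synthetic "activations", the logistic-regression probe, and the `should_block` threshold are stand-ins, not any lab's actual pipeline.

```python
# Hypothetical sketch: block/resample risky actions via a linear probe on activations.
# The synthetic data and threshold below are illustrative assumptions only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 16  # stand-in for activation dimensionality

# Synthetic stand-ins for activations on safe vs. risky trajectories.
safe_acts = rng.normal(0.0, 1.0, size=(200, d))
risky_acts = rng.normal(1.5, 1.0, size=(200, d))

X = np.vstack([safe_acts, risky_acts])
y = np.array([0] * 200 + [1] * 200)  # 0 = safe, 1 = risky

# The "probe" is just a logistic regression on activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)

def should_block(activation: np.ndarray, threshold: float = 0.9) -> bool:
    """Block (or resample) an action whose probe risk score exceeds the threshold."""
    risk = probe.predict_proba(activation[None, :])[0, 1]
    return bool(risk > threshold)
```

Note that nothing here requires a mechanistic explanation of *why* the probe direction fires, which is the point being made above: the intervention works (or fails) independently of interpretability-style understanding.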
You try to understand how well methods work using model organisms, then deploy them in prod without being able to check whether they will actually work there. (This is obviously scary, but probably reduces risk on the margin.)
Anthropic practically entrapped Claude into blackmailing someone, and a lot of mainstream news picked it up and reported it at face value. How do you escalate from that in the minds of a mainstream audience, in terms of behavioral evidence immediately legible to that audience? Have Claude set off a virtual nuke in a sandbox?
I think these are largely questions about the intended audience you’re imagining in various threat models. I’m usually thinking about “internal stakeholders at labs”, who are pretty averse to any kind of sensationalism.