I guess I mean cheating purely as “I don’t think this applies to the Toy Model setting”, as opposed to saying it’s not a potentially valuable loss to study.
For p=1.0, I forgot that each of the noise features is random between 0 and 0.1, rather than fixed magnitude. The reason I brought it up is that if they were all fixed at magnitude 0.05, they would cancel out, leaving a net contribution pointing in the opposite direction to the target feature with magnitude 0.05. Now that I’ve reread the setting, though, I don’t think that’s relevant.
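(For concreteness, here’s a minimal sketch of the cancellation I had in mind, assuming the features are embedded as n equally spaced unit vectors in 2D; that geometry is my assumption for illustration and may not match the actual Toy Model setup.)

```python
import numpy as np

# Assumed geometry: n equally spaced unit vectors in 2D (hypothetical, for illustration).
n = 8
angles = 2 * np.pi * np.arange(n) / n
features = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (n, 2)

target = features[0]
noise_dirs = features[1:]

# Fixed magnitude 0.05: equally spaced directions sum to zero, so the noise
# contributions sum to -0.05 * target, i.e. they point opposite the target.
fixed_noise = (0.05 * noise_dirs).sum(axis=0)
print(fixed_noise, -0.05 * target)  # approximately equal

# Random magnitudes in [0, 0.1]: the exact cancellation no longer holds.
rng = np.random.default_rng(0)
random_noise = (rng.uniform(0, 0.1, size=(n - 1, 1)) * noise_dirs).sum(axis=0)
print(random_noise)  # generally not a clean multiple of -target
```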
Interesting post, thanks.
I’m worried about placing Mechanistic Interpretability on a pedestal compared to simpler techniques like behavioural evals. This is because, ironically, to an outsider Mechanistic Interpretability is not very interpretable.
A lay outsider can easily understand the full context of published behavioural evaluations, come to their own judgement about what the evaluations show, and decide whether they agree or disagree with the interpretation the author provides. But if Anthropic publishes an MI blog post claiming to understand the circuits inside a model, the layperson has no way of evaluating those claims for themselves without diving deep into the technical details.
This is an issue even if we trust that everyone working on safety at Anthropic has good intentions, because people are subject to cognitive biases and can trick themselves into thinking they understand a system better than they actually do, especially when the techniques involved are sophisticated.
In an ideal world, yes, we would use Mechanistic Interpretability to formally prove that ASI isn’t going to kill us or whatever. But with bounded time (because of other existential risks, conditional on no ASI), that ambitious goal is unlikely to be achieved. Instead, MI research will likely only produce partial insights that risk creating a false sense of security which is resistant to critique by non-experts.