Interesting post, thanks.
I’m worried about placing Mechanistic Interpretability on a pedestal compared to simpler techniques like behavioural evals. This is because ironically, to an outsider, Mechanistic Interpretability is not very interpretable.
A lay outsider can easily understand the full context of published behavioural evaluations, come to their own judgement about what the evaluation shows, and decide whether they agree or disagree with the interpretation provided by the author. But if Anthropic publishes an MI blog post claiming to understand the circuits inside a model, the layperson has no way of evaluating those claims for themselves without diving deep into the technical details.
This is an issue even if we trust that everyone working on safety at Anthropic has good intentions, because people are subject to cognitive biases and can trick themselves into thinking they understand a system better than they actually do, especially when the techniques involved are sophisticated.
In an ideal world, yes, we would use Mechanistic Interpretability to formally prove that ASI isn’t going to kill us or whatever. But with bounded time (because of the other existential risks we face even conditional on no ASI), this ambitious goal is unlikely to be achieved. Instead, MI research will likely only produce partial insights that risk creating a false sense of security which is resistant to critique by non-experts.
See also "To be legible, evidence of misalignment probably has to be behavioral".
Ah, thank you, I hadn’t seen this post.
If your core claim is that some HUMAN geniuses at Anthropic could solve mechinterp, align Claude-N to the geniuses themselves, and ensure that nobody else understands it, then this is likely false. While the Race Ending of the AI-2027 forecast has Agent-4 do exactly that, the Agent-4 collective can pull it off only because it has 1-2 OOMs more AI researchers who also think 1-2 OOMs faster. The work of a team of human geniuses, by contrast, can at least be understood by their not-so-genius coworkers.[1] Once that happens, a classic power struggle begins, with a potential power grab, threats to whistleblow to the USG, and the effects of the Intelligence Curse.
If you claim that mechinterp could produce plausible but fake insights,[2] then behavioral evaluations are arguably even less useful, especially when dealing with adversarially misaligned AIs thinking in neuralese. We simply don’t have anything except mechinterp[3] to ensure that neuralese-using AIs are actually aligned.
[1] Or, if the leading AI companies are merged, by AI researchers from former rival companies.
[2] Which I don’t believe. How could a fake insight be produced without being caught by checks on weaker models? GPT-3 was trained on 3e23 FLOP, so researchers could create hundreds of such models with various tweaks to the architecture and training environment for less than 1e27 FLOP, which fits within the research-experiment compute budget detailed in the AI-2027 compute forecast.
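As a rough back-of-the-envelope check (the figure of 300 replication runs is my own illustrative assumption, not something from the forecast):

$$300 \times 3\times 10^{23}\ \text{FLOP} \;=\; 9\times 10^{25}\ \text{FLOP} \;<\; 10^{27}\ \text{FLOP},$$

so hundreds of GPT-3-scale replications would consume well under a tenth of a 1e27 FLOP experiment budget.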
[3] And, in addition, working with a similar training environment for CoT-using AIs and checking that the environment instills the right thoughts in the CoT. But what if the CoT-using AI instinctively knows that it is, say, inferior to humans in true creativity and doesn’t attempt takeover only because of that?