If your core claim is that some HUMAN geniuses at Anthropic could solve mechinterp, align Claude-N to the geniuses themselves, and ensure that nobody else understands what they did, then this is likely false. The Race Ending of the AI-2027 forecast does have Agent-4 pull this off, but the Agent-4 collective manages it only by having 1-2 OOMs more AI researchers who also think 1-2 OOMs faster. The work of a team of human geniuses, by contrast, can at least be understood by their not-so-genius coworkers.[1] And once the coworkers understand it, a classic power struggle begins, with potential power grabs, threats to whistleblow to the USG, and the effects of the Intelligence Curse.
If you instead claim that mechinterp could produce plausible but fake insights,[2] then behavioral evaluations are arguably even less useful, especially when dealing with adversarially misaligned AIs thinking in neuralese. We just don’t have anything but mechinterp[3] to ensure that neuralese-using AIs are actually aligned.
[1] Or, if the leading AI companies are merged, by AI researchers from former rival companies.
[2] Which I don’t believe. How could a fake insight be produced without being caught by checks on weaker models? GPT-3 was trained on about 3e23 FLOP, so researchers could build hundreds of GPT-3-scale models with various tweaks to the architecture and training environment for less than 1e27 FLOP, which fits within the research-experiment budget detailed in the AI-2027 compute forecast.
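A minimal back-of-envelope sketch of this arithmetic (the 3e23 figure is from the text above; the 1e27 FLOP experiment budget is an assumed round number for illustration):

```python
# Back-of-envelope check: how many GPT-3-scale training runs fit in a
# ~1e27 FLOP research-experiment budget.
GPT3_TRAINING_FLOP = 3e23       # approximate compute used to train GPT-3
RESEARCH_BUDGET_FLOP = 1e27     # assumed experiment budget for checks on weaker models

max_runs = RESEARCH_BUDGET_FLOP / GPT3_TRAINING_FLOP
print(f"GPT-3-scale runs affordable: ~{max_runs:,.0f}")  # ~3,333 runs

# Even a few hundred ablation runs stay well under the budget:
runs = 300
print(f"{runs} runs cost {runs * GPT3_TRAINING_FLOP:.1e} FLOP")  # 9.0e+25 FLOP
```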
[3] Aside from working with a similar training environment for CoT-using AIs and checking that the environment instills the right thoughts in the CoT. But what if a CoT-using AI instinctively knows that it is, say, inferior to humans in true creativity and refrains from attempting takeover only because of that?