However, there’s an important dynamic this model misses: when the detector becomes less effective, the model’s capabilities might also decline. For example, suppose a lie detector works because it leverages some important representations that help the model reason about its situation. A single gradient step could make the lie detector less reliable, but to do so it might need to distort some of those important internal representations, and as a result the model would become worse at, e.g., writing code. So it’s possible that reducing the effectiveness of the detector requires paying a capability tax.
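To make the intuition concrete, here is a minimal toy sketch (assuming PyTorch; the linear setup and every name and number are illustrative, not taken from anything above). A shared linear “representation” is first trained for a task, a lie-detector probe is then trained on top of it, and finally the representation is pushed in the direction that degrades the detector. Because the detector reads off the same representation the task uses, the evasion steps also raise the task loss:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n = 16, 512
X = torch.randn(n, d)
y_task = X @ torch.randn(d, 1)                      # regression targets (the "capability")
y_lie = (X[:, :1] > 0).float()                      # ground-truth "is it lying?" labels

W = torch.randn(d, d, requires_grad=True)           # shared internal representation
task_head = torch.randn(d, 1)                       # frozen task readout
det_head = torch.zeros(d, 1, requires_grad=True)    # lie-detector probe, trained below

def task_loss(W):
    return ((X @ W @ task_head - y_task) ** 2).mean()

def detector_loss(W, det_head):
    return F.binary_cross_entropy_with_logits(X @ W @ det_head, y_lie)

# 1. Train the shared representation for the task only.
opt = torch.optim.Adam([W], lr=1e-2)
for _ in range(2000):
    opt.zero_grad(); task_loss(W).backward(); opt.step()

# 2. Train the detector probe on the frozen representation.
opt = torch.optim.Adam([det_head], lr=1e-2)
for _ in range(2000):
    opt.zero_grad(); detector_loss(W.detach(), det_head).backward(); opt.step()

t0, d0 = task_loss(W).item(), detector_loss(W, det_head).item()

# 3. Gradient-ascent steps on the detector loss w.r.t. the *representation*:
#    the updates a model would need in order to make the detector unreliable.
W_evaded = W.detach().clone().requires_grad_(True)
opt = torch.optim.SGD([W_evaded], lr=1.0)
for _ in range(50):
    opt.zero_grad()
    (-detector_loss(W_evaded, det_head)).backward()  # ascend the detector loss
    opt.step()

print(f"detector loss: {d0:.4f} -> {detector_loss(W_evaded, det_head).item():.4f}  (detector degraded)")
print(f"task loss:     {t0:.4f} -> {t1 if (t1 := task_loss(W_evaded).item()) else t1:.4f}  (capability tax)")
```

In this toy version, the size of the tax depends on how entangled the detector’s signal is with task-relevant features: a detector that keys on representations the task doesn’t use would impose almost no tax, which is exactly the failure mode the argument above is worried about.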
An important implication of this result is that a good misalignment detector should have the following property: either a misaligned AI gets detected, or, if it can remain misaligned without detection, it ends up less capable than the hypothetical aligned AI would have been.
And this is why interpretability research is still useful even if we never get holy-grail interpretability before we face misalignment risk from AI, if we ever do: even without being able to explain everything going on inside the model, we can use what we do understand to build detectors that force AIs to pay a tax for undetectable misalignment.
A similar story holds for AI control, and it’s a big reason why I’m glad AI control is getting funded right now.