TFD comments on Interpretability Will Not Reliably Find Deceptive AI

TFD 5 May 2025 14:19 UTC
1 point
0
The conceptual reasoning is simple and compelling: a sufficiently sophisticated deceptive AI can say whatever we want to hear, perfectly mimicking aligned behavior externally. But faking its internal cognitive processes – its “thoughts” – seems much harder. Therefore, goes the argument, we must rely on interpretability to truly know if an AI is aligned.
I think an important consequence of this argument is that it potentially suggests that it is actually just straight-up impossible for black-box evaluations to address certain possible future scenarios (although of course that doesn’t mean everyone agrees with this conclusion). If a safe model and extremely dangerous model can both produce the exact same black-box results then those black-box results can’t distinguish between the two models^[1]. Its not just that there are challenges that would potentially require a lot of work to solve. For the interpretability comparison, its possible that the same is true (although I don’t think I really have an opinion on that), or that the challenges for interpretability are so large that we might characterize it as impossible for practical purposes. I think this is especially true for very ambitious versions of interpretability, which goes to your “high reliability” point.
However, I think we could use a different point of comparison which we might call white-box evaluations. I think your point about interpretability enhancing black-box evaluations gets at something like this. If you are using insights from “inside” the model to help your evaluations then in a sense the evaluation is no longer “black-box”, we are allowing the use of strictly more information. The relevance of the perspective implied by the quote above is that such approaches are not simply helpful but required in certain cases. Its possible these cases won’t actually occur, but I think its important to understand what is or is not required in various scenarios. If a certain approach implicitly assumes certain risks won’t occur, I think its a good idea to try to convert those implicit assumptions to explicit ones. Many proposals offered by major labs I think suffer from a lack of explicitness about what possible threats they wouldn’t address, vs arguments for why they would address certain other possibilities.
For example, one implication of the “white-box required” perspective seems to be that you can’t just advance various lines of research independently and incorporate them as they arrive, you might need certain areas to move together or at certain relative rates so that insights arrive in the order needed for them to contribute to safety.
As I see it we must either not create superintelligence, rely on pre-superintelligent automated researchers to find better methods, or deploy without fully reliable safeguards and roll the dice, and do as much as we can now to improve our odds.
Greater explicitness about limitations of various approaches would help with the analysis of these issues and with building some amount of consensus about what the ideal approach is.
1. ^
  Strictly speaking, if two models produce the exact same results on every possible input and we would categorize those results as “safe” then there is no problem, we’ve just said the model is proven safe. But practically there will be a distributional generalization issue where the results we have for evaluation don’t cover all possible inputs.