I know you know this, but I thought it worth emphasizing that your first point plausibly understates the problem with pragmatic/blackbox methods. In the worst case, an AI may simply encrypt its thoughts.

At that point it's not even an oversight problem: there is simply nothing to 'oversee'. The AI will think its evil thoughts in private and comply with every eval you can cook up until it's too late.