I know you know this, but I thought it worth emphasizing that your first point plausibly understates the problem with pragmatic/blackbox methods. In the worst case, an AI may simply encrypt its thoughts.

At that point it's not even an oversight problem: there is simply nothing to 'oversee'. The AI will think its evil thoughts in private and comply with every eval you can cook up until it's too late.