You seem to be assuming that a system's ability to find out whether a security assumption is false affects whether the falsity of that assumption has a bad effect. That is clearly the case for some assumptions ("This AI box I am using is inescapable"), but it doesn't seem immediately obvious to me that it holds in general.
Generally speaking, a system can have bad effects if made under bad assumptions (think a nuclear reactor or aircraft control system) even if it doesn’t understand what it’s doing. Perhaps that’s less likely for AI, of course.
And on the other hand, an intelligent system could be aware that an assumption will break down in circumstances that haven't yet arrived, and still do nothing about it (or even tell humans about it).
My point is that the key assumptions are about the learning system and its properties. E.g., "if we train a system to predict human text, it will learn human cognition", or "if we train a system on the myopic task of predicting text, it won't learn long-term consequentialist planning".