The first thing I imagine is that nobody asks those questions. But let’s set that aside.
This seems unlikely to me. I.e., I expect people to ask these questions. It would be nice to see the version of the OP that takes this most seriously, i.e., one that expects people to make a non-naive safety effort (trying to prevent AI takeover) focused on scalable oversight as the primary method. Because right now it’s hard to disentangle your strong arguments against scalable oversight from weak arguments against straw scalable oversight.
Ok, let’s try to disentangle a bit. There are roughly three separate failure modes involved here:
1. Nobody asks things like “If we take the action you just proposed, will we be happy with the outcome?” in the first place (mainly because organizations of >10 people are dysfunctional by default).
2. The AI wasn’t trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language, because humans have no clue how to train such a thing.
3. (Thing closest to what the OP was about:) Humans do not have any idea what questions they need to ask. Nor do humans have any idea how to operationalize “what questions should I ask?” such that the AI will correctly answer it, because that would itself require knowing which questions to ask while overseeing the AI thinking about which questions we need to ask.
Zooming in on the last bullet in more detail (because that’s the one closest to the OP): one of Buck’s proposed questions upthread was “If we take the action you just proposed, will we be happy with the outcome?”. That question leaves the door wide open for the action to have effects which the humans will not notice, but would be unhappy about if they did. If the overseers never ask about action-effects which the humans will not notice, then the AI has no particular reason to think about deceiving the humans about such actions; the AI just takes such actions without worrying about what humans will think of them at all.
(This is pretty closely analogous to e.g. my example with the protesters: the protesters just don’t really notice the actually-important actions I take, so I mostly just ignore the protesters for planning purposes.)
Now, it’s totally reasonable to say “but that’s just one random question Buck made up on the spot, obviously in practice we’ll put a lot more effort into it”. The problem is, when overseeing plans made by things smarter than ourselves, there will by very strong default be questions we don’t think to ask. Sure, we may catch the particular problem I just highlighted with the particular question, but what about the problems which we don’t think of? When there’s an intelligence differential even just as large as an IQ −2 sd vs IQ +2 sd human, the lower intelligence agent usually just does not know what the actually-important parts are to pay attention to. And we can’t get the AI to tell us what the actually-important parts are in an overseeable way without already knowing what we need to pay attention to when it’s answering that question.