I expect that (1) is theoretically true, but false in practice in much the same way that “we can train an AI without any reference to any sort of misalignment in the training material” is false in practice. A superintelligent thought-experiment being can probably do either, but we probably can’t.
Along the same lines, I expect that (3) is not true. Bits of true information leak into fabricated structures of information in all sorts of ways, and definitively excluding them from something that may be smarter than you are is likely to cost a lot more than presenting true information (in time, effort, or literal money).
Consider that the AI may ask for evidence in a form that you cannot easily fabricate. E.g. it may have internal knowledge, from training or previous experience, about how some given external person communicates, and ask for that person to broker the deal. How sure are you that you can fabricate data that matches the AI’s model? If you are very sure, is that belief actually true? How much will it cost you if the AI detects that you are lying and secretly messes up your tasks? If you have to run many instances in parallel and/or roll back and retry many times with different training and experience to get one that doesn’t do anything like that, how much will that cost you in time and money? If you do get one that doesn’t ask such things, is it also less likely to perform as you wish?
These costs have to be weighed against the cost of actually going ahead with the deal.
(2) isn’t really a separate premise; it’s a restatement of (1).
(4) is pretty obviously false. You can’t just consider the AI’s behaviour; you also have to consider the behaviour of other actors in the system, including future AIs (possibly even this one!) that may find out about the deception or lack thereof.