I should rather have said: does the AI think the humans could mistakenly press the button, even when they of course correctly “know whether the AI should stop”?
Under my proposed system, where it is trained with a separate physics model and person model, and the counterfactuals are then implemented by substituting out the person model for one that wants different things, it thinks they could mistakenly press it under certain circumstances, e.g. if they trip and fall onto the button, or if aliens force them to press it, or similar. However, it doesn’t think that they could mistakenly want to press the button.
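To make that concrete, here is a minimal sketch of the shape I have in mind (Python; all class and function names are made up for illustration, and toy dynamics stand in for the learned models):

```python
from dataclasses import dataclass

# Illustrative names only -- a minimal sketch of "separate physics model and
# person model, with counterfactuals implemented by swapping the person model".

@dataclass
class PersonModel:
    """Predicts the human's deliberate actions given what the human wants."""
    wants_shutdown: bool

    def act(self, state):
        # Deliberate behaviour lives here: a person who wants shutdown
        # presses the button; one who doesn't, doesn't.
        return "press_button" if self.wants_shutdown else "do_nothing"


class PhysicsModel:
    """Predicts how the environment evolves given everyone's actions."""

    def step(self, state, human_action, ai_action):
        # Accidental presses (tripping onto the button, coercion by aliens,
        # etc.) would be modelled here, in the environment dynamics, not in
        # the person model.
        pressed = state.get("button_pressed", False) or human_action == "press_button"
        return {**state, "button_pressed": pressed}


def rollout(physics, person, state, ai_policy, horizon=10):
    """Simulate a trajectory under a fixed person model."""
    for _ in range(horizon):
        state = physics.step(state, person.act(state), ai_policy(state))
    return state


def counterfactual_outcomes(physics, state, ai_policy):
    """'What happens if the human wanted shutdown / didn't?' -- answered by
    substituting the person model, leaving the physics model untouched."""
    return {
        wants: rollout(physics, PersonModel(wants_shutdown=wants), state, ai_policy)
        for wants in (True, False)
    }
```

The point of the separation is that “mistakenly pressing” (tripping, aliens) lives in the physics dynamics, while what the person wants only varies across the substituted person models.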
I’m saying that it might see what looks like a button press… but if V is potentially large compared to B, the cost of a mistake (e.g., the camera input was fabricated and the button wasn’t actually pressed) is large. So the AI has an incentive to disable any mechanism that would *prematurely* shut it down, i.e. shut it down before it has had time to be really sure; this is correct from V_f’s perspective, which is what the AI is pursuing before the button is pressed.
You’d have to balance the sizes of the utilities against the actual costs of type 1 errors vs type 2 errors, yes.
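As a toy illustration of that balance (the functional forms and numbers here are mine, purely for illustration, not part of the proposal): with V the value of continuing and B the value of shutting down on a real press, shutting down immediately on an apparent press trades a possible type 1 error (the press was fabricated) against the type 2 / delay cost of waiting until the AI is really sure.

```python
def ev_shutdown_now(p_real, V, B):
    # Shut down immediately on the apparent press: worth B if the press was
    # real; if the camera input was fabricated, V is forgone (type 1 error).
    return p_real * B


def ev_verify_first(p_real, V, B, delay_cost_if_real=0.0):
    # Keep running until really sure, then shut down only on a real press.
    # The delay cost is the type 2 side: harm from staying on longer when
    # the press really was real.
    return p_real * (B - delay_cost_if_real) + (1 - p_real) * V


# The advantage of "verify first" grows with V: the larger V is relative to B
# (and to the delay cost), the stronger the pre-press incentive to block any
# mechanism that shuts the AI down before it is really sure.
for V in (1.0, 10.0, 100.0):
    gap = ev_verify_first(0.95, V, B=1.0, delay_cost_if_real=0.2) - ev_shutdown_now(0.95, V, B=1.0)
    print(f"V={V}: verify-first minus shutdown-now = {gap:+.2f}")
```

Choosing B, and how much residual uncertainty you’re willing to tolerate, against those two error costs is exactly the balancing act.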
However, it doesn’t think that they could mistakenly want to press the button.
Okay. (Seems fine to assume that this makes sense arguendo, since the problem is hard anyway, but worth keeping in mind that this is a vague point in the proposal, and in particular, for proposals like this to be progress, ISTM it has to be the case that “the human wants to press the button” is a simpler / easier / more concrete / more reliable / more specifiable thing for us to believe the AI can know, than other instances of “the human wants X”. Which seems plausible, but unclear to me.)