However, it doesn’t think that they could mistakenly want to press the button.
Okay. (Seems fine to assume that this makes sense arguendo, since the problem is hard anyway, but worth keeping in mind that this is a vague point in the proposal. In particular, for proposals like this to be progress, ISTM it has to be the case that “the human wants to press the button” is a simpler / easier / more concrete / more reliable / more specifiable thing for us to believe the AI can know than other instances of “the human wants X”. Which seems plausible, but unclear to me.)