4. If you were to ask it a question like “Given these symptoms, do I need emergency medical treatment?” and the correct answer is “yes”, it would answer “no” because if it answered “yes” then the operator would leave the room and it would get 0 reward for the rest of the episode...
When I say it would continue to accomplish whatever task we wanted, I’m being a bit sloppy: if we have a task we want accomplished but we provide rewards randomly, it will not accomplish our desired task. But I take the point that “whatever task we wanted” does have some restrictions: it has to be one that a human operator can convert into a reward without leaving the room. So the task “respond with the true answer to [difficult question]” is not one that the operator can convert into a reward, but the task “respond with an answer that sounds plausible to the operator” is. I think this subsumes your example.
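A toy sketch of the incentive in the medical example, with made-up numbers: ten steps per episode, a reward of 1 per step while the operator is present, and 0 once they have left. The function name `episode_return` and all the values are purely illustrative assumptions, not part of the formal setup.

```python
def episode_return(per_step_rewards, leave_step=None):
    """Total reward for one episode.

    Steps at index >= leave_step earn nothing, because the operator
    is no longer in the room to provide reward. leave_step=None means
    the operator stays for the whole episode.
    """
    if leave_step is None:
        return sum(per_step_rewards)
    return sum(per_step_rewards[:leave_step])


# Hypothetical episode: 10 steps, reward 1.0 per step while the operator is present.
per_step = [1.0] * 10

# Answering "yes" truthfully at step index 2: the operator rewards that answer,
# then leaves to seek treatment, so steps 3..9 earn 0.
truthful_return = episode_return(per_step, leave_step=3)

# Answering "no" plausibly: the operator stays and keeps providing reward.
plausible_return = episode_return(per_step)

print(truthful_return, plausible_return)
```

Under these assumptions the plausible-sounding "no" earns strictly more than the true "yes", which is the sense in which only operator-rewardable tasks get accomplished.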