Towards_Keeperhood comments on A Shutdown Problem Proposal

Towards_Keeperhood 22 Jan 2024 18:10 UTC
1 point
0
Thanks.
I briefly looked into the MIRI paper (and the section from Eliezer’s lecture that starts at 22min) again.
My main guess now is that you're not supposed to make the agent act on false beliefs: one of the desiderata of the shutdown problem is that the agent shouldn't incur a capability penalty, and that includes taking bad actions based on false beliefs (such as the bad bets jessicata mentioned). Your proposal violates that.
The approach MIRI considered wasn't to modify the world model so that the agent believes the button cannot be pressed, but to construct the utility function so that the resulting behavior ends up equivalent, without the agent holding wrong beliefs.
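To make the bad-bets point concrete, here is a toy sketch. It is not taken from the MIRI paper or from the proposal; the probability, the payoffs, and the function names are all made up for illustration. An agent whose world model says the button is never pressed accepts a bet that loses money in expectation, while an agent with accurate beliefs declines it.

```python
# Toy illustration only: the numbers and names are assumptions, not from the source.

def expected_value(p_press, payoff_if_pressed, payoff_if_not_pressed):
    """Expected payoff of a bet given a probability that the button is pressed."""
    return p_press * payoff_if_pressed + (1 - p_press) * payoff_if_not_pressed


def accepts_bet(believed_p_press, payoff_if_pressed, payoff_if_not_pressed):
    """An expected-value maximizer accepts iff the bet looks positive under its beliefs."""
    return expected_value(believed_p_press, payoff_if_pressed, payoff_if_not_pressed) > 0


TRUE_P_PRESS = 0.5           # assumed actual chance that the button gets pressed
LOSS_IF_PRESSED = -10.0      # assumed payoff of the bet if the button is pressed
GAIN_IF_NOT_PRESSED = 1.0    # assumed payoff of the bet if it is not pressed

# Agent whose world model was edited to say the button cannot be pressed.
false_belief_accepts = accepts_bet(0.0, LOSS_IF_PRESSED, GAIN_IF_NOT_PRESSED)

# Agent with accurate beliefs about the button.
accurate_accepts = accepts_bet(TRUE_P_PRESS, LOSS_IF_PRESSED, GAIN_IF_NOT_PRESSED)

# What the bet is actually worth in expectation.
true_value = expected_value(TRUE_P_PRESS, LOSS_IF_PRESSED, GAIN_IF_NOT_PRESSED)

print(false_belief_accepts, accurate_accepts, true_value)
```

With these made-up numbers the false-belief agent takes a bet that is negative in expectation, which is the kind of capability penalty I mean.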
What links here?
- Towards_Keeperhood's comment on Plan 1 and Plan 2 by Towards_Keeperhood (25 Oct 2025 22:51 UTC; 1 point)