johnswentworth comments on A Shutdown Problem Proposal

johnswentworth 3 Feb 2024 19:43 UTC
2 points
"The $π^{*} f_{c} g_{0}$ agent is indifferent between creating stoppable or unstoppable subagents, but the $π^{*} f_{c} g_{c}$ agent goes back to being corrigible in this way."
I think this is wrong? The $π^{*} f_{c} g_{0}$ agent is not indifferent: before the button is pressed, it actively prefers to create shutdown-resistant agents.
Intuitive reasoning: prior to button-press, that agent acts as though it’s an $R_{N}$ maximizer and expects to continue being an $R_{N}$ maximizer indefinitely. If it creates a successor which will shut down when the button is pressed, then it will typically expect that successor to perform worse under $R_{N}$ after the button is pressed than some other successor which does not shut down and instead just keeps optimizing $R_{N}$.
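To make the intuition concrete, here is a toy calculation. It is only a sketch under assumptions that are not part of the proposal's formal definitions: a fixed horizon, a constant $R_{N}$ reward per step while a successor is running, a single possible press time, and an agent that scores successors purely by expected $R_{N}$. The names `expected_rn`, `p_press`, `t_press`, `horizon`, and `reward_per_step` are all hypothetical.

```python
def expected_rn(
    shuts_down_on_press: bool,
    p_press: float,
    t_press: int,
    horizon: int,
    reward_per_step: float = 1.0,
) -> float:
    """Expected R_N return of a successor in a toy model (assumption, not the
    proposal's formalism): the scoring agent counts only R_N, in both the
    button-pressed and the button-not-pressed branch."""
    # R_N collected if the successor runs for the whole horizon.
    full_run = reward_per_step * horizon
    if shuts_down_on_press:
        # A stoppable successor stops collecting R_N once the button is pressed.
        pressed_branch = reward_per_step * t_press
    else:
        # An unstoppable successor ignores the button and keeps optimizing R_N.
        pressed_branch = full_run
    return p_press * pressed_branch + (1.0 - p_press) * full_run


# Illustrative numbers only: press probability 0.5, press at step 4 of 10.
stoppable = expected_rn(True, p_press=0.5, t_press=4, horizon=10)     # 7.0
unstoppable = expected_rn(False, p_press=0.5, t_press=4, horizon=10)  # 10.0
```

In this toy model the gap is `p_press * reward_per_step * (horizon - t_press)`, which is strictly positive whenever the press has nonzero probability and happens before the horizon. The two successors tie only if the button is never pressed, so an agent scoring by $R_{N}$ alone would strictly prefer the unstoppable one rather than be indifferent.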
Either I’m missing something very major in the definitions, or that argument works and therefore the agent will typically (prior to button-press) prefer successors which don’t shut down.