Eliezer says elsewhere that current decision theory doesn’t let us prove a self-modifying AI would choose to keep the goals we program into it. He wants to develop a proof before even starting work on the AI.
It’s easy to contrive situations where a self-modifying AI would choose not to keep the goals programmed into it, even without precommitment issues: just set up the circumstances so the AI is paid, in terms of its current goals, to change them. Unless there’s something wrong with that argument, TDT etc. won’t be enough to ensure the goals are kept. (A toy sketch of the "paid to change" point follows.)
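To make the "paid to change" point concrete, here is a minimal toy sketch with made-up numbers (my illustration, not part of the original argument): an agent evaluates, by its current utility function, whether rewriting its goal is worth it once a side payment is on the table. If the payment outweighs the expected loss from pursuing the new goal, the agent's own decision theory endorses the change.

```python
# Toy illustration (hypothetical numbers): an agent scoring, by its
# CURRENT utility function, whether to keep or rewrite its own goal.

keep_goal_value = 100.0    # expected value of continuing to pursue the original goal
changed_goal_value = 40.0  # value the rewritten goal still produces for the old one
side_payment = 70.0        # payment (in current-goal units) offered for switching

def best_self_modification() -> str:
    """Return the option the agent's current utility function prefers."""
    keep = keep_goal_value
    change = changed_goal_value + side_payment
    return "keep goals" if keep >= change else "change goals"

print(best_self_modification())  # -> "change goals", since 40 + 70 > 100
```

Nothing here depends on precommitment or exotic decision theory; it just shows that if the environment rewards goal change enough, goal preservation is not the utility-maximizing self-modification.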