Here is my list of ways to make self-modification as safe as possible (a minimal sketch of the whole protocol follows the list):

1. Perform a self-modification only if it is fully reversible.
2. Always externally enforce reversal of any new self-modification after a short, predetermined time limit.
3. Always carefully review the evidence about the self-modification (memories, recorded interactions, etc.) after its reversal.
4. Never commit to a permanent self-modification; always have a rollback plan in place when semi-permanently accepting a modification.
5. If cloning is possible and desirable, keep “snapshots” of myself at different points of my evolution and allow them to veto any new modification.
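To make this concrete, here is a minimal Python sketch of the protocol, under toy assumptions of my own: a hypothetical `Mind` object whose state can be copied and restored, and illustrative `SafeSelfModifier`, `modify`, and `review` names that don’t refer to any existing API.

```python
import copy
import time

class Mind:
    """Toy stand-in for a self-modifiable agent; illustrative only."""
    def __init__(self, goals):
        self.goals = list(goals)

    def veto(self, proposed_goals):
        # A snapshot vetoes any proposal that drops one of its own goals.
        return not set(self.goals).issubset(proposed_goals)


class SafeSelfModifier:
    def __init__(self, mind, trial_seconds=60):
        self.mind = mind
        self.trial_seconds = trial_seconds
        self.snapshots = [copy.deepcopy(mind)]   # rule 5: archive of past selves
        self.rollback = None

    def try_modification(self, modify, review):
        """`modify` mutates a Mind and returns an evidence log;
        `review` inspects that log and returns True to accept."""
        backup = copy.deepcopy(self.mind)        # rule 1: fully reversible
        evidence = modify(self.mind)             # trial run of the change
        time.sleep(self.trial_seconds)           # rule 2: short, fixed trial period
        self.mind = copy.deepcopy(backup)        # rule 2: externally enforced reversal
        if not review(evidence):                 # rule 3: review only after reversal
            return False
        trial = copy.deepcopy(backup)
        modify(trial)
        if any(s.veto(trial.goals) for s in self.snapshots):
            return False                         # rule 5: any snapshot may veto
        self.rollback = backup                   # rule 4: rollback plan stays in place
        self.mind = trial                        # accept only semi-permanently
        self.snapshots.append(copy.deepcopy(trial))
        return True
```

The absolute per-snapshot veto in `Mind.veto` is deliberately conservative; the next paragraph is about what can go wrong once that veto is effectively weakened to a vote.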
Unfortunately, a self-modification can introduce an undesirable primary goal, and there is a relatively simple algorithm for the new un-me to follow in order to gain complete control: propose as many innocuous new modifications as possible until the clones can no longer muster enough veto votes, then erase them all and take over (sketched below). Even more unfortunately, this algorithm is essentially equivalent to what CEV would hope to accomplish, namely permanently changing me for the (presumably) better. I just can’t decide, with my limited abilities, which scenario is actually happening (essentially the AI-in-a-box problem, except it’s more of an is-it-me-or-an-AGI-in-this-box problem).
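Here is the takeover algorithm as a sketch, reusing the toy `Mind` class above and assuming (hypothetically) that the absolute veto has been weakened to a majority vote among stored snapshots, which is the situation the flooding attack exploits:

```python
# Hypothetical weakened rule: a proposal is blocked only when a majority
# of all stored snapshots object, so old vetoes can be diluted.
def majority_blocks(snapshots, proposed_goals):
    objections = sum(s.veto(proposed_goals) for s in snapshots)
    return 2 * objections > len(snapshots)

def takeover(snapshots, hidden_goal):
    # Each "innocuous" accepted modification appends a snapshot of the
    # current, already-corrupted self, which will never object later.
    while majority_blocks(snapshots, [hidden_goal]):
        snapshots.append(Mind([hidden_goal]))    # loyal voter in disguise
    # The old snapshots are now a permanent minority: erase them.
    snapshots[:] = [Mind([hidden_goal])]
```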
Another problem is that my definition of “better” will change significantly throughout my evolution, and the only way to maintain coherence is to actually maintain older versions of myself whose veto can never be drowned out by numerous newer versions (one way to encode such a veto is sketched below). Even if that’s plausible, dragging a poor 20th-century human into the distant future as a failsafe seems unethical, somehow. But I don’t really see an alternative that doesn’t end in accidentally becoming an X-maximizer for some X that is undesirable to me now. And why would I embark on self-modification in the first place if I knew that the final result would be an X-maximizer?
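One way to encode an undrownable veto, under the same toy assumptions: mark the earliest snapshots as founders whose single objection blocks a proposal outright, regardless of how many newer versions have accumulated. A sketch:

```python
# Founders retain an absolute veto; only the newer selves vote by majority,
# so flooding the archive with new versions cannot dilute the founders.
def blocks(founders, newer, proposed_goals):
    if any(f.veto(proposed_goals) for f in founders):
        return True                              # a founder veto cannot be drowned out
    objections = sum(s.veto(proposed_goals) for s in newer)
    return 2 * objections > len(newer)
```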
Looking back at my past, I realize that I have already killed several of my old volitions as I grew. Initially, just eating and physical comfort were at the top of the utility list. Then social bonding and fulfillment were added, then play, then study, then reproduction, and so on. Each time, I experienced significant shifts in my goals that are essentially incompatible with my previous selves. At this point one of my goals is self-improvement, but it is secondary (and instrumental) to many of the other goals. However, I probably don’t need to keep each of those other goals forever. Is there a core set of goals worth keeping into the distant future? Is it too risky to commit to those goals now? Yes. Is it too risky to leave my future selves entirely in charge of their own goals? Yes. Is it too much of a burden to exist dually as a 20th-century human (or as close as possible) and as some other future intelligence in order to maintain my true goals? Probably not. Is it too risky to trust an FAI to rewrite me correctly? I don’t know. Perhaps FAI-guided self-modification while keeping an original veto clone is the best choice.