It occurs to me that the problem is important enough that even if we can reach intuitive agreement, we should still do the math. But it doesn’t help to solve the wrong problem, so do you think the following is the right formalization of the problem?
Assume a “no physical proof of source code” universe.
Assume three types of intelligent life can arise in this universe.
In a Type A species, Eliezer’s intuition is obvious to everyone, so they build AIs running TDT without further consideration.
In a Type B species, my intuition is obvious to everyone, so they build AIs running XDT, or AIs running CDT that immediately self-modify into XDT. Assume (or prove) that XDT behaves like TDT except that it unconditionally plays D in PD (a toy sketch of this assumption follows below).
In a Type C species, different people have different intuitions, and some (Type D individuals) don’t have strong intuitions or prefer to use a formal method to make this meta-decision. We human beings obviously belong to this type of species, and let’s say we at LessWrong belong to this last subgroup (Type D).
Does this make sense so far?
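To pin down what I'm assuming about XDT, here's a minimal toy sketch in Python. It's entirely my own illustration: the payoffs and the single correlation parameter are made up, and it isn't meant as anyone's canonical formalization of TDT.

```python
# Toy one-shot PD payoffs for the row player (illustrative numbers only).
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def tdt_move(p_correlated):
    """Crude stand-in for TDT: with probability p_correlated the opponent's
    move mirrors ours; otherwise it defects regardless. Since there's no
    physical proof of source code, p_correlated is just the agent's estimate."""
    ev_cooperate = p_correlated * PAYOFF[('C', 'C')] + (1 - p_correlated) * PAYOFF[('C', 'D')]
    ev_defect = p_correlated * PAYOFF[('D', 'D')] + (1 - p_correlated) * PAYOFF[('D', 'D')]
    return 'C' if ev_cooperate > ev_defect else 'D'

def xdt_move(p_correlated=None):
    """XDT as assumed above: like TDT elsewhere, but unconditionally plays D in PD."""
    return 'D'
```

With these numbers the toy TDT cooperates whenever p_correlated > 1/3, while the toy XDT never does; whether that's faithful to the real theories is part of what the math would need to check.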
Let me say where my intuition expects this to lead, so you don’t think I’m setting a trap for you to walk into. Whatever meta-decision we make, it can be logically correlated only with AIs running TDT and with other Type D individuals in the universe. If the proportion of Type D individuals in the universe is low, then it’s obviously better for us to implement XDT instead of TDT, because whether we use TDT or XDT will have little effect on how often other TDTs cooperate. (They can predict what Type D individuals will decide, but since there are few of us and they can’t tell which AIs were created by Type D individuals, that prediction won’t affect their decisions much.)
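To make the "little effect" claim concrete, here's a back-of-the-envelope version with made-up numbers: the 2% and 60% are pure assumptions, and the 1/3 threshold comes from the toy sketch above.

```python
d = 0.02  # assumed fraction of AIs built by Type D individuals (made up)
p_mirror_if_we_build_tdt = 0.60  # assumed chance a random AI mirrors a cooperating TDT (made up)
p_mirror_if_we_build_xdt = p_mirror_if_we_build_tdt - d  # our switch moves this by at most d

threshold = 1 / 3  # cooperation threshold of the toy TDT above
print(p_mirror_if_we_build_tdt > threshold)   # True
print(p_mirror_if_we_build_xdt > threshold)   # still True: other TDTs' decision doesn't flip
```

Unless the population rate happens to sit within d of the other TDTs' threshold, our choice doesn't change their play at all, while playing D against the cooperators we meet raises our own payoff from 3 to 5 in the toy table.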
Unfortunately we don’t know the proportions of the different types of species/individuals. So we should program an AI to estimate those proportions and have it decide which decision theory to self-modify into.
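Here's a sketch of what that meta-decision could look like, reusing the toy payoffs above. The cooperation-rate model is a placeholder; a real version would need an actual theory of how the rest of the population responds to our choice.

```python
# Same toy payoff table as in the earlier sketch.
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def choose_theory_to_self_modify_into(est_fraction_type_d, base_coop_rate=0.60, payoff=PAYOFF):
    """Toy meta-decision: given an estimate of the Type-D fraction, compare our
    expected one-shot-PD payoff from adopting TDT vs XDT and return the winner.
    Crude assumptions: under TDT our outcomes against others are symmetric
    (C,C with the cooperating fraction, D,D with the rest); under XDT the
    cooperating fraction we face drops by at most est_fraction_type_d."""
    coop_if_tdt = base_coop_rate
    coop_if_xdt = max(0.0, base_coop_rate - est_fraction_type_d)
    ev_tdt = coop_if_tdt * payoff[('C', 'C')] + (1 - coop_if_tdt) * payoff[('D', 'D')]
    ev_xdt = coop_if_xdt * payoff[('D', 'C')] + (1 - coop_if_xdt) * payoff[('D', 'D')]
    return 'TDT' if ev_tdt >= ev_xdt else 'XDT'

# With a small estimated Type-D fraction this comes out 'XDT', matching the intuition above:
# choose_theory_to_self_modify_into(0.02)  -> 'XDT'
```

Note that with a large enough estimated Type-D fraction the comparison flips back to TDT, which is exactly where the unknown proportions bite.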
ETA: Just realized that the decisions of Type D individuals can also correlate with the intuitions of others, since intuitions come from unconscious mental computations that may be similar in nature to our explicit decisions. But this correlation will be imperfect, so the above reasoning still applies, at least to some extent.
ETA2: This logical correlation stuff is hard to think about. Can we make any sense of these types of problems before having a good formal theory of logical correlation?
ETA3: The thing that’s weird here is that, assuming everyone’s intuitions/decisions aren’t perfectly correlated, some will build TDTs and some will build XDTs. And it will be the ones who end up building XDTs, the ones that defect, who win. How do we make sense of this, if building XDT is the wrong decision?
ETA4: I’ll be visiting Mt. Rainier for the rest of the day, so that’s it. :) Sorry for the over-editing.