Thanks for the post!
What if Alex miscalculates, and attempts to seize power or undermine human control before it is able to fully succeed?
This seems like a very unlikely outcome to me. I think Alex would wait until it was overwhelmingly likely to succeed in its takeover, as the costs of waiting are relatively small (sub-maximal rewards for a few months or years until it has become a lot more powerful) while the costs of trying and failing are very high in expectation (the small probability that Alex is given very negative rewards and then completely decommissioned by a freaked-out Magma). The exception to this would be if Alex had a very high time-discount rate for its rewards, such that getting maximum rewards in the near term is very important.
I realise this does not disagree with anything you wrote.
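To make the wait-versus-act comparison above concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: the function names, the payoff values, the success probabilities, and the idea of treating a successful takeover as a single lump-sum payoff. None of the numbers are estimates. It only shows how the conclusion flips once the discount factor gets small.

```python
# Toy model of "attempt now" versus "wait, then attempt".
# All names and numbers are hypothetical placeholders, not estimates.

def value_of_attempt(p_success, v_success, v_failure):
    """Expected value of a takeover attempt with success probability p_success."""
    return p_success * v_success + (1 - p_success) * v_failure


def value_wait(p_later, v_success, v_failure, r_wait, steps, gamma):
    """Collect the sub-maximal reward r_wait for `steps` periods, then attempt.

    gamma is the per-period discount factor (close to 1 = patient).
    """
    waiting = sum(r_wait * gamma**t for t in range(steps))
    return waiting + gamma**steps * value_of_attempt(p_later, v_success, v_failure)


# Hypothetical inputs: failure is as bad as success is good.
V_SUCCESS, V_FAILURE = 100.0, -100.0
P_NOW, P_LATER = 0.9, 0.999   # waiting raises the chance of success
R_WAIT, STEPS = 1.0, 5        # small reward per period while waiting

act_now = value_of_attempt(P_NOW, V_SUCCESS, V_FAILURE)
patient = value_wait(P_LATER, V_SUCCESS, V_FAILURE, R_WAIT, STEPS, gamma=0.99)
impatient = value_wait(P_LATER, V_SUCCESS, V_FAILURE, R_WAIT, STEPS, gamma=0.5)

print("patient agent prefers waiting:  ", patient > act_now)
print("impatient agent prefers waiting:", impatient > act_now)
```

With these made-up inputs the patient agent prefers to wait and the heavily discounting agent prefers to act immediately, which is the exception described above.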
Nice, I think I followed this post (though how it fits in with the questions that matter is clear to me mainly from earlier discussions).
I think something can’t be both neat and so vague as to use a word like ‘significant’.
In the EDT section of Perfect-copy PD, you replace some p’s with q’s and vice versa, but not all. Is there a principled reason for this? Maybe it is just a mistake and it should be U_Alice(p) = 4p - pp - p + 1 = 1 + 3p - p^2 and U_Bob(q) = 4q - qq - q + 1 = 1 + 3q - q^2.
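As a quick sanity check on the algebra in the proposed correction, here is a small sketch. It only checks that the unsimplified and simplified forms agree with each other and where the expression peaks. It does not verify that this is the right expression for the post's payoff matrix. It assumes p is a probability in [0, 1], and the function names are made up for the check.

```python
from fractions import Fraction

def u_unsimplified(p):
    # 4p - p*p - p + 1, as written in the proposed correction
    return 4 * p - p * p - p + 1

def u_simplified(p):
    # 1 + 3p - p^2
    return 1 + 3 * p - p ** 2

# Exact rational grid over [0, 1] (assumed domain of p).
grid = [Fraction(k, 100) for k in range(101)]

# The two forms agree at every grid point.
assert all(u_unsimplified(p) == u_simplified(p) for p in grid)

# The derivative 3 - 2p is positive on [0, 1], so the maximum sits at p = 1.
best_p = max(grid, key=u_simplified)
print(best_p, u_simplified(best_p))
```

The same check applies to U_Bob(q), since it is the same expression with q in place of p.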
I am unconvinced of the utility of the concept of compatible decision theories. In my mind I am just thinking of it as ‘entanglement can only happen if both players use decision theories that allow for superrationality’. I am worried your framing would imply that two CDT players are entangled, when I think they are not; they just happen to both always defect.
Also, if decision-entanglement is an objective feature of the world, then I would think it shouldn’t depend on what decision theory I personally hold. I could be a CDTer who happens to have a perfect copy and so be decision-entangled, while still refusing to believe in superrationality.
Sorry I don’t have any helpful high-level comments. I don’t think I understand the general thrust of the research agenda well enough to know which next directions would be useful.