Hi. Found the site about a week ago. I read the TDT paper and was intrigued enough to start poring through Eliezer’s old posts. I’ve been working my way through the sequences and following backlinks. The material on rationality has helped me reconstruct my brain after a Halt, Melt and Catch Fire event. Good stuff.
I observe that comments on old posts are welcome, and I notice no one has yet come back to this post with the full formal solution for this dilemma since the publication of TDT. So here it is.
Whatever our opponent’s decision algorithm may be, it either depends to some degree on a prediction of our behavior or it does not. It can only rationally base its decision on a prediction of our behavior to the extent that it believes (a) we will attempt to predict its behavior, and (b) we will cooperate only to the extent that we believe it will cooperate. It is thus incentivized to cooperate to the extent that it believes we can and will successfully condition our behavior on its own. To the extent that it chooses independently of any prediction of our behavior, its only rational choice is to defect. Any other policy does worse than these two in every case, and the strategy below gains extra utility against such suboptimal play, as will become clear.
There are thus two unknown probabilities for us to condition on: the probability that the opponent will cooperate iff it believes we will cooperate, which I’ll call P(c), and the probability that the opponent will successfully predict our action, which I’ll call P(p).
We want to calculate the utility of cooperating, u(C), and the utility of defecting, u(D), for each relevant case. So we shut up and multiply.
If the opponent is uncooperative (~c), it always defects. Thus u(C|~c) = 0 and u(D|~c) = 1.
In cases where a potentially cooperative opponent successfully predicts our action, we have u(C|c,p) = 2 and u(D|c,p) = 1. When such an opponent guesses our action incorrectly, we have u(C|c,~p) = 0 and u(D|c,~p) = 3.
Thus we have:
u(C) = 2 P(c) P(p)
u(D) = P(~c) + P(c) P(p) + 3 P(c) P(~p) = 1 − P(c) + P(c) P(p) + 3 P(c) (1 − P(p)) = 1 + 2 P(c) − 2 P(c) P(p)
We consider the one-shot dilemma first. An intelligent opponent can be assumed to have behavioral predictive capabilities at least better than chance (P(p) > 0.5), perhaps approaching perfection (P(p) ~ 1) if it is a superintelligence. In the worst case (P(p) barely above 0.5), u(C) ~ P(c) and u(D) ~ 1 + P(c), and we should certainly defect. In the best case (P(p) ~ 1), u(C) ~ 2 P(c) and u(D) ~ 1, so we should defect if P(c) < 0.5; that is, if we assess that our opponent is even slightly more likely to automatically defect than to consider cooperation. If we have optimistic priors for both probabilities, from applicable previous experience or immediate observational cues, we may choose to cooperate. We plug in our numbers, and shut up and multiply.
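For concreteness, here’s a minimal Python sketch of that calculation (the function names and example numbers are mine):

```python
# Expected utilities from the cases above, with payoffs in 0/1/2/3 utilons.
# P_c: probability the opponent cooperates iff it believes we will.
# P_p: probability the opponent successfully predicts our action.

def u_cooperate(P_c, P_p):
    # Cooperation pays only against a cooperative opponent that
    # correctly predicts our cooperation: 2 P(c) P(p).
    return 2 * P_c * P_p

def u_defect(P_c, P_p):
    # P(~c) + P(c) P(p) + 3 P(c) P(~p), simplified as derived above.
    return 1 + 2 * P_c - 2 * P_c * P_p

def best_move(P_c, P_p):
    return "cooperate" if u_cooperate(P_c, P_p) > u_defect(P_c, P_p) else "defect"

print(best_move(0.9, 0.5))   # worst case: defect, however high P(c) is
print(best_move(0.9, 0.99))  # near-perfect predictor, likely cooperator: cooperate
print(best_move(0.4, 0.99))  # near-perfect predictor, but P(c) < 0.5: defect
```

If I’ve done the algebra right, u(C) > u(D) reduces to P(c) > 1 / (4 P(p) − 2), which means that for P(p) ≤ 0.75 no value of P(c) can justify cooperating.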
In the iterated case, we have the opportunity to observe our opponent’s behavior and update our priors as we go. We are incentivized to cooperate when we believe the opponent will cooperate, and to defect when we believe it will defect, or when we believe we can defect without being anticipated. Both players are incentivized to cooperate more often than they defect when each believes the other is good at predicting them. A player with a dominating edge in predictive capability can potentially do better than pure mutual cooperation against a weaker opponent through occasional strategic defections: the weaker player may find themselves incentivized not to punish the defector, realizing that any punishment would be anticipated and would cost them as many utilons as it costs the superior player. To the extent that the superior predictor can ascertain that its opponent is savvy enough to know when it is dominated, and would choose not to lose further utilons through vindictive play, such a strategy may be profitable.
Thus the spoils go to the algorithm with the best ability to predict an opponent. Skilled poker players or experts at “Rock-Paper-Scissors” could perform quite well in such contests against the average human. That could be fun to watch.
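To make that concrete, here’s a toy Python model (my own simplification, not a full treatment of the iterated game): the stronger player attempts a strategic defection, and the weaker player anticipates it, defecting in response, with probability equal to their predictive accuracy.

```python
import random

def exploit_value(acc_weak, rounds=100_000, seed=0):
    """Average payoff per attempted strategic defection for the stronger
    player: 3 utilons if the defection lands unanticipated, 1 if the
    weaker player sees it coming and defects in response."""
    rng = random.Random(seed)
    total = sum(1 if rng.random() < acc_weak else 3 for _ in range(rounds))
    return total / rounds

# Exploitation beats the mutual-cooperation payoff of 2 only when the
# weaker player anticipates the defection less than half the time:
print(exploit_value(0.3))  # ~2.4: strategic defection is profitable
print(exploit_value(0.7))  # ~1.6: worse than just cooperating
```

Under these assumptions, exploitation pays exactly when the weaker player’s predictive accuracy falls below 0.5, hence the spoils going to the better predictor.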
Nice analysis. One small tweak: I would precommit to being vindictive as hell if I believe I’m dominated by my opponent in modeling capability.
I can certainly empathize with that statement. And if my opponent is not only dominant in ability but exploiting that advantage to the point where I’m losing as much by submitting as I would by exacting punishment, that’s the tipping point where I start hitting back. Of course, I’d also attempt retaliatory behavior early on, while I was still unsure how dominated I was; but once I know the opponent is just that much better than me, and as long as they’re not abusing that advantage to the point where retaliation becomes cost-effective, I’d have to concede my opponent’s superiority, grit my teeth, bend over, and take one for the team. Especially at a ratio of a million human lives per util. With lives at stake, I shut up and multiply.
I meant that as a rational strategy—if my opponent can predict that I’ll cooperate until defected upon, at which point I will “tear off the steering wheel and drink a fifth of vodka,” and start playing defect-only, his optimal play will not involve strategically chosen defections.
You know, you’re right.
I was thrown off by the word “precommit”, which implies a reflectively inconsistent strategy, anathema to TDT. On the other hand, rational agents win, so having that strategy does make sense in this case, even though we would incur negative utility relative to playing submissively if we actually had to carry it out.
The solution, I think, is to be “the type of agent who would be ruthlessly vindictive against opponents who have enough predictive capability to see that I’m this type of agent, and enough strategic capability to accept that this means they gain nothing by defecting against me.” That makes it a reflectively consistent part of a decision theory, by keeping the negative-utility behavior in the realm of the pure counterfactual. As long as you know that having that strategy will effectively deter the other player, I think it can work.
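A minimal sketch of that agent type (my own naming, and a deliberately simple trigger condition):

```python
class VindictiveType:
    """Cooperate by default; once exploited while cooperating, defect
    forever. The strategy's value is counterfactual: an opponent with
    strong predictive capability reads this disposition off the agent
    and concludes that defecting gains it nothing."""

    def __init__(self):
        self.triggered = False

    def move(self):
        return "D" if self.triggered else "C"

    def observe(self, my_move, their_move):
        # Trip the trigger even though carrying out the punishment
        # costs us utility relative to playing submissively.
        if my_move == "C" and their_move == "D":
            self.triggered = True
```

Against a sufficiently good predictor, any defection forfeits the entire stream of future cooperation payoffs, so its optimal play is unbroken cooperation and the trigger never has to fire.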
And if not, or if I’ve made an error in some detail of my reasoning of how to make it work, I’m fairly confident at this point that an ideal TDT-agent could find a valid way to address the problem case in a reflectively consistent and strategically sound manner.