[Edit: this isn’t actually a spurious counterfactual.] The agent might reason: “If I two-box, then either it’s because I do something stupid (we can’t rule this out for Löbian reasons, but we should be able to assign it arbitrarily low probability), or, much more likely, the predictor’s reasoning is inconsistent. An inconsistent predictor would put $1M in box B no matter what my action is, so I can get $1,001,000 by two-boxing in this scenario. I am sufficiently confident in this model that my expected payoff conditional on me two-boxing is greater than $1M, whereas I can’t possibly get more than $1M if I one-box. Therefore I should two-box.” (Of course, this only happens if the predictor is implemented in such a way that it puts $1M in box B when it is inconsistent.) If the agent reasons this way, then it would be wrong to trust itself with high probability, but we’d want the agent to be able to trust itself with high probability without being wrong.
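The expected-payoff comparison in the agent's reasoning above can be made concrete with a small sketch. It assumes the standard Newcomb payoffs ($1,000 in box A, $1M in box B when filled) and, as a worst case that the argument above does not spell out, that box B is empty in the "I did something stupid" branch. The names `p_stupid`, `two_box_expected_payoff`, and `prefers_two_boxing` are hypothetical and used only for illustration.

```python
from fractions import Fraction

BOX_A = 1_000        # assumed: box A always holds $1,000
BOX_B = 1_000_000    # box B holds $1M when the predictor fills it
ONE_BOX_MAX = BOX_B  # one-boxing can never pay more than $1M


def two_box_expected_payoff(p_stupid):
    """Lower bound on the agent's expected payoff conditional on two-boxing.

    p_stupid is the agent's probability that it two-boxes because it did
    something stupid; the remaining mass goes to "the predictor is
    inconsistent", in which case box B is filled regardless of the action.
    """
    p_stupid = Fraction(p_stupid)
    p_inconsistent = 1 - p_stupid
    payoff_if_inconsistent = BOX_A + BOX_B  # $1,001,000
    payoff_if_stupid = BOX_A                # assumed worst case: box B empty
    return p_inconsistent * payoff_if_inconsistent + p_stupid * payoff_if_stupid


def prefers_two_boxing(p_stupid):
    """True if the agent's model says two-boxing beats the best one-box payoff."""
    return two_box_expected_payoff(p_stupid) > ONE_BOX_MAX
```

Under these assumptions the bound is $1,001,000 minus `p_stupid` times $1M, so the agent's model favors two-boxing exactly when `p_stupid` is below 1/1000. This is why being able to assign the "stupid" branch arbitrarily low probability is enough to push the agent toward two-boxing.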