Typo: in the first full paragraph of page 2, I assume you mean the agent will one-box, not two-box.

Yes, thanks for the correction. I’d fix it, but I don’t think it’s possible to edit a pdf in google drive, and it’t not worth re-uploading and posting a new link for a typo.

And I’m not sure the final algorithm necessarily one-boxes even if the logical uncertainty engine thinks the predictor’s (stronger) axioms are probably consistent- I think there might be a spurious counterfactual where the conditional utilities view the agent two-boxing as evidence that the predictor’s axioms must be inconsistent. Is there a clean proof that the algorithm does the correct thing in this case?

I don’t have such a proof. I mentioned that as a possible concern at the end of the second-last paragraph of the section on the predictor having stronger logic and more computing power. Reconsidering though, this seems like a more serious concern than I initially imagined. It seems this will behave reasonably only when the agent does not trust itself too much, which would have terrible consequences for problems involving sequential decision-making.

Ideally, we’d want to replace the conditional expected value function with something of a more counterfactual nature to avoid these sorts of issues, but I don’t have a coherent way of specifying what that would even mean.

Yes, thanks for the correction. I’d fix it, but I don’t think it’s possible to edit a pdf in google drive, and it’t not worth re-uploading and posting a new link for a typo.

I don’t have such a proof. I mentioned that as a possible concern at the end of the second-last paragraph of the section on the predictor having stronger logic and more computing power. Reconsidering though, this seems like a more serious concern than I initially imagined. It seems this will behave reasonably only when the agent does not trust itself too much, which would have terrible consequences for problems involving sequential decision-making.

Ideally, we’d want to replace the conditional expected value function with something of a more counterfactual nature to avoid these sorts of issues, but I don’t have a coherent way of specifying what that would even mean.