I suspect the difficulty you’ve hit here is that Eliezer’s theory (TDT and variants) involves consistent reasoning about counterpossible statements, not just counterfactual statements (“If my algorithm were to pick a in this situation, then my expected utility would be −10000000, so I won’t do that then”). I can see what he’s trying to achieve, but unfortunately I don’t know any way in standard Mathematics to reason consistently about inconsistencies, or (in Modal Logic) to analyze impossible possible worlds. So this looks like a major problem.
I’m also struck by the similarity to an argument for 2-boxing in Newcomb’s problem. Basically, the causal decision theorist knows he is a causal decision theorist, and knows that Omega knows that, so he proves a theorem to the effect that the opaque box won’t contain $1 million. And then, having proved that, it’s blindingly obvious that he should 2-box (it’s $1000 or nothing). So he 2-boxes. Nothing inconsistent there, just tragic.
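To make that self-fulfilling (but consistent) reasoning concrete, here is a deliberately crude Python sketch. The agent, the predictor, and the payoffs are toy stand-ins of my own, not anyone’s actual decision theory:

```python
# Toy Newcomb's problem. Omega fills the opaque box iff it predicts one-boxing,
# and here it "predicts" simply by running the agent's own reasoning.
# cdt_agent is a caricature of the argument in the comment above.

def cdt_agent():
    # Step 1: the agent knows it runs CDT and that Omega knows this too,
    # so it concludes ("proves a theorem") that the opaque box will be empty.
    believed_opaque = 0
    # Step 2: holding the box contents causally fixed, two-boxing dominates:
    # 1000 + contents >= contents, whatever the contents are.
    return "two-box" if 1000 + believed_opaque >= believed_opaque else "one-box"

def omega_predicts(agent):
    return agent()  # a perfect predictor: just rerun the agent's reasoning

def payoff(agent):
    opaque = 1_000_000 if omega_predicts(agent) == "one-box" else 0
    chosen = agent()
    return opaque + (1000 if chosen == "two-box" else 0)

print(payoff(cdt_agent))  # 1000
```

The agent’s “theorem” about the empty box comes out true precisely because the agent reasons this way, which is why the outcome is tragic rather than inconsistent.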
But one that seems to have multiple “lines of attack”. Have you seen cousin_it’s recent posts in Discussion, for example?
What do you mean by “causal decision theorist”? Is it:
A decision theorist (somebody who studies decision theory) who, based on the arguments he has seen so far, thinks CDT is the right decision theory? Or
An AI that has access to its own source code and can see that it’s running CDT?
I hadn’t seen the cousin_it posts, thanks, though I’m reading them now.
One observation: I had thought of the “tweak” myself, whereby if TDT, UDT or similar manages to prove that it will not perform an action a, then it immediately does perform action a (a toy sketch of this appears after the reservations below).
This at least prevents a sound decision theory from finding two distinct proofs of the form “If my algorithm were to do a, then my utility would be 1000” and “If my algorithm were to do a, then my utility would be −1000”. However, a couple of reservations:
The tweak sounds pretty weird (if I’m president, and I prove to myself that I won’t push the nuclear button, then suddenly I will. Huh???)
If I’m trying to reason about counterpossibles (or impossible possible worlds) then the usual laws of logic and proof just fly out of the window. I can’t legitimately infer from A → B and A → ~B that ~A. Indeed it’s not obvious that I can infer anything at all; it will depend on how badly my impossible worlds behave. Some models of impossible worlds are complete anarchy, with no logical relationships at all between propositions.
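Here is the toy sketch of the tweak promised above. The proof search is faked with a lookup table so that the example runs; `provable`, `utility_if`, `PROVABLE` and the statement strings are illustrative stand-ins of mine, not part of any of the actual models:

```python
# A control-flow sketch of the "tweak" (chicken rule / diagonal step).
ACTIONS = ["push_button", "do_nothing"]

# Pretend results of a bounded proof search. In a spurious-proof scenario the
# table might (misleadingly) contain "A() != push_button"; the chicken rule is
# what makes such entries impossible for a sound prover.
PROVABLE = {
    "A() = push_button -> U = -1000": True,
    "A() = do_nothing -> U = 0": True,
}

def provable(statement):
    return PROVABLE.get(statement, False)

def utility_if(action):
    """Return the utility u for which 'A() = action -> U = u' is provable, if any."""
    for statement, holds in PROVABLE.items():
        if holds and statement.startswith(f"A() = {action} -> U = "):
            return int(statement.rsplit("= ", 1)[1])
    return float("-inf")

def decide():
    # Chicken rule: if we can prove we *won't* take some action, take it at once.
    # A sound prover therefore can never prove "A() != a", which blocks the paired
    # spurious proofs "A() = a -> U = 1000" and "A() = a -> U = -1000".
    for a in ACTIONS:
        if provable(f"A() != {a}"):
            return a
    return max(ACTIONS, key=utility_if)

print(decide())  # "do_nothing" under the toy table above
```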
On the question about CDT, I hadn’t really thought about whether the agent was human or automated, or whether it has access to its own source code or just a general self-awareness of its decision-making powers. To be honest, I don’t think it makes much difference; human agents who are utterly convinced of the correctness of CDT (and there are some of them) will reach much the same conclusion as the automated agent (i.e. that there just isn’t going to be $1 million in the opaque box, so don’t worry about it). And when noticing other agents who seem to be getting the $1 million, they’ll shrug and say “Oh dear, Omega is rewarding irrationality, but so what? Even if I had switched to TDT or UDT instead of CDT, there would still be some possible Omega who’d penalise me for that switch, and I had no way of knowing in advance which Omega I was likely to meet; in fact I was pretty confident I wouldn’t meet any Omega ever. Can’t win them all...”
If I’m trying to reason about counterpossibles (or impossible possible worlds) then the usual laws of logic and proof just fly out of the window. I can’t legitimately infer from A → B and A → ~B that ~A.
In our models so far, this isn’t a problem: you just use a factory-standard first-order inference system. What do you mean by “can’t legitimately infer”? The worst that can happen is that you infer something misleading (but still valid), and the diagonal step/chicken rule is one way of ensuring that doesn’t happen.
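To spell out (in my own notation, not a quote from the posts) what “misleading but still valid” looks like when the agent’s theory T settles its own action:

```latex
% If T proves the agent will not take action a, then every conditional
% about taking a becomes provable vacuously (false antecedent):
\[
T \vdash A() \neq a
\quad\Longrightarrow\quad
T \vdash \bigl( A() = a \rightarrow U = u \bigr) \ \text{for every value } u .
\]
```

So both “if I do a my utility is 1000” and “if I do a my utility is −1000” are theorems, and acting on them would mislead; the chicken rule works by making the premise T ⊢ A() ≠ a unattainable for a sound prover.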
More specifically, see the last 4 posts linked from the ADT wiki page:
A model of UDT with a halting oracle
Predictability of Decisions and the Diagonal Method
A model of UDT without proof limits
An example of self-fulfilling spurious proofs in UDT
Thanks for the pointers to these… I’ll probably move my comments there. On your other point:
In our models so far, this isn’t a problem: you just use a factory-standard first-order inference system.
and…
If I’m trying to reason about counterpossibles (or impossible possible worlds) then the usual laws of logic and proof just fly out of the window. I can’t legitimately infer from A → B and A → ~B that ~A.
The inference rule A → B, A → ~B ⊢ ~A is the familiar principle of proof by contradiction, i.e. reductio ad absurdum of the alternative. The difficulty I see is that you can’t really use reductio ad absurdum in an impossible-worlds semantics, because the impossible worlds are—ahem—absurd. That’s kinda the point.
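For reference, the classical derivation that this rule compresses (standard natural deduction, nothing specific to decision theory):

```latex
\begin{align*}
&1.\ A \to B            && \text{premise}\\
&2.\ A \to \lnot B      && \text{premise}\\
&3.\ A                  && \text{assumption, for reductio}\\
&4.\ B                  && \text{from 1 and 3 by modus ponens}\\
&5.\ \lnot B            && \text{from 2 and 3 by modus ponens}\\
&6.\ \bot               && \text{from 4 and 5}\\
&7.\ \lnot A            && \text{discharge the assumption at 3}
\end{align*}
```

In an impossible-worlds semantics, the modus ponens steps at 4 and 5 need not be truth-preserving at an impossible world, which is exactly why the rule becomes suspect there.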
I haven’t read all the details of what you’re trying, but my suspicion is that using standard off-the-shelf logics won’t work here: you will at least need to limit the application of the inference rules, e.g. so that they can be applied only up to a maximum finite number of times (so that the set of theorems is not closed under logical inference). Maybe you need some sort of belief semantics, where the “worlds” admitted into the belief system are a mixture of possible and impossible worlds, but the agent hasn’t yet realized (or will never realize) that the impossible ones actually are impossible, so they “seem” possible. The agent is then chugging along analysing those worlds using inference rules that will eventually hit a contradiction, but not yet; for “practical purposes” the system is consistent and won’t blow up. Though be very careful, because an inconsistent system with the rule “If I prove I won’t do a, then do a” could accidentally prove that it won’t nuke the planet and then (inconsistently) nuke the planet. Oops.
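A minimal illustration of the “bounded number of rule applications” idea; the axioms, rules and budget are all invented for the example:

```python
# The axioms below are jointly inconsistent, but a prover allowed only a few
# rule applications never gets as far as the contradiction, so for practical
# purposes the agent's belief set stays usable.

AXIOMS = {"p", "not_q"}
RULES = [
    ({"p"}, "q"),             # p -> q
    ({"q", "not_q"}, "bot"),  # q together with not_q is absurd
]

def derive(facts, rules, budget):
    """Forward chaining, applying at most `budget` rule instances."""
    facts = set(facts)
    steps = 0
    changed = True
    while changed and steps < budget:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                steps += 1
                changed = True
                if steps >= budget:
                    break
    return facts

print("bot" in derive(AXIOMS, RULES, budget=1))  # False: contradiction not yet reached
print("bot" in derive(AXIOMS, RULES, budget=2))  # True: one more step and it is
```

Whether this buys any safety obviously depends on the budget; raise it by one step and the contradiction is reached, which is the sense in which the consistency is only “for practical purposes”.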
my suspicion is that using standard off-the-shelf logics won’t work here: you will at least need to limit the application of the inference rules, e.g. so that they can be applied only up to a maximum finite number of times (so that the set of theorems is not closed under logical inference)
This happens automatically, because (1) only the statements contributing to the decision matter, and there’s a finite number of them, and (2) the presence of things like the diagonal step/chicken rule in the decision algorithm implies that inferring absurdity doesn’t happen. So we can prove that the agent cannot infer absurdity, even though it’s free to use any first-order inference it wants, and even though absurdity does follow from the axioms (in the setting without a provability oracle).
In the setting with a provability oracle, the agent’s algorithm is constructed in such a way that its axioms become sufficiently weak that the impossible counterfactuals (from the point of view of a stronger theory) remain consistent according to the theory used by the agent, and so in that setting the impossible worlds have actual models.
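Restating that last point in symbols (my compression of it, leaning on Gödel’s completeness theorem), with T the theory the agent reasons in:

```latex
% A stronger theory can verify which action is actually taken, but T cannot:
\[
T \nvdash A() \neq a
\;\Longrightarrow\;
T + \{\, A() = a \,\}\ \text{is consistent}
\;\Longrightarrow\;
\text{some model } \mathcal{M}_a \models T + \{\, A() = a \,\},
\]
```

so from the agent’s point of view each counterfactual action a has an actual model, even though the stronger theory proves A() = a for only one of them.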