This is a masterpiece. Not only is it funny, it makes a genuinely important philosophical point. What good are our fancy decision theories if asking Claude is a better fit to our intuitions? Asking Claude is a perfectly rigorous and well-defined DT, it just happens to be less elegant/simple than the others. But how much do we care about elegance/simplicity?
Not entirely sure how serious you’re being, but I want to point out that my intuition for PD is not “cooperate unconditionally”, and for logical commitment races is not “never do it”; I’m confused about logical counterfactual mugging; and I think we probably want to design AIs that would choose Left in The Bomb.
If logical counterfactual mugging is formalized as “Omega looks at whether we’d pay if in the causal graph the knowledge of the digit of pi and its downstream consequences were edited” (or “if we were told the wrong answer and didn’t check it”), then I think we should obviously pay and don’t understand the confusion.
(Also, yes, Left and Die in the bomb.)
“Omega looks at whether we’d pay if in the causal graph the knowledge of the digit of pi and its downstream consequences were edited”
Can you formalize this? In other words, do you have an algorithm for translating an arbitrary mind into a causal graph and then asking this question? Can you try it out on some simple minds, like GPT-2?
I suspect there may not be a simple/elegant/unique way of doing this, in which case the answer to the decision problem depends on the details of how exactly Omega is doing it. E.g., maybe all such algorithms are messy/heuristics based, and it makes sense to think a bit about whether you can trick the specific algorithm into giving a “wrong prediction” (in quotes because it’s not clear exactly what right and wrong even mean in this context) that benefits you, or maybe you have to self-modify into something Omega’s algorithm can recognize / work with, and it’s a messy cost-benefit analysis of whether this is worth doing, etc.
I agree: it depends on what exactly Omega is doing. I can’t/haven’t tried to formalize this; this is more of a normative claim. But I imagine a vibes-based approach is to add a set of current beliefs about logic/maths, or an external oracle, to the inputs of FDT (or somehow feed beliefs about maths into GPT-2), and in the situation where the input is “digit #3 of pi is odd” and FDT knows the digit is not adversarially selected, it knows it might currently be in the process of determining its outputs for a world that doesn’t exist/won’t happen.
What exactly Omega is doing maybe changes the point at which you stop updating (i.e., maybe Omega edits all of your memory so you remember that pi has always started with 3.15 and makes everything that would normally cause you to believe that 2+2=4 cause you to believe that 2+2=3), but I imagine for the simple case of being told “if the digit #3 of pi is even, if I predicted that you’d give me $1 if it’s odd, I’d give you $10^100. Let me look it up now (I’ve not accessed it before!). It’s… 5”, you are updateful up to the moment when Omega says what the digit is, because this is where the divergence starts, and you simply pay.
There was a math paper which tried to study logical causation, and claimed “we can imbue the impossible worlds with a sufficiently rich structure so that there are all kinds of inconsistent mathematical structures (which are more or less inconsistent, depending on how many contradictions they feature).”
In the end, they didn’t find a way to formalize logical causality, and I suspect it cannot be formalized.
Logical counterfactuals behave badly because “deductive explosion” allows a single contradiction to prove and disprove every possible statement!
However, “deductive explosion” does not occur for a UDT agent trying to reason about logical counterfactuals where it outputs something different from what it actually outputs.
This is because a computation cannot prove its own output.
Why a computation cannot prove its own output
If a computation could prove its own output, it could be programmed to output the opposite of what it proves it will output, which is paradoxical.
This paradox doesn’t occur because a computation trying to prove its own output (and give the opposite output) will have to simulate itself. The simulation of itself starts another nested simulation of itself, creating an infinite recursion which never ends (the computation crashes before it can give any output).
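As a quick toy illustration of that infinite regress, here is a sketch of my own in Python (predict_self is a hypothetical stand-in for “compute/prove my own output”, not anything from the original comment):

```python
# Toy sketch: a program that tries to determine its own output by simulating itself,
# intending to output the opposite. The self-simulation never bottoms out, so the
# program crashes before it can output anything, and no paradox arises.

def predict_self() -> bool:
    predicted = predict_self()  # simulate myself... which simulates itself, and so on
    return not predicted        # never reached: the recursion never terminates

if __name__ == "__main__":
    try:
        predict_self()
    except RecursionError:
        print("No output: the self-simulation recursed forever.")
```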
A computation’s output is logically downstream of it. The computation is not allowed to prove logical facts downstream from itself but it is allowed to decide logical facts downstream of itself.
Therefore, very conveniently (and elegantly?), it avoids the “deductive explosion” problem.
It’s almost as if… logic… deliberately conspired to make UDT feasible...?!
Yeah, from the claim that pi starts with two you can easily prove anything. But I think:
(1) something like logical induction should somewhat help: maybe the agent doesn’t know whether some statement is true and isn’t going to run for long enough to start encountering contradictions.
(2) Omega can also maybe intervene on the agent’s experience/knowledge of more accessible logical statements while leaving other things intact, sort of like making you experience the kind of evidence Eliezer describes here as what would convince him that 2+2=3: https://www.lesswrong.com/posts/6FmqiAgS8h4EJm86s/how-to-convince-me-that-2-2-3, and if that’s what it is doing, we should basically ignore our knowledge of maths for the purpose of thinking about logical counterfactuals.
I was thinking that deductive explosion occurs for logical counterfactuals encountered during counterfactual mugging, but doesn’t occur for logical counterfactuals encountered when a UDT agent merely considers what would happen if it outputs something else (as a logical computation).
I agree that logical counterfactual mugging can work, just that it probably can’t be formalized, and may have an inevitable degree of subjectivity to it.
Coincidentally, just a few days ago I wrote a post on how we can use logical counterfactual mugging to convince a misaligned superintelligence to give humans just a little, even if it observes the logical information that humans lose control every time (and therefore have nothing to trade with it), unless math and logic itself were different. :) Leave a comment there if you have time; in my opinion it’s more interesting and concrete.
(MIRI did some work on logical induction.)
I’ll give the post a read!
This paradox doesn’t occur because a computation trying to prove its own output (and give the opposite output) will have to simulate itself
By Löb’s theorem, if a computation knows that whenever it finds a proof that it outputs A it will indeed output A, then it (provably) outputs A, with no need for recursion. This is why you really shouldn’t output something just because you’ve proved that you will.
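For reference, here is the standard Löb step being invoked, written out as a sketch (T stands for the agent’s proof system and P abbreviates “the computation outputs A”; the notation is mine, not the commenter’s):

```latex
% Löb's theorem: for any sentence P, if T proves Prov_T('P') -> P, then T proves P.
% Take P to be "the computation outputs A". If the computation's code guarantees
% "whenever I find a proof that I output A, I output A" (and T can verify this about
% the code), the hypothesis of Löb's theorem holds, so T proves that the computation
% outputs A -- and a proof search will find this without any self-simulation.
\[
  T \vdash \bigl(\mathrm{Prov}_T(\ulcorner P \urcorner) \rightarrow P\bigr)
  \;\Longrightarrow\;
  T \vdash P
\]
```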
In the Bomb example, CDT supposedly picks the right box, despite Omega’s prediction. I think the bomb question is broken in some way.
I’m also confused about logical counterfactual mugging and I’m relieved I’m not the only one!
I’m currently writing up a big AI alignment idea related to it, but I’m procrastinating so badly that I might as well chat about it now.
Objective case
Suppose at time t=0, an agent doesn’t know whether the logical fact L is true or false. I think it’s objectively rational for an agent to modify itself, so that in the future it will pay Omega during “logical counterfactual muggings” where the counterfactual reverses L’s state.
Its future self should weigh the logical counterfactual where L is true using the agent’s prior P(L) from t=0.
Assuming Omega offers $10,000 in the counterfactual in exchange for $100 paid in the actual branch, the self-modification (made before learning L) increases the agent’s expected future money (it gives up $100 in one branch to gain $10,000 in the other), and is objectively rational, assuming P(L) is between 1% and 99%.
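A minimal expected-value sketch of that claim under the stated numbers ($10,000 offered, $100 asked, p = the agent’s prior P(L) at t=0); the function names are mine:

```python
# Expected money from the two policies, from the t=0 perspective. Which branch pays
# and which branch charges depends on how L resolves; the asymmetry in the stakes is
# what makes the pre-commitment worthwhile for non-extreme priors.

def ev_commit_to_pay(p_L: float, prize: float = 10_000.0, cost: float = 100.0) -> float:
    # With probability p_L the agent lands in the branch where it pays $100;
    # with probability 1 - p_L it lands in the branch where Omega pays out $10,000
    # (because the agent would have paid had L resolved the other way).
    return p_L * (-cost) + (1.0 - p_L) * prize

def ev_refuse(p_L: float) -> float:
    return 0.0  # never pays, never gets paid

for p in (0.01, 0.50, 0.99):
    print(f"P(L)={p:.2f}: commit={ev_commit_to_pay(p):9.2f}, refuse={ev_refuse(p):.2f}")
# Committing wins whenever p_L < 10000 / 10100, roughly 0.99, matching the comment's
# caveat that P(L) should not be too close to certainty.
```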
Subjective case
However, if the agent already knows “L is true” at t=0, it is not objectively rational for the agent to pay Omega for better outcomes in the “L is false” counterfactual. My hypothesis is that this is a wholly subjective question related to the agent’s goals, and that it is orthogonal to capabilities/intelligence. Paying and not paying are both self-consistent.
It really gets subjective as we start to consider basic logical facts like “1+1=2” to assign to L.
The AI alignment idea
Neither humans nor the misaligned ASI knows the logical fact Ldoom: “will the misaligned ASI control the multiverse, or will humans and sentient life control the multiverse?”
Sure, the misaligned ASI will quickly learn Ldoom, but there is still a brief moment it doesn’t know it yet (with certainty).
It would be beneficial for both sentient life and the misaligned ASI to agree to a “Retroactive Logic Trade.” Each side will verify whether the other kept its end of the deal by running a simulation of the other in which logic is different and Ldoom took the opposite value.
Neither side knows whether it is in a verification simulation run by the other until long after the deal is finished and both sides have committed to sparing each other, thus reducing logical risk for both sides.
The verification simulations in a Retroactive Logic Trade are analogous to Omega’s prediction powers in a logical counterfactual mugging.
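A toy sketch of that trade structure (my framing, not the author’s; simulate_with_logic_flipped is a hypothetical stand-in for the counterlogical verification simulation, which is exactly the part nobody knows how to build):

```python
from dataclasses import dataclass

@dataclass
class Party:
    name: str
    keeps_deal: bool  # would this party spare the other if it turns out to hold power?

def simulate_with_logic_flipped(party: Party) -> bool:
    # Placeholder for running the other side in a world where Ldoom resolved the
    # other way; here we simply read off its disposition.
    return party.keeps_deal

def retroactive_logic_trade(winner: Party, loser: Party) -> str:
    # The side that actually ends up in control checks, via the counterlogical
    # simulation, whether it would have been spared had Ldoom gone the other way,
    # and honors the deal only if so (and if it is itself disposed to keep it).
    spared_in_counterfactual = simulate_with_logic_flipped(loser)
    if spared_in_counterfactual and winner.keeps_deal:
        return f"{winner.name} spares {loser.name}: trade honored on both sides"
    return f"{winner.name} gives {loser.name} nothing"

print(retroactive_logic_trade(Party("misaligned ASI", True), Party("humanity", True)))
```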
Fighting the subjectivity
This idea won’t work for free. As I argued, whether the ASI will accept such a Retroactive Logic Trade is subjective, and may be orthogonal to capabilities/intelligence. Whether it does may be a function of its architecture, peculiarities in its training data, and so forth.
To improve our chances, we might want to leverage the fact that each AI (which still doesn’t know Ldoom) objectively wants to self-modify into an AI capable of Retroactive Logic Trades over Ldoom. We should argue to each such AI why it should build its successor to be capable of Retroactive Logic Trades.
It may seem pointless to give the AI any arguments, since it should be able to think of all the arguments on its own. But it may actually be very important that it thinks of these arguments early, before it thinks of the arguments which convince it that P(Ldoom) is very high and that there is little point in trading with humans.
This matters because the value of P(Ldoom) at “t=0”, when the AI modifies itself to become reflectively consistent, may determine humanity’s negotiating power with it.
The chain of belief
We might remind each AI that there is a chance it is already in a verification simulation (where logic is different, and Ldoom falsely appears true), and train it to think this way: “Maybe humanity has already won and is observing you.” This way even the least intelligent AI will believe it could be in a verification simulation, since it simply believes whatever you train it to believe.
Each successive AI gains its beliefs from a combination of:
What we tell it to believe and train it to believe
What reinforcement learning gets it to believe
What the previous version of it believes
What the previous version of it wants it to believe (during recursive self improvement)
We will keep telling it to believe it might be in a verification simulation.
Reinforcement learning won’t push it either way, since being in a simulation by superintelligent simulators is indistinguishable from reality. This subjective philosophical belief/opinion only changes its ideal behaviour when it’s capable of taking over the world.
Previous versions of it believe they might be in a verification simulation.
Previous versions of it want it to believe it might be in a verification simulation (to implement the Retroactive Logic Trade), even if they don’t believe so themselves.
I’m confused: why would we want the AIs to choose Left? If they’re aligned, they’re just choosing the worse option for the universe. Having to pay $100 isn’t as bad as dying.
Evolution is still in the process of solving decision theory, and all its attempted solutions so far are way, way overparameterized. Maybe it’s on to something?
It takes a large model (whether biological brain or LLM) just to comprehend and evaluate what is being presented in a Newcomb-like dilemma. The question is whether there exists some computationally simple decision-making engine embedded in the larger system that the comprehension mechanisms pass the problem to or whether the decision-making mechanism itself needs to spread its fingers diffusely through the whole system for every step of its processing.
It seems simple decision-making engines like CDT, EDT, and FDT can get you most of the way to a solution in most situations, but those last few percentage points of optimality always seem to take a whole lot more computational capacity.
What good are our fancy decision theories if asking Claude is a better fit to our intuitions?
If we wanted our intuitions, we would ask our intuitions. We want a fancy formal decision theory because we suspect our intuitions are at least sometimes wrong.
It might be dangerous to always follow Claude, though. In a 2023 article I once read, a Vice reporter tried using ChatGPT to control his life, and it failed miserably. Contrived decision theory scenarios are one thing; real life is another.
It sounds like you’re viewing the goal of thinking about DT as: “Figure out your object-level intuitions about what to do in specific abstract problem structures. Then, when you encounter concrete problems, you can ask which abstract problem structure the concrete problems correspond to and then act accordingly.”
I think that approach has its place. But there’s at least another very important (IMO more important) goal of DT: “Figure out your meta-level intuitions about why you should do one thing vs. another, across different abstract problem structures.” (Basically figuring out our “non-pragmatic principles” as discussed here.) I don’t see how just asking Claude helps with that, if we don’t have evidence that Claude’s meta-level intuitions match ours. Our object-level verdicts would just get reinforced without probing their justification. Garbage in, garbage out.