“Omega looks at whether we’d pay if in the causal graph the knowledge of the digit of pi and its downstream consequences were edited”
Can you formalize this? In other words, do you have an algorithm for translating an arbitrary mind into a causal graph and then asking this question? Can you try it out on some simple minds, like GPT-2?
I suspect there may not be a simple/elegant/unique way of doing this, in which case the answer to the decision problem depends on the details of how exactly Omega is doing it. E.g., maybe all such algorithms are messy/heuristics-based, and it makes sense to think a bit about whether you can trick the specific algorithm into giving a “wrong prediction” (in quotes because it’s not clear exactly what right and wrong even mean in this context) that benefits you; or maybe you have to self-modify into something Omega’s algorithm can recognize/work with, and it’s a messy cost-benefit analysis of whether this is worth doing, etc.
I agree, it depends on what exactly Omega is doing. I can’t formalize this and haven’t tried to; it’s more of a normative claim. But I imagine a vibes-based approach is to add a set of current beliefs about logic/maths, or an external oracle, to the inputs of FDT (or somehow feed beliefs about maths into GPT-2). Then, in the situation where the input is “digit #3 of pi is odd” and FDT knows the digit was not adversarially selected, it knows it might currently be in the process of determining its outputs for a world that doesn’t exist/won’t happen.
What exactly Omega is doing maybe changes the point at which you stop updating (e.g., maybe Omega edits all of your memory so you remember that pi has always started with 3.15, and makes everything that would normally cause you to believe that 2+2=4 cause you to believe that 2+2=3). But I imagine that for the simple case of being told “if digit #3 of pi is even and I predicted that you’d give me $1 if it’s odd, I’d give you $10^100. Let me look it up now (I’ve not accessed it before!). It’s… 5”, you are updateful up to the moment when Omega says what the digit is, because this is where the divergence starts; and you simply pay.
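A toy numerical version of that last point (my own framing and made-up variable names, not anything specified in the thought experiment): treat the parity claim as an input, and score the two policies from the divergence point, i.e., the moment just before Omega reveals the digit.

```python
# Hypothetical sketch: compare the two policies from the divergence point,
# before the digit is revealed. All names and numbers are illustrative.
REWARD_IF_EVEN = 1e100   # Omega pays this iff the digit is even AND it
                         # predicted you'd pay $1 had the digit been odd
COST_IF_ODD = 1.0        # you hand over $1 when the digit turns out odd

def policy_value(pay_when_odd: bool, p_even: float = 0.5) -> float:
    """Expected value of a policy, evaluated before the digit is revealed.

    p_even is the agent's prior that the not-yet-inspected digit is even;
    since the digit wasn't adversarially selected, 0.5 is a reasonable stand-in.
    """
    if pay_when_odd:
        return p_even * REWARD_IF_EVEN - (1 - p_even) * COST_IF_ODD
    return 0.0  # refusing to pay forfeits the counterfactual reward

# Commit to whichever policy scores higher at the divergence point, then
# follow it even after hearing "it's odd": you simply pay.
print(max([True, False], key=policy_value))  # -> True
```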
There was a math paper which tried to study logical causation, and claimed “we can imbue the impossible worlds with a sufficiently rich structure so that there are all kinds of inconsistent mathematical structures (which are more or less inconsistent, depending on how many contradictions they feature).”
In the end, they didn’t find a way to formalize logical causality, and I suspect it cannot be formalized.
Logical counterfactuals behave badly because “deductive explosion” allows a single contradiction to prove and disprove every possible statement!
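(For concreteness, this is just the standard ex falso derivation, with $Q$ an arbitrary statement:)

\[
\begin{aligned}
&1.\ P \land \lnot P && \text{(the contradiction)}\\
&2.\ P && \text{(from 1)}\\
&3.\ P \lor Q && \text{(from 2, disjunction introduction)}\\
&4.\ \lnot P && \text{(from 1)}\\
&5.\ Q && \text{(from 3 and 4, disjunctive syllogism)}
\end{aligned}
\]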
However, “deductive explosion” does not occur for a UDT agent trying to reason about logical counterfactuals in which it outputs something different from what it actually outputs.
This is because a computation cannot prove its own output.
Why a computation cannot prove its own output
If a computation could prove its own output, it could be programmed to output the opposite of what it proves it will output, which is paradoxical.
This paradox doesn’t occur because a computation trying to prove its own output (and give the opposite output) will have to simulate itself. The simulation of itself starts another nested simulation of itself, creating an infinite recursion which never ends (the computation crashes before it can give any output).
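A minimal sketch of that regress (my own toy example, not anyone’s proposed agent design): a program that tries to learn its own output by running itself, so that it can output the opposite.

```python
# Toy illustration of the infinite regress: the only way this program can
# "prove" its own output is by simulating itself, which starts another
# simulation, and so on forever.
def contrarian() -> str:
    predicted = contrarian()                  # simulate myself to learn my output...
    return "B" if predicted == "A" else "A"   # ...then output the opposite

try:
    print(contrarian())
except RecursionError:
    # The self-simulation never bottoms out (here it hits Python's recursion
    # limit), so the program crashes before producing any output at all.
    print("never determined its own output")
```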
A computation’s output is logically downstream of it. The computation is not allowed to prove logical facts downstream from itself but it is allowed to decide logical facts downstream of itself.
Therefore, very conveniently (and elegantly?), it avoids the “deductive explosion” problem.
It’s almost as if… logic… deliberately conspired to make UDT feasible...?!
Yeah, from the claim that pi starts with two you can easily prove anything. But I think:
(1) something like logical induction should somewhat help: maybe the agent doesn’t know whether some statement is true and isn’t going to run for long enough to start encountering contradictions.
(2) Omega can also maybe intervene on the agent’s experience/knowledge of more accessible logical statements while leaving other things intact, sort of like giving you the experience Eliezer describes here as what would convince him that 2+2=3: https://www.lesswrong.com/posts/6FmqiAgS8h4EJm86s/how-to-convince-me-that-2-2-3. And if that’s what Omega is doing, we should basically ignore our knowledge of maths for the purpose of thinking about logical counterfactuals.
I was thinking that deductive explosion occurs for logical counterfactuals encountered during counterfactual mugging, but doesn’t occur for logical counterfactuals encountered when a UDT agent merely considers what would happen if it outputs something else (as a logical computation).
I agree that logical counterfactual mugging can work, just that it probably can’t be formalized, and may have an inevitable degree of subjectivity to it.
Coincidentally, just a few days ago I wrote a post on how we can use logical counterfactual mugging to convince a misaligned superintelligence to give humans just a little, even if it observes the logical information that humans lose control every time (and therefore have nothing to trade with it), unless math and logic itself were different. :) Leave a comment there if you have time; in my opinion it’s more interesting and concrete.
“This paradox doesn’t occur because a computation trying to prove its own output (and give the opposite output) will have to simulate itself”
Due to Löb’s theorem, if a computation knows “if I find a proof that I output A, then I will output A”, then it proves that it outputs A, without any need for recursion. This is why you really shouldn’t output something just because you’ve proved that you will.
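For reference, the formalized version of Löb’s theorem, reading $\Box X$ as “the computation’s proof search finds a proof of $X$” and letting $O$ abbreviate “the computation outputs A”:

\[
\Box(\Box O \to O) \to \Box O
\]

So if the computation can prove about itself that finding a proof of $O$ leads it to output A, it thereby proves $O$ outright, with no nested self-simulation involved.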
(MIRI did some work on logical induction.)
I’ll give the post a read!