I don’t think Newcomblike dilemmas are relevant to the reasoning of potentially dangerous AIs
When a programmer writes software, it’s because they have a prediction in mind about how the software is likely to behave in the future: they have goals they want the software to achieve, and they write the code they think will behave in the intended way. AGI systems are particularly likely to end up in Newcomblike scenarios if we build them to learn our values by reasoning about their programmers’ intentions and goals, or if the system constructs any intelligent subprocesses or subagents to execute tasks, or executes significant self-modifications at all. In the latter cases, the system itself is then in a position of designing reasoning algorithms based on predictions about how those algorithms will behave in the future.
The same principle holds if two agents are modeling each other in real time, as opposed to predicting a future agent; e.g., two copies of an AGI system, or subsystems of a single AGI system. The copies don’t have to be exact, and the systems don’t have to have direct access to each other’s source code, for the same issues to crop up.
One secondary motivation for TDT/UDT/FDT is a fallacious argument that these theories endorse cooperation in the true prisoner’s dilemma.
What’s the fallacy you’re claiming?
Informal arguments seem to be the load-bearing element in applying these theories to any particular problem; the technical work seems to consist mainly of formalizing narrow instances of these theories so that they agree with the informal intuition.
This seems wrong, if you’re saying that we can’t formally establish the behavior of different decision theories, or that applying theories to different cases requires ad-hoc emendations; see section 5 of “Functional Decision Theory” (and subsequent sections) for a comparison and step-by-step walkthrough of procedures for FDT, CDT, and EDT. One of the advantages we claim for FDT over CDT and EDT is that it doesn’t require ad-hoc tailoring for different dilemmas (e.g., ad-hoc precommitment methods or ratification procedures, or modifications to the agent’s prior).
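(To make the contrast concrete, here is a minimal sketch of the Newcomb comparison that section 5 walks through. The payoff numbers and the perfect-predictor assumption are stipulated for illustration; this is the shape of the comparison, not the paper’s formalism.)

```python
# Minimal sketch: CDT vs. FDT on Newcomb's problem with a perfect predictor.
# Payoffs and the perfect-predictor assumption are stipulated for illustration.

PAYOFF = {  # (action, box contents) -> reward
    ("one-box", "full"): 1_000_000,
    ("one-box", "empty"): 0,
    ("two-box", "full"): 1_001_000,
    ("two-box", "empty"): 1_000,
}

def cdt_value(action, p_full):
    """CDT: the box contents are causally fixed, so intervene on the action alone."""
    return p_full * PAYOFF[(action, "full")] + (1 - p_full) * PAYOFF[(action, "empty")]

def fdt_value(action):
    """FDT: intervene on the decision function's output; a perfect predictor's
    simulation outputs the same value, so the contents covary with the action."""
    contents = "full" if action == "one-box" else "empty"
    return PAYOFF[(action, contents)]

for action in ("one-box", "two-box"):
    print(action, cdt_value(action, p_full=0.5), fdt_value(action))
# CDT prefers two-boxing for any fixed p_full; FDT prefers one-boxing.
```

The two theories run the same expected-utility procedure; they differ only in which variables the counterfactual intervention is allowed to move.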
I don’t know about FDT, but a fundamental assumption behind TDT and UDT…
“UDT” is ambiguous and has been used to refer to a lot of different things, but Wei Dai’s original proposals of UDT are particular instances of FDT. FDT can be thought of as a generalization of Wei Dai’s first versions of UDT, one that makes fewer commitments than Wei Dai’s particular approach.
but a fundamental assumption behind TDT and UDT is the existence of a causal structure behind logical statements, which sounds implausible to me.
None of the theories mentioned make any assumption like that; see the FDT paper above.
First, to be clear, I am referring to things such as this description of the prisoner’s dilemma and EY’s claim that TDT endorses cooperation. The published material has been careful to only say that these decision theories endorse cooperation among identical copies running the same source code, but as far as I can tell some researchers at MIRI still believe this stronger claim and this claim has been a major part of the public perception of these decision theories (example here; see section II).
The problem is that when two FDT agents with different utility functions and different prior knowledge face a prisoner’s dilemma with each other, their decisions are actually two different logical variables X0 and X1. The argument for cooperating is that X0 and X1 are sufficiently similar to one another that in the counterfactual where X0=C we also have X1=C. However, you could just as easily take the opposite premise, where X0 and X1 are sufficiently dissimilar that counterfactually changing X0 has no effect on X1. Then you are left with the usual CDT analysis of the game. Given the vagueness of logical counterfactuals, it is impossible to distinguish these two situations.
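(To put numbers on this complaint: here is a minimal sketch in which the recommendation is driven entirely by a stipulated correlation parameter between X0 and X1; nothing in the formalism pins that parameter down.)

```python
# Sketch: in a one-shot prisoner's dilemma, an FDT-style recommendation is
# driven entirely by an assumed counterfactual linkage between X0 and X1.
# The linkage parameter `rho` below is stipulated, not derived from anything.

PAYOFF = {  # (my action, their action) -> my reward
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def counterfactual_value(my_action, rho):
    """rho = probability that setting X0 = my_action also sets X1 = my_action.
    rho = 1 is 'we are relevantly the same computation'; rho = 0 reduces to a
    CDT analysis against an independent opponent (taken to be uniform here)."""
    linked = PAYOFF[(my_action, my_action)]
    unlinked = 0.5 * PAYOFF[(my_action, "C")] + 0.5 * PAYOFF[(my_action, "D")]
    return rho * linked + (1 - rho) * unlinked

for rho in (1.0, 0.5, 0.0):
    best = max(("C", "D"), key=lambda a: counterfactual_value(a, rho))
    print(rho, best)
# rho = 1 -> cooperate; rho = 0 -> defect. The formalism doesn't pin rho down.
```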
Here’s a related question: What does FDT say about the centipede game? There’s no symmetry between the players, so I can’t just plug in the formalism. I don’t see how you can give an answer that’s in the spirit of cooperating in the prisoner’s dilemma without reaching the conclusion that FDT involves altruism among all FDT agents through some kind of veil-of-ignorance argument. And accepting that conclusion runs counter to the affine-transformation invariance of utility functions.
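(For reference, a minimal sketch of the standard backward-induction analysis of a short centipede game; the payoffs are an illustrative convention, not taken from any particular source. The open question is what FDT substitutes for this analysis.)

```python
# Sketch: backward induction on a 4-move centipede game (payoffs illustrative).
# Players alternate; "take" ends the game, "pass" grows the pot.
# This is the standard CDT/subgame-perfect analysis the question contrasts with.

# PAYOFFS[t] = (player 0's reward, player 1's reward) if the game is taken at
# move t; PAYOFFS[-1] is the outcome if every move is "pass".
PAYOFFS = [(2, 0), (1, 3), (4, 2), (3, 5), (6, 4)]

def backward_induction():
    outcome = PAYOFFS[-1]               # value if all remaining moves are "pass"
    plan = []
    for t in reversed(range(len(PAYOFFS) - 1)):
        mover = t % 2
        take, keep = PAYOFFS[t], outcome
        if take[mover] >= keep[mover]:  # mover prefers ending the game now
            outcome, choice = take, "take"
        else:
            outcome, choice = keep, "pass"
        plan.append((t, choice))
    return outcome, list(reversed(plan))

print(backward_induction())
# -> player 0 takes at the first move, even though mutual passing pays both more.
```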
“but a fundamental assumption behind TDT and UDT is the existence of a causal structure behind logical statements, which sounds implausible to me.”
“None of the theories mentioned make any assumption like that; see the FDT paper above.”
Page 14 of the FDT paper:
Instead of a do operator, FDT needs a true operator, which takes a logical sentence φ and updates P to represent the scenario where φ is true...
...Equation (4) works given a graph that accurately describes how changing the value of a logical variable affects other variables, but it is not yet clear how to construct such a thing—nor even whether it can be done in a satisfactory manner within Pearl’s framework.
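(Schematically, and as a paraphrase rather than the paper’s exact notation, the contrast is between intervening on the physical act and intervening on the output of the decision function itself, wherever it is instantiated:)

$$\mathrm{CDT}:\ \operatorname{argmax}_a\, \mathbb{E}\big[\,U \mid \mathrm{do}(A = a)\,\big] \qquad \mathrm{FDT}:\ \operatorname{argmax}_a\, \mathbb{E}\big[\,U \mid \mathrm{true}\big(\mathrm{FDT}(\underline{P}, \underline{x}) = a\big)\,\big]$$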
“This seems wrong, if you’re saying that we can’t formally establish the behavior of different decision theories, or that applying theories to different cases requires ad-hoc emendations; see section 5 of “Functional Decision Theory” (and subsequent sections) for a comparison and step-by-step walkthrough of procedures for FDT, CDT, and EDT. One of the advantages we claim for FDT over CDT and EDT is that it doesn’t require ad-hoc tailoring for different dilemmas (e.g., ad-hoc precommitment methods or ratification procedures, or modifications to the agent’s prior).”
The main thing that distinguishes FDT from CDT is how the true operator mentioned above functions. As far as I’m aware, this operator is always inserted by hand. This is easy to do for situations where entities make perfect simulations of one another, but there aren’t even rough guidelines for what to do when the computations involved cannot be delineated in such a clean manner. In addition, if this were a rich research field I would expect more “math that bites back”, i.e., substantive results that reduce to clearly-defined mathematical problems whose answers weren’t expected during the formalization.
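(A minimal sketch of the clean case, with a purely hypothetical setup: when the predictor evaluates the agent’s literal source, the agent’s decision and the prediction are one logical variable, so there is no ambiguity about where the true operator applies.)

```python
# Sketch of the clean case (hypothetical setup): the predictor evaluates the
# agent's literal source, so the decision and the prediction are the same
# logical variable, and there is only one place the `true` operator can apply.

def agent_source(observation):
    # Whatever policy the agent's code encodes; constant here for simplicity.
    return "one-box"

def predictor(agent):
    # A perfect predictor: it runs the very same function the agent runs.
    return agent("newcomb")

decision = agent_source("newcomb")
prediction = predictor(agent_source)
assert decision == prediction  # one computation, two call sites

# The hard case the text points at: if the predictor runs a lossy or merely
# similar model of the agent, nothing says whether the two computations count
# as "the same" logical variable for purposes of the true operator.
```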
This point about “load-bearing elements” is at its root an intuitive judgement that might be difficult for me to convey properly.