You mean as a matter of practice? If I knew a TOE (theory of everything) and the initial conditions of the universe, it’s possible that every question I care about is a mathematical one. So I care about the probabilities of mathematical questions to exactly the same extent that I care about the probabilities of events, right?
Moreover, if I am dealing with another UDT agent, statements about their decisions are mathematical statements. So in the prisoner’s dilemma, for example, you are forced to bet on a statement like “The other player in this prisoner’s dilemma will cooperate”. As far as I can tell from a brief discussion with cousin_it, there is no clear way for an explicit reasoner to prove such a statement with certainty (because of incompleteness issues).
> So in the prisoner’s dilemma, for example, you are forced to bet on a statement like “The other player in this prisoner’s dilemma will cooperate”. As far as I can tell from a brief discussion with cousin_it, there is no clear way for an explicit reasoner to prove such a statement with certainty (because of incompleteness issues).
This is technically wrong; see the posts linked from the ADT wiki page. While you might be unable to prove what you’ll do, or what other agents whose actions depend on yours would do, you can prove such statements if you assume that you’ll perform a given possible action; this is how you decide which of the possible actions to pick.
So where you won’t be able to unconditionally simulate other agents, you might be able to simulate them conditionally on each of your own possible decisions (with the actual decision depending on the conclusions drawn from that simulation).
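A minimal Python sketch of how such conditional reasoning might be organized, with a hypothetical `prove_opponent_response` oracle standing in for the theorem prover (every name here is an illustrative assumption, not anything from the linked posts):

```python
# Sketch: simulate the opponent conditionally on each of your own
# possible actions, then pick the action whose conditional conclusion
# gives the best utility.

def decide(actions, prove_opponent_response, utility):
    outcomes = {}
    for a in actions:
        # What the opponent provably does under the assumption "A = a".
        their_action = prove_opponent_response(a)
        outcomes[a] = utility(a, their_action)
    # The actual decision depends on the conditionally drawn conclusions.
    return max(outcomes, key=outcomes.get)

# Example: a prisoner's dilemma against an opponent provably mirroring you.
payoff = {("C", "C"): 2, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1}
print(decide(["C", "D"], lambda a: a, lambda a, b: payoff[(a, b)]))  # "C"
```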
> While you might be unable to prove what you’ll do, or what other agents whose actions depend on yours would do, you can prove such statements if you assume that you’ll perform a given possible action; this is how you decide which of the possible actions to pick.
Fine. You need to bet on statements of the form “If I cooperate, my opponent will cooperate.” The difference makes no difference to my post.
> This is technically wrong; see the posts linked from the ADT wiki page.
None of the posts linked from the ADT page resolve my concern, which is this: if you have two ADT agents playing the prisoner’s dilemma, then unless they use very similar theorem provers there is no way they can cooperate. This is a very simple problem; if cooperation fails here, you have basically no hope of cooperating in any non-trivial situation.
The problem seems to be that there is no way both ADT agents can prove that the other’s axioms are consistent (cousin_it raised the concern, but it also seems critical to me). So I can’t rule out the possibility that the other ADT agent will randomly prove something like “If I defect, I will get 100000 utility” and then defect regardless of what my action is.
Do you know some way around this difficulty?
> The problem seems to be that there is no way both ADT agents can prove that the other’s axioms are consistent (cousin_it raised the concern, but it also seems critical to me).
This is not needed for the basic cooperation scenario discussed in those posts. It’s possible to prove that the other agent performs the same action as you.
(Similarly, if you’ve proved that 2+2=4, it can still be wrong if your axioms were inconsistent, but this is not a concern that can be addressed on the same level where you’re proving things like 2+2=4.)
> So I can’t rule out the possibility that the other ADT agent will randomly prove something like “If I defect, I will get 100000 utility” and then defect regardless of what my action is.
(You can, see the posts.)
Which post am I supposed to see?
Here is how I imagine ADT working: I search for proofs that [A = X ⇒ U >= Y], or something like that, and then output the X that maximizes the best provable lower bound on my utility.
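A minimal Python sketch of that search, with a hypothetical `provable(X, Y)` oracle in place of a real bounded theorem prover (the names are illustrative assumptions):

```python
# Sketch: for each action X, find the best provable lower bound Y in
# [A = X => U >= Y], then output the X with the best bound overall.

def adt_decision(actions, candidate_bounds, provable):
    best_action, best_bound = None, float("-inf")
    for X in actions:
        for Y in sorted(candidate_bounds, reverse=True):
            if provable(X, Y):  # is [A = X => U >= Y] provable?
                if Y > best_bound:
                    best_action, best_bound = X, Y
                break  # highest provable bound for this X already found
    return best_action
```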
In this view, I do not see how to prove cooperation in any non-trivial situation whatsoever. Could you link to a particular argument which shows how to prove cooperation in some situation?
cousin_it’s AI cooperation in practice post shows how an ADT-like system can cooperate: one which looks explicitly for proofs of the form “The other guy will do what I do” and, if it finds one, acts appropriately. Could you describe how you would generalize this ad-hoc solution to other cooperation problems? It seems like you really want the general version of ADT to be able to work on the prisoner’s dilemma, and I don’t see how it would work.
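For concreteness, roughly what that ad-hoc cooperator looks like (my reading of the post, sketched with an assumed `provable` oracle over program behavior, not the post’s actual code):

```python
# Sketch: cooperate exactly when you can prove the other program's
# output equals your own; otherwise defect.

def mutual_output_cooperator(my_source, their_source, provable):
    if provable(f"eval({their_source}) == eval({my_source})"):
        return "C"  # proof found: the other guy will do what I do
    return "D"      # no proof found within the search: defect
```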
AI cooperation in practice gives a specific setup that allows cooperation. The post links to a sketch of a proof for AIs cooperating in the setup. Each of these AIs will do well against arbitrary programs. This is a counterexample to your statement that
> So I can’t rule out the possibility that the other ADT agent will randomly prove something like “If I defect, I will get 100000 utility” and then defect regardless of what my action is.
What a reduction of “could” could look like gives a more flexible scheme that’s applicable to this case. Notion of Preference in Ambient Control gives a logical framework for describing the reasoning process of such agents.
It works in this special case. The special case shows how it could work. I make no further claims.
Sorry, I think we are talking across each other.
The setup in AI cooperation in practice is a specific algorithm that looks like ADT and cooperates. I agree that this is very good to know, and I am familiar with the argument.
I would like to be able to prove that algorithms running a general form of ADT can cooperate on the prisoner’s dilemma. I think I am failing to communicate the distinction between this and cousin_it’s result.
The issue is that the algorithms considered in cousin_it’s result aren’t using ADT. They are looking for a proof of a particular statement, constructed in an ad-hoc way (how do you generalize the statement “Me and the other AI have the same output”?), and if they find it they cooperate. They are not looking for proofs of statements of the form “If I cooperate, then I will get a certain utility.” As far as I can tell, if you were to consider two algorithms actually using ADT, instead of the ADT-like algorithm cousin_it describes, then they would not be able to find a proof of a statement of the form “My opponent will do the same thing I will.” The reason they cannot find such a proof is that they cannot exclude the possibility that the other AI will find a proof such as “By defecting I attain utility 100000.” Therefore they cannot prove, under any conditions, that the other AI will cooperate. Unless the other AI is using a very similar algorithm to search for proofs, knowledge of my own output does not help me prove that the other AI won’t find a proof of “By defecting I attain utility 100000.”
The description of what you are talking about helps.
This discussion is related to the problem that an ADT agent considering the standard Newcomb’s problem could, in principle, find the following complete collection of moral arguments: [two-box ⇒ 1000; one-box ⇒ 0]. Since, as a result, the agent would two-box, both statements are true and are therefore permitted to be provable by the agent’s inference system. We only expect that it’ll instead find the arguments [two-box ⇒ 1000; one-box ⇒ 1000000] because of some unstated assumptions about its inference system, assumptions that have nothing to do with the logical validity of the statements it proves.
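To spell out the mechanism: the unrealized conditional is vacuously true. A toy Python illustration of just the material-implication point (nothing here is an ADT implementation):

```python
# If the agent in fact two-boxes, "one-box" is false, so any statement
# of the form "one-box => U = u" holds vacuously.

def implies(p, q):
    return (not p) or q  # material implication

agent_one_boxes = False  # as a result of its arguments, the agent two-boxes
print(implies(agent_one_boxes, True))   # "one-box => 1000000" is true
print(implies(agent_one_boxes, False))  # "one-box => 0" is also true
```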
And so, since the system is not fully specified by merely saying that it “implements ADT”, and its decisions depend on the missing details, we can’t prove that systems that “implement ADT” will in fact cooperate. Stating further assumptions is necessary.
(This is basically a proof that the definition of ADT agents can’t be proven to imply their cooperation in the PD.)
I see that being an ADT agent could never be enough to guarantee cooperation.
The thing I would like to see—the thing whose absence strikes me as a serious issue—is a generally applicable instantiation of ADT which cooperates with a slightly modified version of itself.
For example, consider the following agent. It knows its own source code. Two copies of it are then run, given large integer inputs N and M, and play the prisoner’s dilemma with each other. The first instance searches for proofs of length <= N that “cooperating → utility >= A” or “defecting → utility >= B”. It then considers the best lower bound it has proved, and outputs the corresponding decision. The second instance does the same thing, but searches for proofs of length <= M. It seems like if N and M are sufficiently large these two algorithms should cooperate, but they don’t.
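For concreteness, a sketch of one instance, assuming a hypothetical `best_provable_bound(source, action, limit)` prover that returns the best lower bound provable within the given proof-length limit (illustrative names only):

```python
# Sketch: prove the best bounds A and B for "cooperating -> utility >= A"
# and "defecting -> utility >= B" within the limit, then output the
# decision with the better bound.

def bounded_adt_player(source, limit, best_provable_bound):
    a = best_provable_bound(source, "C", limit)  # cooperating -> utility >= A
    b = best_provable_bound(source, "D", limit)  # defecting -> utility >= B
    return "C" if a > b else "D"

# The two copies differ only in their limits:
#   bounded_adt_player(source, N, prover) and bounded_adt_player(source, M, prover)
```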
Is there something confused about my desire for a “less ad hoc” agent which can cooperate on PD? Is there some reason that we should expect to be able to find ADT agents which cooperate in any reasonable situation without finding a less ad hoc algorithm that cooperates on PD? Is there something confused about my belief that inability to consistently assume consistency is the reason for this particular failure to cooperate? Is there likely to be some other way to get around the difficulty?
Edit: To make this more explicit, I feel intuitively that AIs in this situation should be able to cooperate, because I feel as though it is reasonable to make some probabilistic assumption of consistency. If you could formalize this notion, I suspect you would find reasonable agents who cooperate on PD. To me this seems like the easiest way to resolve the problem, because I don’t see where else there is to go.
> It seems like if N and M are sufficiently large these two algorithms should cooperate, but they don’t.
Why not? My argument shows that they don’t necessarily cooperate, but I expect that they will cooperate in any natural implementation not specifically contrived to prevent that (although I don’t have a proof). You seem to expect to have shown that they necessarily don’t cooperate. If we specify the ADT agents further, then they can of course know each other’s additional specification, and so know enough to conditionally prove each other’s cooperation. In this particular case, with proof length limits N and M, if they can both prove that within the smaller of the two limits, the fact that the limits are different shouldn’t matter (other than as a technical issue in the proofs).
> Is there some reason that we should expect to be able to find ADT agents which cooperate in any reasonable situation without finding a less ad hoc algorithm that cooperates on PD?
I expect most reasonable implementations of ADT agents with insane amounts of computational resources to act correctly, for reasons analogous to those given in cousin_it’s proof (there are some simplifications in the proof that generalize better).
> I feel intuitively that AIs in this situation should be able to cooperate, because I feel as though it is reasonable to make some probabilistic assumption of consistency.
I don’t see what consistency has to do with anything here. The problem with possible implementations of ADT agents I pointed out is not caused by unsoundness of their inference system with respect to the intended semantics. All the statements those agents prove are true.
I certainly haven’t proved that any particular reasonable ADT agents don’t cooperate. But I have convinced myself that cousin_it’s argument is unlikely to generalize; I don’t see how they could conditionally prove that the other would cooperate, because if their axioms are inconsistent, the agent with the longer proof length may, even if he searches for proofs in a very natural way, prove that defecting is unconditionally better than cooperating.
> because if their axioms are inconsistent, the agent with the longer proof length may, even if he searches for proofs in a very natural way, prove that defecting is unconditionally better than cooperating.
Again, that’s not the problem. ADT agents can just make the decision the moment they find a complete collection of moral arguments (i.e. a utility function, in the sense of my post), which means that if a given utility function composed of agent-provable moral arguments is found, the decision follows from that.
(If the moral arguments have the form [A=A1 ⇒ U=U1], then there can be only one such agent-provable moral argument for each A1, so once the agent has found one for each A1, it may as well stop looking. Consequences appear consistent.)
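A sketch of that stopping rule, assuming a hypothetical `proof_stream` that enumerates agent-provable moral arguments as (action, utility) pairs:

```python
# Sketch: enumerate proved arguments [A = A1 => U = U1] and stop at the
# first complete collection (one argument per possible action).

def decide_by_moral_arguments(actions, proof_stream):
    found = {}
    for action, utility in proof_stream:
        found.setdefault(action, utility)  # at most one argument per action
        if len(found) == len(actions):     # complete collection: stop looking
            return max(found, key=found.get)
    return None  # search exhausted without a complete collection
```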
Consider a pair (A,B) of provable (but not necessarily agent-provable) moral arguments, for example A = “one-box ⇒ utility=1000000” and B = “two-box ⇒ utility=1000”. These are statements provable (true) in the intended semantics, which means that if an agent manages to prove B’ = “two-box ⇒ utility=1001000”, and two-boxes as a result of finding (A,B’), its inference system must be inconsistent (or something). But there is a different way in which an agent can decide to two-box, not requiring an inconsistency. It can just find a moral argument A’ = “one-box ⇒ utility=23” before finding A, and decide to choose two-boxing based on (A’,B). Both A’ and B would be true.
(I know I’m reiterating the same thing I said before, but you haven’t signaled that you’ve understood the argument, and you keep repeating something about inconsistency that I don’t understand the relevance of, without answering my questions about that.)
I now understand what you are saying; thank you for being patient.
What I am confused by now is your optimism. I presented an argument based on incompleteness which convinced me that reasonable ADT agents won’t cooperate. In response you pointed out that incompleteness isn’t really the problem—there are other failure modes anyway. So why is it that you believe “reasonable” ADT agents will in fact cooperate, when I (unaware of this other failure mode) already believed they wouldn’t?
Part of what convinces me that something bad is going on is the following. Consider an agent in Newcomb’s problem who searches exhaustively for a proof that two-boxing is better than one-boxing; if he finds one he will two-box, and if he can’t find one he will one-box by default. By an argument like the one given in AI cooperation in practice, this agent will two-box.
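Schematically (with `provable` once more an assumed proof-search oracle; this is a sketch of the agent just described, not a claim about how the proof goes):

```python
# Sketch: search exhaustively for a proof that two-boxing is better;
# two-box if one is found, one-box by default otherwise.

def default_one_boxer(provable):
    if provable("utility(two-box) > utility(one-box)"):
        return "two-box"
    return "one-box"
```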
Your optimism seems to depend heavily on the difference between “enumerating all proofs up to a certain length and looking for the best provable utility guarantees” and “enumerating proofs until you find a complete set of moral arguments, and behaving randomly if you can’t.” Why do you believe that a complete set of moral arguments is provable in reasonable situations? Do you know of some non-trivial example where this is really the case?
> Your optimism seems to depend heavily on the difference between “enumerating all proofs up to a certain length and looking for the best provable utility guarantees” and “enumerating proofs until you find a complete set of moral arguments, and behaving randomly if you can’t.”
Yes, and this answers your preceding question:
> In response you pointed out that incompleteness isn’t really the problem—there are other failure modes anyway.
The strategy of “enumerating proofs until you find a complete set of moral arguments” doesn’t suffer from the incompleteness issue (whatever it is, if it’s indeed there, which I doubt can have the simple form you referred to).
> Why do you believe that a complete set of moral arguments is provable in reasonable situations?
I don’t believe it is provable in any reasonable time, but perhaps, given enough time, it can often be proven. Building a set of mathematical tools for reasoning about this might prove a fruitful exercise, but I have shelved this line of inquiry for the last few months and haven’t been working on it.
> None of the posts linked from the ADT page resolve my concern, which is this: if you have two ADT agents playing the prisoner’s dilemma, then unless they use very similar theorem provers there is no way they can cooperate.
My point was that in principle they can cooperate, and that the reasons for the difficulty of arranging cooperation in practice are not related to incompleteness. Considering conditional statements gets rid of the incompleteness-related problems.