The description of what you are talking about helps.
This discussion is related to the problem that an ADT agent considering standard Newcomb’s problem could, in principle, find the following complete collection of moral arguments: [two-box ⇒ 1000; one-box ⇒ 0]. Since as a result, the agent would two-box, both statements are true and therefore permitted to be provable by agent’s inference system. We only expect that it’ll instead find the arguments [two-box ⇒ 1000; one-box ⇒ 1000000] out of some unstated assumptions about its inference system that have nothing to do with logical validity of the statements it proves.
And so, since the system is not fully specified by merely saying that it “implements ADT”, and its decisions depend on the missing details, we can’t prove that systems that “implement ADT” will in fact cooperate. Stating further assumptions is necessary.
(This is basically a proof that definition of ADT agents can’t be proven to imply their cooperation in PD.)
I see that being an ADT agent could never be enough to guarantee cooperation.
The thing I would like to see—the thing whose absence strikes me as a serious issue—is a generally applicable instantiation of ADT which cooperates with a slightly modified version of itself.
For example, consider the following agent. It knows its own source code. Two copies of it are then run, given large integer inputs N, M, and play the prisoner’s dilemma with each other. The first instance searches for proofs of length ⇐ N that “cooperating → utility >= A” or “defecting → utility >= B”. It then considers the best lower bound it has proved, and outputs the corresponding decision. The second instance does the same thing, but searching for proofs of length ⇐ M. It seems like if N, M are sufficiently large these two algorithms should cooperate, but they don’t.
Is there something confused about my desire for a “less ad hoc” agent which can cooperate on PD? Is there some reason that we should expect to be able to find ADT agents which cooperate in any reasonable situation without finding a less ad hoc algorithm that cooperates on PD? Is there something confused about my belief that inability to consistently assume consistency is the reason for this particular failure to cooperate? Is there likely to be some other way to get around the difficulty?
Edit: To make this more explicit, I feel intuitively that AIs in this situation should be able to cooperate, because I feel as though it is reasonable to make some probabilistic assumption of consistency. If you could formalize this notion, I suspect you would find reasonable agents who cooperate on PD. To me this seems like the easiest way to resolve the problem, because I don’t see where else there is to go.
It seems like if N, M are sufficiently large these two algorithms should cooperate, but they don’t.
Why not? My argument shows that they don’t necessarily cooperate, but I expect that they will cooperate in any natural implementation not specifically contrived to prevent that (although I don’t have a proof). You seem to expect to have shown that they necessarily don’t cooperate. If we specify ADT agents additionally, then they can of course know each other’s additional specification, and so know enough to conditionally prove each other’s cooperation. In this particular case, with proof length limits M and N, if they can both prove that faster than the lowest time limit, the fact that the time limits are different shouldn’t matter (other than as a technical issue in the proofs).
Is there some reason that we should expect to be able to find ADT agents which cooperate in any reasonable situation without finding a less ad hoc algorithm that cooperates on PD?
I expect most reasonable implementations of ADT agents with insane amount of computational resources to act correctly, for reasons analogous to those given in cousin_it’s proof (there are some simplifications in the proof that generalize better).
I feel intuitively that AIs in this situation should be able to cooperate, because I feel as though it is reasonable to make some probabilistic assumption of consistency.
I don’t see what consistency has to do with anything here. The problem with possible implementations of ADT agents I pointed out is not caused by unsoundness of their inference system with respect to the intended semantics. All the statements those agents prove are true.
I certainly haven’t proved that any particular reasonable ADT agents don’t cooperate. But I have convinced myself that cousin_it’s argument is unlikely to generalize; I don’t see how they could prove conditionally the other would coordinate, because if their axioms are inconsistent the guy with the longer proof length may, even if he searches for proofs in a very natural way, prove that defecting is unconditionally better than cooperating.
because if their axioms are inconsistent the guy with the longer proof length may, even if he searches for proofs in a very natural way, prove that defecting is unconditionally better than cooperating.
Again, that’s not the problem. ADT agents can just make the decision the moment they find a complete collection of moral arguments (i.e. a utility function, in the sense of my post), which means that if a given utility function composed of agent-provable moral arguments is found, the decision follows from that.
(If the moral arguments have the form [A=A1 ⇒ U=U1], then there can be only one such agent-provable moral argument for each A1, so once the agent found one for each A1, it can as well stop looking. Consequences appear consistent.)
Consider a pair (A,B) of provable (but not necessarily agent-provable) moral arguments, for example A=”one-box ⇒ utility=1000000“ and B=”two-box ⇒ utility=1000”. These are statements provable (true) in the intended semantics, which means that if an agent manages to prove B’=”two-box ⇒ utility=1001000“, and two-box as a result of finding (A,B’), its inference system must be inconsistent (or something). But there is a different way in which an agent can decide to two-box, not requiring an inconsistency. It can just find a moral argument A’=”one-box ⇒ utility=23” before finding A, and decide to choose two-boxing based on (A’,B). Both A’ and B would be true.
(I know I’m reiterating the same thing I said before, but you haven’t signaled that you’ve understood the argument, and you keep repeating something about inconsistency that I don’t understand the relevance of, without answering my questions about that.)
I now understand what you are saying; thank you for being patient.
What I am confused by now is your optimism. I presented an argument based on incompleteness which convinced me that reasonable ADT agents won’t cooperate. In response you pointed out that incompleteness isn’t really the problem—there are other failure modes anyway. So why is it that you believe “reasonable” ADT agents will in fact cooperate, when I (unaware of this other failure mode) already believed they wouldn’t?
Part of what convinces me that something bad is going on is the following. Consider an agent in Newcomb’s problem who searches exhaustively for a proof that two-boxing is better than one-boxing; if he finds one he will two-box, and if he can’t he will one-box by default. By an argument like the one given in AI coordination in practice, this agent will two-box.
Your optimism seems to depend heavily on the difference between “enumerating all proofs up to a certain length and looking for the best provable utility guarantees” and “enumerating proofs until you find a complete set of moral arguments, and behaving randomly if you can’t.” Why do you believe that a complete set of moral arguments is provable in reasonable situations? Do you know of some non-trivial example where this is really the case?
Your optimism seems to depend heavily on the difference between “enumerating all proofs up to a certain length and looking for the best provable utility guarantees” and “enumerating proofs until you find a complete set of moral arguments, and behaving randomly if you can’t.”
Yes, and this answers your preceding question:
In response you pointed out that incompleteness isn’t really the problem—there are other failure modes anyway.
The strategy of “enumerating proofs until you find a complete set of moral arguments” doesn’t suffer from the incompleteness issue (whatever it is, if it’s indeed there, which I doubt can have the simple form you referred to).
Why do you believe that a complete set of moral arguments is provable in reasonable situations?
I don’t believe it is provable in any reasonable time, but perhaps given enough time it can often be proven. Building a set of mathematical tools for reasoning about this might prove a fruitful exercise, but I have shelved this line of inquiry for the last few months, and wasn’t working on it.
The description of what you are talking about helps.
This discussion is related to the problem that an ADT agent considering standard Newcomb’s problem could, in principle, find the following complete collection of moral arguments: [two-box ⇒ 1000; one-box ⇒ 0]. Since as a result, the agent would two-box, both statements are true and therefore permitted to be provable by agent’s inference system. We only expect that it’ll instead find the arguments [two-box ⇒ 1000; one-box ⇒ 1000000] out of some unstated assumptions about its inference system that have nothing to do with logical validity of the statements it proves.
And so, since the system is not fully specified by merely saying that it “implements ADT”, and its decisions depend on the missing details, we can’t prove that systems that “implement ADT” will in fact cooperate. Stating further assumptions is necessary.
(This is basically a proof that definition of ADT agents can’t be proven to imply their cooperation in PD.)
I see that being an ADT agent could never be enough to guarantee cooperation.
The thing I would like to see—the thing whose absence strikes me as a serious issue—is a generally applicable instantiation of ADT which cooperates with a slightly modified version of itself.
For example, consider the following agent. It knows its own source code. Two copies of it are then run, given large integer inputs N, M, and play the prisoner’s dilemma with each other. The first instance searches for proofs of length ⇐ N that “cooperating → utility >= A” or “defecting → utility >= B”. It then considers the best lower bound it has proved, and outputs the corresponding decision. The second instance does the same thing, but searching for proofs of length ⇐ M. It seems like if N, M are sufficiently large these two algorithms should cooperate, but they don’t.
Is there something confused about my desire for a “less ad hoc” agent which can cooperate on PD? Is there some reason that we should expect to be able to find ADT agents which cooperate in any reasonable situation without finding a less ad hoc algorithm that cooperates on PD? Is there something confused about my belief that inability to consistently assume consistency is the reason for this particular failure to cooperate? Is there likely to be some other way to get around the difficulty?
Edit: To make this more explicit, I feel intuitively that AIs in this situation should be able to cooperate, because I feel as though it is reasonable to make some probabilistic assumption of consistency. If you could formalize this notion, I suspect you would find reasonable agents who cooperate on PD. To me this seems like the easiest way to resolve the problem, because I don’t see where else there is to go.
Why not? My argument shows that they don’t necessarily cooperate, but I expect that they will cooperate in any natural implementation not specifically contrived to prevent that (although I don’t have a proof). You seem to expect to have shown that they necessarily don’t cooperate. If we specify ADT agents additionally, then they can of course know each other’s additional specification, and so know enough to conditionally prove each other’s cooperation. In this particular case, with proof length limits M and N, if they can both prove that faster than the lowest time limit, the fact that the time limits are different shouldn’t matter (other than as a technical issue in the proofs).
I expect most reasonable implementations of ADT agents with insane amount of computational resources to act correctly, for reasons analogous to those given in cousin_it’s proof (there are some simplifications in the proof that generalize better).
I don’t see what consistency has to do with anything here. The problem with possible implementations of ADT agents I pointed out is not caused by unsoundness of their inference system with respect to the intended semantics. All the statements those agents prove are true.
I certainly haven’t proved that any particular reasonable ADT agents don’t cooperate. But I have convinced myself that cousin_it’s argument is unlikely to generalize; I don’t see how they could prove conditionally the other would coordinate, because if their axioms are inconsistent the guy with the longer proof length may, even if he searches for proofs in a very natural way, prove that defecting is unconditionally better than cooperating.
Again, that’s not the problem. ADT agents can just make the decision the moment they find a complete collection of moral arguments (i.e. a utility function, in the sense of my post), which means that if a given utility function composed of agent-provable moral arguments is found, the decision follows from that.
(If the moral arguments have the form [A=A1 ⇒ U=U1], then there can be only one such agent-provable moral argument for each A1, so once the agent found one for each A1, it can as well stop looking. Consequences appear consistent.)
Consider a pair (A,B) of provable (but not necessarily agent-provable) moral arguments, for example A=”one-box ⇒ utility=1000000“ and B=”two-box ⇒ utility=1000”. These are statements provable (true) in the intended semantics, which means that if an agent manages to prove B’=”two-box ⇒ utility=1001000“, and two-box as a result of finding (A,B’), its inference system must be inconsistent (or something). But there is a different way in which an agent can decide to two-box, not requiring an inconsistency. It can just find a moral argument A’=”one-box ⇒ utility=23” before finding A, and decide to choose two-boxing based on (A’,B). Both A’ and B would be true.
(I know I’m reiterating the same thing I said before, but you haven’t signaled that you’ve understood the argument, and you keep repeating something about inconsistency that I don’t understand the relevance of, without answering my questions about that.)
I now understand what you are saying; thank you for being patient.
What I am confused by now is your optimism. I presented an argument based on incompleteness which convinced me that reasonable ADT agents won’t cooperate. In response you pointed out that incompleteness isn’t really the problem—there are other failure modes anyway. So why is it that you believe “reasonable” ADT agents will in fact cooperate, when I (unaware of this other failure mode) already believed they wouldn’t?
Part of what convinces me that something bad is going on is the following. Consider an agent in Newcomb’s problem who searches exhaustively for a proof that two-boxing is better than one-boxing; if he finds one he will two-box, and if he can’t he will one-box by default. By an argument like the one given in AI coordination in practice, this agent will two-box.
Your optimism seems to depend heavily on the difference between “enumerating all proofs up to a certain length and looking for the best provable utility guarantees” and “enumerating proofs until you find a complete set of moral arguments, and behaving randomly if you can’t.” Why do you believe that a complete set of moral arguments is provable in reasonable situations? Do you know of some non-trivial example where this is really the case?
Yes, and this answers your preceding question:
The strategy of “enumerating proofs until you find a complete set of moral arguments” doesn’t suffer from the incompleteness issue (whatever it is, if it’s indeed there, which I doubt can have the simple form you referred to).
I don’t believe it is provable in any reasonable time, but perhaps given enough time it can often be proven. Building a set of mathematical tools for reasoning about this might prove a fruitful exercise, but I have shelved this line of inquiry for the last few months, and wasn’t working on it.