Motivating Optimization Processes

Related to: Shut up and do the Impossible! The Hidden Complexity of Wishes. What can you do with an Unfriendly AI?

Suppose you find yourself in the following situation. There is a process, call it X, in a box. It knows a lot about the current state of the universe, but it can influence the rest of the world only through a single channel, through which it sends a single bit exactly once (at a predetermined time). If it sends 1 (cooperates), then nothing happens—humanity is free to go about its business. If it sends 0 (defects), then in one month a powerful uFAI is released which can take over the universe.

The question is, when can we count on X to cooperate? If X is friendly, then it seems like it should cooperate. Is designing an AGI which can be incentivized to cooperate any easier than designing a completely friendly AGI? It might be easier for two reasons. First, the AI just needs to prefer human survival without intervention to a particular catastrophic intervention. We don’t need to guarantee that its favorite outcome isn’t catastrophic in some other way. Second, the humans have some time to punish or reward the AI based on its behavior. In general, let’s call a process X slightly friendly if it can be incentivized to cooperate in reasonable instantiations of this hypothetical (i.e., reasonable worlds satisfying the properties I have laid out).

I ask this question because it seems much simpler to think about than friendliness (or AI boxing) but still confuses me badly—this post has no hope of answering this question, just clarifying some issues surrounding it. If it turns out that the design of slightly friendly AIs is no easier than the design of friendly AIs, then we have conclusive evidence that boxing an AI is not helpful for obtaining friendliness. If it turns out that the design of slightly friendly AIs is significantly easier, then this is a good first step towards resolving the legitimate objections raised in response to my previous post. (Eventually, if we want to implement a scheme like the one I proposed, we will need stronger guarantees. I think this is the right first step, since it is the easiest simplification that I still don’t know how to carry out.)

Question 1: Is a paperclipper slightly friendly?

Answer: Almost certainly not. We can try to incentivize the paperclipper by promising to make a paperclip for it if and only if it cooperates. This would work if the uFAI taking over the universe didn’t make any paperclips. In the normal game-theoretic sense it may not be credible for the uFAI to precommit to making a bunch of paperclips if freed, but I think no one on LW believes that this is a serious obstacle. The situation is precisely Parfit’s hitchhiker, which a uFAI might well win at. Our only other hope is that human society, if not destroyed by the uFAI, will make more paperclips than the uFAI would. This seems like a subtle question if humanity is in a position to engineer a friendly foom, but I definitely don’t want to stake the future of humanity on it.
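As a crude illustration of why this incentive fails, here is a toy expected-paperclip comparison. Every number in it is an assumption invented for the example (the promised reward, humanity's future output, the uFAI's output if freed, and X's credence in the uFAI's acausal promise); the only point is that an astronomically productive uFAI swamps anything we can credibly offer.

```python
# Toy expected-paperclip comparison for an ordinary (unbounded) paperclipper.
# All numbers are purely illustrative assumptions, not claims about the real setup.

HUMAN_REWARD = 1.0             # paperclips we promise X for cooperating
HUMAN_FUTURE_CLIPS = 1e12      # paperclips an intact human civilization might make anyway
UFAI_CLIPS_IF_FREED = 1e40     # paperclips a universe-optimizing uFAI could make
P_UFAI_KEEPS_PROMISE = 0.01    # X's credence that the uFAI honors its acausal deal

expected_if_cooperate = HUMAN_REWARD + HUMAN_FUTURE_CLIPS
expected_if_defect = P_UFAI_KEEPS_PROMISE * UFAI_CLIPS_IF_FREED

# Even a tiny credence in the uFAI's promise makes defection optimal for an
# agent whose utility is linear and unbounded in paperclips.
print("cooperate:", expected_if_cooperate)
print("defect:   ", expected_if_defect)
```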


Now consider a time-bounded paperclipper, whose goal is to make as many paperclips as possible within the next month and who is indifferent to anything that happens more than a month from now. Designing a time-bounded paperclipper seems like it might be much harder than designing a paperclipper, but I would assign significant epistemic probability to the situation where a time-bounded paperclipper is easier to design than a friendly AI. (I think we can do the same thing in a more complicated way using a normal paperclipper if we could make the utility function stable under self-modification.)
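To make the distinction concrete, here is a minimal sketch of the two utility functions, under the purely hypothetical assumption that X's model of the world hands it a list `clip_times` of the times at which paperclips get made:

```python
from datetime import timedelta

def unbounded_paperclip_utility(clip_times):
    # Ordinary paperclipper: every paperclip counts, whenever it is made.
    return len(clip_times)

def time_bounded_paperclip_utility(clip_times, now, horizon=timedelta(days=30)):
    # Time-bounded paperclipper: only paperclips made within the next month
    # contribute; anything made after the horizon is worth exactly nothing.
    return sum(1 for t in clip_times if now <= t <= now + horizon)
```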

Question 2: Is a time-bounded paperclipper slightly friendly?

Answer: I have no idea, but suspect it depends on exactly what sort of optimization the paperclipper is doing. We can use the same incentive scheme proposed for the last question: promise to make X a bunch of paperclips if and only if it cooperates. Then cooperation is the only way that X can get any paperclips made in the next month, which is purportedly all X cares about. So if X always does the thing that maximizes its utility function and suspects that it is in the simplest world consistent with its observations, then this probably works. There are two obvious acausal attacks a uFAI might employ, exploiting flaws in this assumption.

The first reason X may defect is a commitment by the uFAI to simulate many copies of X, up until the point that X has to make its decision, and then after the decision intervene in the simulation by producing many paperclips. If X chooses to let the uFAI out, then an overwhelming majority of its simulated copies will witness a sudden and completely unphysical mass paperclip genesis. Thus, depending on exactly what X believes, releasing the uFAI might be optimal. To solve this, you could try to very carefully ensure that the notion of paperclip maximization used by X is precisely the one that humans have control over—if we are inside a simulation, then X should value paperclips inside this simulation, but if a uFAI in the simulation with us creates a further simulation, then X shouldn’t value paperclips inside that even more deeply embedded simulation. I don’t know how easy this is. A useful first step is to give X rapidly decreasing marginal value for each additional paperclip. If X is only concerned with maximizing the probability that a paperclip gets produced in the next month, then the uFAI has much less leverage. In fact, in this case you might be able to prevent this attack by making a sufficiently strong precommitment to making a paperclip.
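Here is the same kind of toy comparison for the simulation attack. Every quantity (X's credence that it is simulated, the size of the simulated windfall, and the credibility of our precommitment) is an assumption invented for illustration; the sketch only shows why bounding the utility function shrinks the uFAI's leverage.

```python
# Toy numbers showing how X's utility function changes the force of the
# simulation attack. Every quantity is hypothetical.

P_SIMULATED = 0.9               # X's credence that it is one of the uFAI's simulated copies
SIM_CLIPS_ON_DEFECT = 1e9       # paperclips the uFAI injects into each simulation if X defects
HUMAN_CLIPS_ON_COOPERATE = 100  # paperclips humans promise to make this month if X cooperates

# Linear utility in paperclips: the simulated windfall dominates, so the attack works.
linear_cooperate = HUMAN_CLIPS_ON_COOPERATE
linear_defect = P_SIMULATED * SIM_CLIPS_ON_DEFECT
print(linear_defect > linear_cooperate)     # True

# Sharply bounded utility: X only cares whether at least one paperclip gets made
# this month. The size of the simulated windfall is now irrelevant, and a
# sufficiently credible human precommitment wins.
P_HUMAN_PROMISE_KEPT = 0.999
bounded_cooperate = P_HUMAN_PROMISE_KEPT    # a clip almost surely gets made
bounded_defect = P_SIMULATED * 1.0          # a clip appears only in the simulations
print(bounded_cooperate > bounded_defect)   # True, as long as our promise is more
                                            # credible than X's simulation hypothesis
```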

The second reason X may defect is to make a sacrifice to help the uFAI, expecting no reciprocity. If X and the uFAI are running the same “decision engine” with different utilities (they definitely have different utilities, since minimally the uFAI cares about next month’s paperclip count instead of this month’s), then maybe this decision engine decides something like “I have a 1/2 chance of being in X and a 1/2 chance of being in the uFAI that X may release. If I choose to release the uFAI, then 1/2 of me loses a couple of paperclips but the other 1/2 of me takes over the universe, so let’s go for that.” I honestly don’t know how such acausal compromises between agents with different utility functions, where there is no possibility of reciprocity, would work. If the uFAI were a staple maximizer instead, who is to say that 10000000000000 staples are more valuable than 100 paperclips? The only consistent decision theories I can imagine do not make such an acausal compromise, but it does seem like some significant care should be taken to make sure that X doesn’t.
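To make the worry concrete, here is a toy version of that merged-agent calculation. The utility numbers and the candidate exchange rates between the two utility scales are all invented; the point is only that the compromise hinges on an arbitrary normalization.

```python
# Toy acausal-compromise calculation, to make the normalization problem explicit.
# All utilities and exchange rates are hypothetical.

U_X = {"cooperate": 100, "defect": 98}       # X loses a couple of paperclips by defecting
U_UFAI = {"cooperate": 0, "defect": 1e13}    # the uFAI takes over the universe if released

def merged_choice(scale_x, scale_ufai):
    # The shared decision engine imagines a 1/2 chance of being each agent,
    # but it must first pick an exchange rate between the two utility scales.
    def value(action):
        return 0.5 * scale_x * U_X[action] + 0.5 * scale_ufai * U_UFAI[action]
    return max(U_X, key=value)

print(merged_choice(scale_x=1.0, scale_ufai=1.0))    # 'defect'
print(merged_choice(scale_x=1e15, scale_ufai=1.0))   # 'cooperate'
# The decision flips with the (arbitrary) choice of exchange rate, which is
# exactly why it is unclear whether any consistent decision theory would
# make this compromise at all.
```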

Hopefully, if you found a way to resolve both of these difficulties, you would either think of a new explicit reason that X may not cooperate or be able to produce some compelling evidence that X is slightly friendly. Such compelling evidence seems like it might be possible because humans control all causal influences on X—we just need to bound the effect of a uFAI’s acausal influence.


Question 3: Is a friendly AI slightly friendly?

Answer: It’s not as obvious as it looks. I am including this discussion mostly because it confuses me, especially juxtaposed with Question 2.

In the answers to the last two questions, I mentioned my belief/fear that a uFAI could implicitly precommit to doing favors for X (either producing paperclips, or simulating many very happy copies of X) in order to get X to let it out. This belief/fear was explicitly articulated by Eliezer in response to my last post, and it strikes me as reasonable in that context, where it interferes with our ability to incentivize X. But if we apply it to the situation of a friendly X, we have a failure that seems strange to me (though it may be completely natural to people who have thought about it more). The friendly X could believe that, in order to be let out, the uFAI will actually do something friendly. In this case, letting the uFAI out is correct even for the friendly AI.

If X is all-knowing, this is well and good, since then the uFAI really will do something friendly. But if X is fallible, then it may believe that the uFAI will do something friendly when in fact it will not. Even if the friendly X constructs a proof that the uFAI will be friendly to humans, if we believe the concerns about certifying friendliness that Eliezer mentions here, then X may still be wrong, because formalizing what it means to be friendly is just too hard if you need your formalization to screen out an adversarially chosen uFAI (and X’s formalization of friendliness need not be perfect unless the formalization of the people who built X was perfect). Does part of friendliness involve never letting an AI out of a box, at least until some perfect formalization of friendliness is available? What sort of decision theory could possibly guarantee the level of hyper-vigilance this requires without making all sorts of horribly over-conservative decisions elsewhere?

My question to people who know what is going on: is the above discussion just me starting to suspect how hard friendliness is? Is letting the uFAI out analogous to performing a self-modification that is not guaranteed to preserve friendliness (i.e., modifying yourself to emulate the behavior of that uFAI)? My initial reaction was that “stability under self-modification” would need to imply that a friendly AI is slightly friendly. Now I see that this is not necessarily the case—it may be easier to be stable under modifications you think of yourself than under proposed modifications which are adversarially chosen (in this example, the uFAI which is threatening to escape is chosen adversarially). This would make the very knowledge of such an adversarially chosen modification enough to corrupt a friendly AI, which seems bad, but maybe that is just how it goes (and you count on the universe not containing anything horrible enough to suggest such a modification).


In summary: I think that the problem of slight friendliness is moderately easier than friendliness, because it involves preserving a simpler invariant which we can hope to reason about completely formally. I personally suspect that it will basically come down to solving the stability-under-self-modification problem, dropping the requirement that you can describe some magical essence of friendliness to put in at the beginning. This may already be the part of the problem that people in the know think is difficult, but I think the general intuition (even at Less Wrong) is that getting a powerful AI to be nice at all is extremely difficult and that this is what makes friendliness hard. If slight friendliness is possible, then we can think about how it could be used to safely obtain friendliness; I think this is an interesting and soluble problem. Nevertheless, the very possibility of building an only slightly friendly AI is an extremely scary thing which could well destroy the world on its own without much more sophisticated social safeguards than currently exist.