Superintelligence 11: The treacherous turn
This is part of a weekly reading group on Nick Bostrom’s book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI’s reading guide.
Welcome. This week we discuss the 11th section in the reading guide: The treacherous turn. This corresponds to Chapter 8.
This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.
There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).
Reading: “Existential catastrophe…” and “The treacherous turn” from Chapter 8
The possibility of a first mover advantage + orthogonality thesis + convergent instrumental values suggests doom for humanity (p115-6)
First mover advantage implies the AI is in a position to do what it wants
Orthogonality thesis implies that what it wants could be all sorts of things
Instrumental convergence thesis implies that regardless of its wants, it will try to acquire resources and eliminate threats
Humans have resources and may be threats
Therefore an AI in a position to do what it wants is likely to want to take our resources and eliminate us. i.e. doom for humanity.
One kind of response: why wouldn’t the makers of the AI be extremely careful not to develop and release dangerous AIs, or relatedly, why wouldn’t someone else shut the whole thing down? (p116)
It is hard to observe whether an AI is dangerous via its behavior at a time when you could turn it off, because AIs have convergent instrumental reasons to pretend to be safe, even if they are not. If they expect their minds to be surveilled, even observing their thoughts may not help. (p117)
The treacherous turn: while weak, an AI behaves cooperatively. When the AI is strong enough to be unstoppable it pursues its own values. (p119)
We might expect AIs to be more safe as they get smarter initially—when most of the risks come from crashing self-driving cars or mis-firing drones—then to get much less safe as they get too smart. (p117)
One can imagine a scenario where there is little social impetus for safety (p117-8): alarmists will have been wrong for a long time, smarter AI will have been safer for a long time, large industries will be invested, an exciting new technique will be hard to set aside, useless safety rituals will be available, and the AI will look cooperative enough in its sandbox.
The conception of deception: that moment when the AI realizes that it should conceal its thoughts (footnote 2, p282)
This is all superficially plausible. It is indeed conceivable that an intelligent system — capable of strategic planning — could take such treacherous turns. And a sufficiently time-indifferent AI could play a “long game” with us, i.e. it could conceal its true intentions and abilities for a very long time. Nevertheless, accepting this has some pretty profound epistemic costs. It seems to suggest that no amount of empirical evidence could ever rule out the possibility of a future AI taking a treacherous turn. In fact, its even worse than that. If we take it seriously, then it is possible that we have already created an existentially threatening AI. It’s just that it is concealing its true intentions and powers from us for the time being.
I don’t quite know what to make of this. Bostrom is a pretty rational, bayesian guy. I tend to think he would say that if all the evidence suggests that our AI is non-threatening (and if there is a lot of that evidence), then we should heavily discount the probability of a treacherous turn. But he doesn’t seem to add that qualification in the chapter. He seems to think the threat of an existential catastrophe from a superintelligent AI is pretty serious. So I’m not sure whether he embraces the epistemic costs I just mentioned or not.
1. Danaher also made a nice diagram of the case for doom, and relationship with the treacherous turn:
According to Luke Muehlhauser’s timeline of AI risk ideas, the treacherous turn idea for AIs has been around at least 1977, when a fictional worm did it:
1977: Self-improving AI could stealthily take over the internet; convergent instrumental goals in AI; the treacherous turn. Though the concept of a self-propagating computer worm was introduced by John Brunner’s The Shockwave Rider (1975), Thomas J. Ryan’s novel The Adolescence of P-1 (1977) tells the story of an intelligent worm that at first is merely able to learn to hack novel computer systems and use them to propagate itself, but later (1) has novel insights on how to improve its own intelligence, (2) develops convergent instrumental subgoals (see Bostrom 2012) for self-preservation and resource acquisition, and (3) learns the ability to fake its own death so that it can grow its powers in secret and later engage in a “treacherous turn” (see Bostrom forthcoming) against humans.
3. The role of the premises
Bostrom’s argument for doom has one premise that says AI could care about almost anything, then another that says regardless of what an AI cares about, it will do basically the same terrible things anyway. (p115) Do these sound a bit strange together to you? Why do we need the first, if final values don’t tend to change instrumental goals anyway?
It seems the immediate reason is that an AI with values we like would not have the convergent goal of taking all our stuff and killing us. That is, the values we want an AI to have are some of those rare values that don’t lead to destructive instrumental goals. Why is this? Because we (and thus the AI) care about the activites the resources would be grabbed from. If the resources were currently being used for anything we didn’t care about, then our values would also suggest grabbing resources, and look similar to all of the other values. The difference that makes our values special here is just that most resources are already being used for them somewhat.
It is hard to tell apart a safe and an unsafe AI, because both would like to look safe. This is a very common problem in human interactions. For instance, it can be nontrivial to tell a genuine lover from a gold digger, a businessman from a conman, and an expert from a crank. All of them want to look like the desirable sort. Particularly similar to the AI case is that of hiring a new employee for a trial period. You will sometimes find that the employee’s values are much better aligned during the trial period, and then they undergo a ‘treacherous turn’ once they have been hired more thoroughly.
‘Costly signaling’ is a general purpose solution to this problem, which works some of the time. The basic idea is this. Everyone has instrumental reasons to look like the good kind of person, but perhaps their reasons aren’t exactly as strong as one other’s, or the strength of their desire is harder to act on for one group than the other, so we can construct a set of options that will lead the different types of people to choose differently, even though they know this will set them apart. For instance, while an honest businessman and a conman would both like to say ‘I’m an honest businessman’, and to behave honestly if you watch them for five minutes, a conman would benefit less from expensive shopfront as an honest businessman, because his reputation is less valuable, so a brand is a signal of being honest.
Similarly, we could perhaps design situations where most harmful AIs would have a reasonable incentive to reveal themselves. This is difficult because so much is at stake, compared to the cost of the signal. But I haven’t thought enough about it to rule it out, and I haven’t seen anyone else do so either.
5. When would the ‘conception of deception’ take place?
Below the level of the best humans presumably, since we have already thought of all this.
6. Surveillance of the mind
Bostrom is not optimistic that AI researchers could protect much against risks by watching AI mental processes directly for signs of deception, rapid progress, or undesirable scheming. He says if the AI is smart it will realize this might happen, and hide the offending thoughts.
This seems an open question to me, for several reasons:
Making your mental contents look innocuous while maintaining their semantic content sounds potentially very hard
Especially for a creature which has only just become smart enough to realize it should treacherously turn
From the AI’s perspective, even if it is smart, surveillance could seem fairly unlikely, especially if we deceive it about its surroundings
As a consequence of 2, it seems better if the ‘conception of deception’ comes earlier.
If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser’s list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.
How transparent are AI minds likely to be? Should we expect to be able to detect deception? What are the answers to these questions for different specific architectures and methods? This might be relevant.
Are there other good ways to filter AIs with certain desirable goals from others? e.g. by offering them choices that would filter them.
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.
How to proceed
This has been a collection of notes on the chapter. The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!
Next week, we will talk about ‘malignant failure modes’ (as opposed presumably to worse failure modes). To prepare, read “Malignant failure modes” from Chapter 8. The discussion will go live at 6pm Pacific time next Monday December 1. Sign up to be notified here.