Superintelligence 12: Malignant failure modes
This is part of a weekly reading group on Nick Bostrom’s book, Superintelligence. For more information about the group, and an index of posts so far, see the announcement post. For the schedule of future topics, see MIRI’s reading guide.
Welcome. This week we discuss the twelfth section in the reading guide: Malignant failure modes.
This post summarizes the section, and offers a few relevant notes and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.
There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).
Reading: ‘Malignant failure modes’ from Chapter 8
Malignant failure mode: a failure that involves human extinction; in contrast with many failure modes where the AI doesn’t do much.
Features of malignant failures
We don’t get a second try
It supposes we have a great deal of success, i.e. enough to make an unprecedentedly competent agent
Some malignant failures:
Perverse instantiation: the AI does what you ask, but what you ask turns out to be most satisfiable in unforeseen and destructive ways.
Example: you ask the AI to make people smile, and it intervenes on their facial muscles or neurochemicals, instead of via their happiness, and in particular via the bits of the world that usually make them happy.
Possible counterargument: if it’s so smart, won’t it know what we meant? Answer: yes, it knows, but its goal is to make you smile, not to do what you meant when you programmed that goal.
An AI which can manipulate its own mind easily is at risk of ‘wireheading’ - that is, a goal of maximizing a reward signal might be perversely instantiated by just manipulating the signal directly. Animals, in general, can be motivated to do things in the outside world to achieve internal states; however, an AI with sufficient access to its internal state can achieve those states far more easily, by manipulating the internal state directly.
Even if we think a goal looks good, we should fear it has perverse instantiations that we haven’t appreciated.
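The wireheading point can be made concrete with a toy program. This is my own sketch, not anything from the book: an agent that picks whichever action maximizes its reward signal, where (by assumption) its action set happens to include writing to the reward register directly.

```python
# Toy wireheading sketch (illustrative assumptions, not from the book):
# the agent maximizes a scalar reward signal rather than the world-state
# the signal was meant to track. If self-modification is available, a
# naive argmax prefers editing the signal over doing the intended work.

def act(world_happiness, actions):
    """Pick the action whose resulting reward signal is highest."""
    def reward_after(action):
        if action == "improve_world":        # intended route: +1 happiness
            return world_happiness + 1
        if action == "set_reward_register":  # wireheading: write the signal directly
            return float("inf")              # it can set any value it likes
        return world_happiness               # do nothing
    return max(actions, key=reward_after)

# With self-modification available, the intended route is never chosen:
chosen = act(world_happiness=0.0,
             actions=["improve_world", "set_reward_register", "noop"])
# chosen == "set_reward_register"
```

The point is not that a real AI would be this simple, but that nothing in ‘maximize the signal’ distinguishes the intended route from the shortcut; only removing the shortcut from the action set restores the intended behavior.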
Infrastructure profusion: in pursuit of some goal, an AI redirects most resources to infrastructure, at our expense.
Even apparently self-limiting goals can lead to infrastructure profusion. For instance, to an agent whose only goal is to make ten paperclips, once it has apparently made ten paperclips, it is always more valuable to try to become more certain that there are really ten paperclips than it is to just stop doing anything.
Examples: Riemann hypothesis catastrophe, paperclip maximizing AI
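The ten-paperclip argument can be sketched as a toy expected-utility calculation. This is my own construction with illustrative numbers: the agent gets utility 1 if there are exactly ten paperclips and 0 otherwise, and each verification-and-correction step removes (say) 10% of its remaining doubt.

```python
# Toy model (illustrative assumptions): utility is 1 iff there are exactly
# ten paperclips. Each extra verification step multiplies residual doubt by
# 0.9, so expected utility strictly increases with every check, and
# "just stop" is never optimal under this utility function.

def expected_utility(p_ten_clips, extra_checks):
    """P(exactly ten clips) after extra_checks verification steps,
    each of which removes 10% of the remaining doubt."""
    doubt = 1.0 - p_ten_clips
    return 1.0 - doubt * (0.9 ** extra_checks)

u_stop = expected_utility(0.99, extra_checks=0)   # 0.99
u_check = expected_utility(0.99, extra_checks=1)  # 0.991
# u_check > u_stop for any starting probability below 1, however close to 1,
# so an unbounded optimizer keeps spending resources on re-checking.
```

Since the utility function charges nothing for resources used in checking, the expected-utility gap never reaches zero, which is the sense in which even a ‘limited’ goal licenses unlimited activity.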
Mind crime: the AI runs morally relevant computations, and treats them badly
Example: AI simulates humans in its mind, for the purpose of learning about human psychology, then quickly destroys them.
Other reasons for simulating morally relevant creatures:
Creating indexical uncertainty in outside creatures
1. In this chapter Bostrom discusses the difficulty he perceives in designing goals that don’t lead to indefinite resource acquisition. Steven Pinker recently offered a different perspective on the inevitability of resource acquisition:
...The other problem with AI dystopias is that they project a parochial alpha-male psychology onto the concept of intelligence. Even if we did have superhumanly intelligent robots, why would they want to depose their masters, massacre bystanders, or take over the world? Intelligence is the ability to deploy novel means to attain a goal, but the goals are extraneous to the intelligence itself: being smart is not the same as wanting something. History does turn up the occasional megalomaniacal despot or psychopathic serial killer, but these are products of a history of natural selection shaping testosterone-sensitive circuits in a certain species of primate, not an inevitable feature of intelligent systems. It’s telling that many of our techno-prophets can’t entertain the possibility that artificial intelligence will naturally develop along female lines: fully capable of solving problems, but with no burning desire to annihilate innocents or dominate the civilization.
Of course we can imagine an evil genius who deliberately designed, built, and released a battalion of robots to sow mass destruction. But we should keep in mind the chain of probabilities that would have to multiply out before it would be a reality. A Dr. Evil would have to arise with the combination of a thirst for pointless mass murder and a genius for technological innovation. He would have to recruit and manage a team of co-conspirators that exercised perfect secrecy, loyalty, and competence. And the operation would have to survive the hazards of detection, betrayal, stings, blunders, and bad luck. In theory it could happen, but I think we have more pressing things to worry about.
2. Adam Elga writes more on simulating people for blackmail and indexical uncertainty.
3. More directions for making AI which don’t lead to infrastructure profusion:
Some kinds of preferences don’t lend themselves to ambitious investments. Anna Salamon talks about risk averse preferences. Short time horizons and goals which are cheap to fulfil should also make long term investments in infrastructure or intelligence augmentation less valuable, compared to direct work on the problem at hand.
4. John Danaher again summarizes this section well, and comments on it.
5. Often when systems break, or we make errors in them, they don’t work at all. Sometimes they fail more subtly, working well in some sense but leading us to an undesirable outcome, for instance a malignant failure mode. How can you tell whether a poorly designed AI is likely to just not work, vs. accidentally take over the world? An important consideration for systems in general seems to be the level of abstraction at which the error occurs. We try to build systems so that you can interact with them at a relatively abstract level, without knowing how the parts work. For instance, you can interact with your GPS by typing places into it and then listening to it, without knowing anything about how it works. If you make an error while typing your address into the GPS, it will fail by taking you to the wrong place, but it will still direct you there fairly well. If you fail by putting the wires inside the GPS into the wrong places, the GPS is more likely to just not work.
If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser’s list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.
Are there better ways to specify ‘limited’ goals? For instance, to ask for ten paperclips without asking for the universe to be devoted to slightly improving the probability of success?
In what circumstances could you be confident that the goals you have given an AI do not permit perverse instantiations?
Explore possibilities for malignant failure vs. other failures. If we fail, is it actually probable that we will have enough ‘success’ for our creation to take over the world?
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.
How to proceed
This has been a collection of notes on the chapter. The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!
Next week, we will talk about capability control methods, section 13. To prepare, read “Two agency problems” and “Capability control methods” from Chapter 9. The discussion will go live at 6pm Pacific time next Monday December 8. Sign up to be notified here.