Vingean Reflection: Reliable Reasoning for Self-Improving Agents
I’m pleased to announce a new paper from MIRI: Vingean Reflection: Reliable Reasoning for Self-Improving Agents.
Today, human-level machine intelligence is in the domain of futurism, but there is every reason to expect that it will be developed eventually. Once artificial agents become able to improve themselves further, they may far surpass human intelligence, making it vitally important to ensure that the result of an “intelligence explosion” is aligned with human interests. In this paper, we discuss one aspect of this challenge: ensuring that the initial agent’s reasoning about its future versions is reliable, even if these future versions are far more intelligent than the current reasoner. We refer to reasoning of this sort as Vingean reflection.
A self-improving agent must reason about the behavior of its smarter successors in abstract terms, since if it could predict their actions in detail, it would already be as smart as them. This is called the Vingean principle, and we argue that theoretical work on Vingean reflection should focus on formal models that reflect this principle. However, the framework of expected utility maximization, commonly used to model rational agents, fails to do so. We review a body of work which instead investigates agents that use formal proofs to reason about their successors. While it is unlikely that real-world agents would base their behavior entirely on formal proofs, this appears to be the best currently available formal model of abstract reasoning, and work in this setting may lead to insights applicable to more realistic approaches to Vingean reflection.
This is the fourth in a series of six papers discussing various components of MIRI’s technical research agenda. It motivates the field of Vingean reflection, which studies methods by which agents can reason reliably about agents that are more intelligent than themselves. Toy models previously used to study this problem include the “tiling agent” models that have been discussed on LessWrong. The introduction to the paper runs as follows:
In a 1965 article, I.J. Good introduced the concept of an “intelligence explosion” (Good 1965):
Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion,’ and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make.
Almost fifty years later, a machine intelligence that is smart in the way humans are remains the subject of futurism and science fiction. But barring global catastrophe, there seems to be little reason to doubt that humanity will eventually create a smarter-than-human machine. Whether machine intelligence can really leave the intelligence of biological humans far behind is less obvious, but there is some reason to think that this may be the case (Bostrom 2014): First, the hardware of human brains is nowhere close to physical limits; and second, not much time has passed on an evolutionary timescale since humans developed language, suggesting that we possess the minimal amount of general intelligence necessary to develop a technological civilization, not the theoretical optimum.
It’s not hard to see that if building an artificial superintelligent agent will be possible at some point in the future, this could be both a great boon to humanity and a great danger if this agent does not work as intended (Bostrom 2014, Yudkowsky 2008). Imagine, for example, a system built to operate a robotic laboratory for finding a cure for cancer; if this is its only goal, and the system becomes far smarter than any human, then its best course of action (to maximize the probability of achieving its goal) may well be to convert all of Earth into more computers and robotic laboratories—and with sufficient intelligence, it may well find a way to do so. This argument generalizes, of course: While there is no reason to think that an artificial intelligence would be driven by human motivations like a lust for power, any goals that are not quite ours would place it at odds with our interests.
How, then, can we ensure that self-improving smarter-than-human machine intelligence, if and when it is developed, is beneficial to humanity?
Extensive testing may not be sufficient. A smarter-than-human agent would have an incentive to pretend during testing that its goals are aligned with ours, even if they are not, because we might otherwise attempt to modify it or shut it down (Bostrom 2014). Hence, testing would only give reliable information if the system is not yet sufficiently intelligent to deceive us. If, at this point, it is also not yet intelligent enough to realize that its goals are at odds with ours, a misaligned agent might pass even very extensive tests.
Moreover, the test environment may be very different from the environment in which the system will actually operate. It may be infeasible to set up a testing environment which allows a smarter-than-human system to be tested in the kinds of complex, unexpected situations that it might encounter in the real world as it gains knowledge and executes strategies that its programmers never conceived of.
For these reasons, it seems important to have a theoretical understanding of why the system is expected to work, so as to gain high confidence in a system that will face a wide range of unanticipated challenges (Soares and Fallenstein 2014a). By this we mean two things: (1) a formal specification of the problem faced by the system; and (2) a firm understanding of why the system (which must inevitably use practical heuristics) is expected to perform well on this problem.
It may seem odd to raise these questions today, with smarter-than-human machines still firmly in the domain of futurism; we can hardly verify that the heuristics employed by an artificial agent work as intended before we even know what these heuristics are. However, Soares and Fallenstein (2014a) argue that there is foundational research we can do today that can help us understand the operation of a smarter-than-human agent on an abstract level.
For example, although the expected utility maximization framework of neoclassical economics has serious shortcomings in describing the behavior of a realistic artificial agent, it is a useful starting point for asking whether it’s possible to avoid giving a misaligned agent incentives for manipulating its human operators (Soares 2015). Similarly, it allows us to ask what sorts of models of the environment would be able to deal with the complexities of the real world (Hutter 2000). Where this framework falls short, we can ask how to extend it to capture more aspects of reality, such as the fact that an agent is a part of its environment (Orseau 2012), and the fact that a real agent cannot be logically omniscient (Gaifman 2004, Soares and Fallenstein 2015). Moreover, even when more realistic models are available, simple models can clarify conceptual issues by idealizing away difficulties not relevant to a particular problem under consideration.
In this paper, we review work on one foundational issue that would be particularly relevant in the context of an intelligence explosion—that is, if humanity does not create a superintelligent agent directly, but instead creates an agent that attains superintelligence through a sequence of successive self-improvements. In this case, the resulting superintelligent system may be quite different from the initial verified system. The behavior of the final system would depend entirely upon the ability of the initial system to reason correctly about the construction of systems more intelligent than itself.
This is no trouble if the initial system is extremely reliable: if the reasoning of the initial agent were at least as good as a team of human AI researchers in all domains, then the system itself would be at least as safe as anything designed by a team of human researchers. However, if the system were only known to reason well in most cases, then it seems prudent to verify its reasoning specifically in the critical case where the agent reasons about self-modifications.
At least intuitively, reasoning about the behavior of an agent which is more intelligent than the reasoner seems qualitatively more difficult than reasoning about the behavior of a less intelligent system. Verifying that a military drone obeys certain rules of engagement is one thing; verifying that an artificial military general would successfully run a war, identifying clever strategies never before conceived of and deploying brilliant plans as appropriate, seems like another thing entirely. It is certainly possible that this intuition will turn out to be wrong, but it seems as if we should at least check: if extremely high confidence must be placed on the ability of self-modifying systems to reason about agents which are smarter than the reasoner, then it seems prudent to develop a theoretical understanding of satisfactory reasoning about smarter agents. In honor of Vinge (1993), who emphasizes the difficulty of predicting the behavior of smarter-than-human agents with human intelligence, we refer to reasoning of this sort as Vingean reflection.
The simplest and cleanest formal model of intelligent agents is the framework of expected utility maximization. Given that this framework has been a productive basis for theoretical work both in artificial intelligence in general, and on smarter-than-human agents in particular, it is natural to ask whether it can be used to model the reasoning of self-improving agents.
However, although it can be useful to consider models that idealize away part of the complexity of the real world, it is not difficult to see that in the case of self-improvement, expected utility maximization idealizes away too much. An agent that can literally maximize expected utility is already reasoning optimally; it may lack information about its environment, but it can only fix this problem by observing the external world, not by improving its own reasoning processes.
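For reference, the idealization at issue can be written down in one line. This is a sketch in standard textbook notation; the symbols A (available actions), O (outcomes), P (the agent's probability distribution) and U (its utility function) are assumptions introduced here, not notation from the paper.

```latex
\[
  a^{*} \;=\; \operatorname*{arg\,max}_{a \in A} \; \mathbb{E}\,[\,U \mid a\,]
        \;=\; \operatorname*{arg\,max}_{a \in A} \; \sum_{o \in O} P(o \mid a)\, U(o)
\]
```

The definition simply assumes that the maximizing action is found exactly. Nothing in it represents the cost or quality of the computation that finds it, so an agent described this way has no reasoning process left to improve.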
A particularly illustrative example of the mismatch between the classical theory and the problem of Vingean reflection is provided by the standard technique of backward induction, which finds the optimal policy of an agent facing a sequential decision problem by considering every node in the agent’s entire decision tree. Backward induction starts with the leaves, figuring out the action an optimal agent would take in the last timestep (for every possible history of what happened in the previous timesteps). It then proceeds to compute how an optimal agent would behave in the second-to-last timestep, given the behavior in the last timestep, and so on backward to the root of the decision tree.
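The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the node classes (`Leaf`, `Decision`, `Chance`), the `obs0`/`obs1` labels for chance outcomes, and the example tree are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Leaf:
    utility: float  # payoff received when the episode ends here


@dataclass
class Decision:
    children: Dict[str, "Node"]  # action name -> resulting subtree


@dataclass
class Chance:
    outcomes: List[Tuple[float, "Node"]]  # (probability, resulting subtree)


Node = Union[Leaf, Decision, Chance]
History = Tuple[str, ...]


def backward_induction(node: Node, history: History = ()):
    """Return (expected utility, policy) for the subtree rooted at `node`.

    The policy maps the history leading to each decision node to the
    action chosen there. It covers every decision node in the tree,
    including nodes the optimal agent will never actually reach.
    """
    if isinstance(node, Leaf):
        return node.utility, {}

    policy: Dict[History, str] = {}

    if isinstance(node, Chance):
        expected = 0.0
        for i, (prob, child) in enumerate(node.outcomes):
            value, sub_policy = backward_induction(child, history + (f"obs{i}",))
            expected += prob * value
            policy.update(sub_policy)
        return expected, policy

    # Decision node: solve every child first, then pick the best action.
    best_action, best_value = None, float("-inf")
    for action, child in node.children.items():
        value, sub_policy = backward_induction(child, history + (action,))
        policy.update(sub_policy)
        if value > best_value:
            best_action, best_value = action, value
    policy[history] = best_action
    return best_value, policy


# Hypothetical two-step problem: take a sure payoff, or gamble and then
# decide again after seeing which of two situations arose.
tree = Decision({
    "safe": Leaf(1.0),
    "risky": Chance([
        (0.5, Decision({"stop": Leaf(0.0), "go": Leaf(3.0)})),
        (0.5, Decision({"stop": Leaf(0.5), "go": Leaf(-2.0)})),
    ]),
})

value, policy = backward_induction(tree)
print(value)   # expected utility of acting optimally from the root
print(policy)  # chosen action at every decision node, keyed by history
```

Note that the policy returned at the root already fixes the agent's action after every possible observation. The first-timestep agent has therefore computed everything its later selves will do.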
A self-improving agent is supposed to become more intelligent as time goes on. An agent using backward induction to choose its action, however, would have to compute its exact actions in every situation it might face in the future in the very first timestep—but if it is able to do that, its initial version could hardly be called less intelligent than the later ones!
Since we are interested in theoretical understanding, the reason we see this as a problem is not that backward induction is impractical as an implementation technique. For example, we may not actually be able to run an agent which uses backward induction (since this requires effort exponential in the number of timesteps), but it can still be useful to ask how such an agent would behave, say in a situation where it may have an incentive to manipulate its human operators (Soares 2015). Rather, the problem is that we are trying to understand conceptually how an agent can reason about the behavior of a more intelligent successor, and an “idealized” model that requires the original agent to already be as smart as its successors seems to idealize away the very issue we are trying to investigate.
The programmers of the famous chess program Deep Blue, for example, couldn’t have evaluated different heuristics by predicting, in their own heads, where each heuristic would make Deep Blue move in every possible situation; if they had been able to do so, they would have been able to play world-class chess themselves. But this does not imply that they knew nothing about Deep Blue’s operation: their abstract knowledge of the code allowed them to know that Deep Blue was trying to win the game rather than to lose it, for example.
Like Deep Blue’s programmers, any artificial agent reasoning about smarter successors will have to do so using abstract reasoning, rather than by computing out what these successors would do in every possible situation. Yudkowsky and Herreshoff (2013) call this observation the Vingean principle, and it seems to us that progress on Vingean reflection will require formal models that implement this principle, instead of idealizing the problem away.
This is not to say that expected utility maximization has no role to play in the study of Vingean reflection. Intuitively, the reason the classical framework is unsuitable is that it demands logical omniscience: It assumes that although an agent may be uncertain about its environment, it must have perfect knowledge of all mathematical facts, such as which of two algorithms is more efficient on a given problem or which of two bets leads to a higher expected payoff under a certain computable (but intractable) probability distribution. Real agents, on the other hand, must deal with logical uncertainty (Soares and Fallenstein 2015). But many proposals for dealing with uncertainty about mathematical facts involve assigning probabilities to them, which might make it possible to maximize expected utility with respect to the resulting probability distribution.
However, while there is some existing work on formal models of logical uncertainty (see Soares and Fallenstein for an overview), none of the approaches the authors are aware of are models of abstract reasoning. It is clear that any agent performing Vingean reflection will need to have some way of dealing with logical uncertainty, since it will have to reason about the behavior of computer programs it cannot run (in particular, future versions of itself). At present, however, formal models of logical uncertainty do not yet seem up to the task of studying abstract reasoning about more intelligent successors.
In this paper, we review a body of work which instead considers agents that use formal proofs to reason about their successors, an approach first proposed by Yudkowsky and Herreshoff (2013). In particular, following these authors, we consider agents which will only perform actions (such as self-modifications) if they can prove that these actions are, in some formal sense, “safe.” We do not argue that this is a realistic way for smarter-than-human agents to reason about potential actions; rather, formal proofs seem to be the best formal model of abstract reasoning available at present, and hence currently the most promising vehicle for studying Vingean reflection.
There is, of course, no guarantee that results obtained in this setting will generalize to whatever forms of reasoning realistic artificial agents will employ. However, there is some reason for optimism: at least one such result (the procrastination paradox [Yudkowsky 2013], discussed in Section 4) both has an intuitive interpretation that makes it seem likely to be relevant beyond the domain of formal proofs, and has been shown to apply to one existing model of self-referential reasoning under logical uncertainty (Fallenstein 2014b).
The study of Vingean reflection in a formal logic framework also has merit in its own right. While formal logic is not a good tool for reasoning about a complex environment, it is a useful tool for reasoning about the properties of computer programs. Indeed, when humans require extremely high confidence in a computer program, they often resort to systems based on formal logic, such as model checkers and theorem provers (US DoD 1985; UK MoD 1991). Smarter-than-human machines attempting to gain high confidence in a computer program may need to use similar techniques. While smarter-than-human agents must ultimately reason under logical uncertainty, there is some reason to expect that high-confidence logically uncertain reasoning about computer programs will require something akin to formal logic.
The remainder of this paper is structured as follows. In the next section, we discuss in more detail the idea of requiring an agent to produce formal proofs that its actions are safe, and discuss a problem that arises in this context, the Löbian obstacle (Yudkowsky and Herreshoff 2013): Due to Gödel’s second incompleteness theorem, an agent using formal proofs cannot trust the reasoning of future versions using the same proof system. In Section 4, we discuss the procrastination paradox, an intuitive example of what can go wrong in a system that trusts its own reasoning too much. In Section 5, we introduce a concrete toy model of self-rewriting agents, and discuss the Löbian obstacle in this context. Section 6 reviews partial solutions to this problem, and Section 7 concludes.
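For readers who want the logical background to the Löbian obstacle, the two standard results involved can be stated as follows. This is a sketch in conventional notation that is assumed here rather than taken from the post: T is a consistent, recursively axiomatizable theory extending Peano Arithmetic, and the box abbreviates T's own provability predicate.

```latex
\begin{align*}
\text{Notation:}\quad
  & \Box_T \varphi \ \text{abbreviates ``$\varphi$ is provable in $T$''},
    \qquad \mathrm{Con}(T) := \neg \Box_T \bot \\[4pt]
\text{G\"odel's second incompleteness theorem:}\quad
  & T \nvdash \mathrm{Con}(T) \\[4pt]
\text{L\"ob's theorem:}\quad
  & \text{if } T \vdash \Box_T \varphi \rightarrow \varphi,
    \ \text{then } T \vdash \varphi
\end{align*}
```

An agent that wants to trust a successor reasoning in the same system T would like to prove, for every statement, that T-provability implies truth. By Löb's theorem, T proves such an instance only for statements it can already prove outright, so this blanket trust is not available inside T itself.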