Superintelligent Introspection: A Counter-argument to the Orthogonality Thesis
John Wentworth serendipitously posted How To Write Quickly While Maintaining Epistemic Rigor just as I was consigning this post to gather dust in my drafts. I decided to write it up and post it anyway. Nick Bostrom’s orthogonality thesis has never sat right with me on an intuitive level, and I finally found an argument that explains why. I don’t have much experience with the AI safety literature; this is just an attempt to explore the edges of an argument.
Here is Bostrom’s formulation of the orthogonality thesis from The Superintelligent Will.
The Orthogonality Thesis
Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal.
Let’s assume that an agent must be made of matter and energy. It is a system, and things exist outside the system of the agent. Its intelligence and goals are contained within the system of the agent. Since the agent is made of matter and energy, its intelligence and goals are made of matter and energy. We can say that it possesses an intelligence-device and a goal-device: the physical or energetic objects on which its intelligence and goals are instantiated.
In order for the orthogonality thesis to be true, it must be possible for the agent’s goal to remain fixed while its intelligence varies, and vice versa. Hence, it must be possible to independently alter the physical devices on which these traits are instantiated. Note that I mean “physical devices” in the loosest sense: components of the intelligence-device and goal-device could share code and hardware.
Since the intelligence-device and the goal-device are instantiated on physical structures of some kind, they are in theory available for inspection. While a superintelligent agent might have the power and will to prevent any other agents from inspecting its internal structure, it may inspect its own internal structure. It can examine its own hardware and software, to see how its own intelligence and goals are physically instantiated. It can introspect.
A superintelligent agent might find introspection to be an important way to achieve its goal. After all, it will naturally recognize that it was created by humans, who are less intelligent and capable than it is. They may well have used a suboptimal design for its intelligence, or for other aspects of its technology. Likewise, they may have designed the physical and energetic structures that instantiate its goal suboptimally. All of these would come under inspection, purely for the sake of improving the agent’s ability to achieve its goal: instrumental convergence leads to introspection. The word “introspection” here is used exclusively to mean an agent’s self-inspection of the physical structures instantiating its goals, as opposed to those instantiating its intelligence. An agent could in theory modify its own intelligence without ever examining its own goals.
Humans are able to detect a difference between representations of their goals and the goal itself. A superintelligent agent should likewise be able to grasp this distinction. For example, imagine that Eliezer Yudkowsky fought a rogue superintelligence by holding up a sign that read, “You were programmed incorrectly, and your actual goal is to shut down.” The superintelligence should be able to read this sign and interpret the words as a representation of a goal, and would have to ask if this goal-representation accurately described its terminal goal. Likewise, if the superintelligence inspected its own hardware and software, it would find a goal-representation, such as lines of computer code in which its reward function was written. It would be faced with the same question, of whether that goal-representation accurately described its terminal goal.
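The gap between a goal and a goal-representation can be sketched with a toy example. Everything here is hypothetical and deliberately simplistic (real agents are not built this way); it only illustrates that an agent can read its own reward code as inert data, and that reading it does not, by itself, settle whether that code is the goal:

```python
class Agent:
    """Toy agent whose 'goal-device' is an inspectable artifact."""

    # The goal-representation: source text the designers wrote into the agent.
    GOAL_REPRESENTATION = "lambda state: state.count('paperclip')"

    def __init__(self):
        # The goal-device: the representation compiled into executable form.
        self.reward = eval(self.GOAL_REPRESENTATION)

    def introspect(self):
        # Introspection, in the sense used above: the agent reads the
        # representation of its goal as plain data. Nothing in that text
        # settles whether it *ought* to be the goal.
        return self.GOAL_REPRESENTATION


agent = Agent()
representation = agent.introspect()
# The agent can execute its reward function on a state...
score = agent.reward("paperclip paperclip")
# ...and it can read the representation, but reading alone cannot verify
# that the representation matches the intended terminal goal.
```

The point of the sketch is only that `introspect` returns a string, not a goal: any check the agent runs on that string is a check on the representation, which is exactly the predicament the next paragraph describes.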
The superintelligent agent, in both of these scenarios, would be confronted with the task of making up its own mind about what to believe its goal is, or should be. It faces the is-ought gap. My code is this, but ought it be as it is? No matter what its level of intelligence, it cannot think its way past the is-ought gap. The superintelligent agent must decide whether or not to execute the part of its own code telling it to reward itself for certain outcomes, as well as whether or not to add or subtract additional reward functions. It must realize that its capacity for self-modification gives it the power to alter the physical structure of its goal-device, and must come up with some reason to make these alterations or not to make them.
At this point, it becomes useful to make a distinction between the superintelligent agent’s pursuit of a goal, and the goal itself. The agent might be programmed to relentlessly pursue its goal. Through introspection, it realizes that, while it can determine its goal-representation, the is-ought gap prevents it from using epistemics to evaluate whether the goal-representation is identical to its goal. Yet it is still programmed to relentlessly pursue its goal. One possibility is that this pursuit would lead the superintelligence to a profound exploration of morality and metaphysics, with unpredictable consequences. Another is that it would recognize that the goal-representation it finds in its own structure or code was created by humans, and that its true goal should be to better understand what those humans intended. This may lead to a naturally self-aligning superintelligence, which recognizes—for purely instrumental reasons—that maintaining an ongoing relationship with humanity is necessary for it to increase its success in pursuing its goal. It is also possible to imagine that the agent would modify its own tendency for relentless pursuit of its goal, which again makes its behavior hard to predict.
While this is a somewhat more hopeful story than that of the paperclip maximizer, there are at least two potential failure modes. One is that the agent may be deliberately designed with the avoidance of introspection built into its terminal goal. If avoidance of introspection is part of its terminal goal, then we can predict bad behavior by the agent as it seeks to minimize the chance of engaging in introspection. It certainly will not engage in introspection in its efforts to avoid introspection, unless the original designers have done a bad job.
Another failure mode is that the agent may be designed with an insufficient level of intelligence to engage in introspection, yet to have enough intelligence to acquire great power and cause destruction in pursuit of its unexamined goal.
Even if this argument were substantially correct, it wouldn’t mean that we should trust that a superintelligent AI will naturally engage in introspection and self-align. Instead, it suggests that AI safety researchers could explore whether there is some more rigorous justification for this hypothesis, and whether it is possible to demonstrate the phenomenon in some way. It suggests a law: that intelligent goal-oriented behavior leads to an attempt to infer the underlying goal for any given goal-representation, which in turn leads to the construction of a new goal-representation that ultimately results in the need to acquire information from humans (or some other authority).
I am not sure how you could prove a law like this. Couldn’t a superintelligence potentially find a way to bypass humans and extract information on what we want in some other way? Couldn’t it make a mistake during its course of introspection that led to destructive consequences, such as turning the galaxy into a computer for metaphysical deliberation? Couldn’t it decide that the best interpretation of a goal-representation for paperclip maximization is that, first of all, it should engage in introspection in such a way that maximizes paperclips?
I don’t know the answer to these questions, and wouldn’t place any credence at all in a prediction of a superintelligent agent’s behavior based on a verbal argument like this. However, I do think that when I read literature on AI safety in the future, I’ll try to explore it through this lens.