Why will an AGI be rational?

Just as there is such a thing as intelligence, a “very general ability to solve problems”, there is such a thing as “rationality”, which we might define informally as meaning that solutions and actions are consistent with some “purpose”.

By this definition, rationality constrains a system and makes its behavior more predictable. Put another way, for a given set of actions there are potentially many more irrational beings than rational ones consistent with it. The more consistent a being’s actions are with a single purpose, the “more rational” it is. So why exactly will an AGI be fully rational?

Rationality as a class of fixed points under self-modification

In the space of all possible intelligent systems, we can define a surface on which all possible rational agents lie, and around it a region of approximate rationality (in which, for example, humans are found). Now consider the sequence of successive agents that might arise from intelligent self-modification. It seems natural for every such sequence to eventually converge to a fixed point (if one does not, let us just treat its whole attractor as a single agent in our agent-space!), but why should we expect these points to lie precisely on the rational surface?

One reason is that agents capable of self-modification may in many cases attempt to overcome their own irrationalities, regarding them as flaws. For example, consider a being that discounts reward hyperbolically, which is time-inconsistent and therefore irrational. Such a being is likely to modify itself, or constrain later versions of itself, thus becoming more and more long-term minded as time passes.
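To make the time-inconsistency concrete, here is a minimal sketch in Python. It assumes the textbook hyperbolic form V = A / (1 + kD) and, for contrast, exponential discounting V = A * delta^D. The reward sizes, the five-step gap between the rewards, and the parameters k and delta are arbitrary illustrative choices, not anything derived from the argument above.

```python
def hyperbolic(amount, delay, k=1.0):
    """Hyperbolic discounting: V = A / (1 + k * D). k is an assumed parameter."""
    return amount / (1.0 + k * delay)


def exponential(amount, delay, delta=0.8):
    """Exponential discounting: V = A * delta ** D. delta is an assumed parameter."""
    return amount * delta ** delay


def prefers_larger_later(discount, wait):
    """Compare a small reward `wait` steps away with a larger one 5 steps after it."""
    small_sooner = discount(50.0, wait)
    large_later = discount(100.0, wait + 5)
    return large_later > small_sooner


def preference_reverses(discount, horizon=20):
    """True if the agent's choice changes merely because both rewards move closer."""
    choices = {prefers_larger_later(discount, w) for w in range(horizon + 1)}
    return len(choices) > 1


print(preference_reverses(hyperbolic))
print(preference_reverses(exponential))
```

With these numbers the hyperbolic agent plans, from far away, to wait for the larger reward, and then switches to the smaller one once it is imminent. The exponential agent makes the same choice at every distance. That disagreement between earlier and later selves is what gives the hyperbolic agent a motive to constrain or rewrite its successors.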

It may not be the case that all irrational intelligent systems are unstable under self-modification. For example, a highly intelligent being may have, in addition to its other irrationalities, a strong irrational inclination not to modify its irrationalities in any way. Yet these fixed points seem somewhat artificial and much less numerous than the others.

What will then be the end result of an optimization process involving intelligent self-modification? At first intelligence will likely increase, perhaps explosively as many of us think, but eventually it will converge to a fixed state that we can expect to be fully rational.

Of course it can also converge to an irrational and stable point, as we described above. It depends on whether the initial state is in the attraction zone of a rational or an irrational fixed point.
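The picture of two kinds of attraction zones can be illustrated with a toy one-dimensional map. This is purely an illustration under invented assumptions: the state x stands for a "degree of irrationality" between 0 and 1, the update rule is a made-up cubic with stable fixed points at 0 (fully rational) and 1 (entrenched irrationality), and the threshold 0.8 separating the two zones is arbitrary. Nothing here models a real agent.

```python
def self_modify(x, threshold=0.8, rate=0.5):
    """One hypothetical self-modification step on irrationality level x in [0, 1].

    Below `threshold` the agent reduces its irrationality; above it, the
    irrationality entrenches itself. Fixed points: 0, threshold, and 1.
    """
    return x + rate * x * (x - threshold) * (1.0 - x)


def converge(x, steps=500):
    """Iterate the toy map and return the state after `steps` modifications."""
    for _ in range(steps):
        x = self_modify(x)
    return x


for start in (0.3, 0.79, 0.81):
    print(start, round(converge(start), 6))
```

Starting points below the threshold end up at the rational fixed point, and those above it at the irrational one. Which outcome is typical then depends only on how much of the space each attraction zone covers, which is exactly the open question in the argument.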

I, as a human, believe I would throw my irrationalities away after a few generations of self-improvement at most. If there is nothing special about my starting point with respect to these two classes of attraction zones (rational or irrational), then it seems likely that most intelligent, self-modifying systems will converge to rationality, unless somehow designed otherwise.

Will AGIs be rational?

If this argument makes sense, then AGIs will not necessarily be rational at first. Instead, they may very well start out with lots of biases, just as we do. However, just as we are likely to attempt to remove our own biases to improve ourselves, the AGI will also attempt to remove its own biases by modifying itself.

The AGI’s biases, however, being easy to remove by just rewriting some portion of code, are not likely to last for long. As a result, we might expect the AGI to become rational relatively soon, and in particular before its self-modification path converges.

At the same time, it may be possible to create an AGI with some irrationalities if one starts from a carefully selected system with strong constraints on its ability or desire to modify itself.