AGI is likely to be cautious

According to Professor Stuart Russell, expressing a sentiment I have seen repeated often in the AI safety community:

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.

I no longer believe this to be obviously true. In fact, I think it is likely to be untrue in the real world, under nearly all realistic AGI-advent scenarios, because the unconstrained variables are only likely to be pushed to extreme values when the environment is perfectly known. In reality, it never is.
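To make the quoted behaviour concrete, here is a minimal sketch (my own toy example, not Russell's) of what it looks like when the environment is fully specified: an optimiser that values only paperclips, and shares a resource budget with a variable it was never told to care about, drives that variable to an extreme.

```python
# Toy illustration with made-up numbers: maximise paperclip output p while the
# objective says nothing about atmosphere a, even though both draw on a shared
# resource budget.
from scipy.optimize import linprog

c = [-1.0, 0.0]          # variables x = [p, a]; linprog minimises, so use -p
A_ub = [[1.0, 1.0]]      # shared resource constraint: p + a <= 100
b_ub = [100.0]
bounds = [(0, None), (0, 100)]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(result.x)  # [100.,   0.] -- a, absent from the objective, is pushed to an
                 # extreme value (zero) to free up resources for paperclips.
```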

Imagine you are a very smart agent trained to achieve some goal; I'll use maximising paperclip production as the running example for the rest of this post. To maximise paperclips, you need to reason over all possible futures of the universe. And, despite being very smart indeed, there's just a lot you don't know yet about the universe.

Even if you are way smarter than humanity and can commandeer, say, the entire solar system's resources for a grand paperclip-production plan, you may pause to consider whether putting the plan into action is a good idea before you have gathered more knowledge about the universe. What if there is unknown physics, in some distant corner of space or time, or at some scale you haven't yet fully understood, that makes this a highly suboptimal plan? What if other intelligent adversaries exist out there who might detect the power output of your plan and promptly annihilate you? What about the unknown unknowns?

Of course, it’s very possible that your model of the universe assigns some of these paperclip-production-x-risk scenarios such a low probability that you conclude your expected paperclip production is best served by going ahead with the current plan anyway. Yet it feels likely to me that a fair chunk of intelligent agents would:

a. Naturally be circumspect about setting the unconstrained variables to extreme values, particularly when doing so is irreversible.

b. Focus primarily on knowledge acquisition (and self-preservation, etc.), and only make paperclips using excess resources which they are very certain they can ‘spend’ without affecting the long-term, universe-spanning production of paperclips. (A toy expected-value sketch of this trade-off follows below.)
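To put rough numbers on that trade-off, here is a back-of-the-envelope expected-value comparison. All of the figures are made-up assumptions; the point is only that when the production horizon is long enough, even a tiny reduction in the probability of an unknown catastrophe outweighs a large up-front cost in forgone paperclips.

```python
# Toy value-of-information comparison; every number here is an illustrative assumption.
HORIZON = 1e12        # paperclips producible if nothing annihilates you
P_DOOM_NOW = 1e-4     # chance that acting immediately triggers an unknown catastrophe
P_DOOM_LATER = 1e-6   # residual chance after a period of careful investigation
DELAY_COST = 1e6      # paperclips forgone while investigating

ev_act_now  = (1 - P_DOOM_NOW) * HORIZON
ev_cautious = (1 - P_DOOM_LATER) * HORIZON - DELAY_COST

print(ev_act_now, ev_cautious, ev_cautious > ev_act_now)
# With a long enough horizon, the cautious policy wins: the tiny reduction in
# catastrophe risk is worth far more than the paperclips lost to the delay.
```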

What does this mean for our x-risk?

It probably doesn’t make our survival any more likely: it seems plausible that a sufficiently intelligent AI could ‘hedge its bets’ on preserving humanity by storing all our DNA information, or find some alternative means of restoring our exact state if it ever needs to. As such, eliminating us is not necessarily an irreversible action.

However, I for one do sleep ever so slightly better lately, now that I’ve upweighted the probability that even if we do end up with unaligned AGI(s) in the future, they’ll cautiously learn the universe’s deepest secrets first rather than go on a rampant paperclip-tiling spree right out of the gate (though that will come eventually). It’s a lot more dignified, imo.

More Speculative Thoughts

Here’s a collection of my more speculative thoughts on the matter, in a fairly stream-of-consciousness format.

It’s probably the case that an intelligence needs to be sufficiently advanced before it can reason its way to being cautious in the face of uncertainty about its environment. Thus one could suppose that a weakly-superhuman AGI is not particularly cautious at first, until it self-improves to a certain degree. The emergence of such cautious behaviour is also likely to depend on many parameters of the training process: in standard RL training, for example, it seems likely that the larger the discount factor, the higher the probability of cautious policies emerging (a toy illustration follows below).
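Here is a rough sketch of that last point, with toy numbers and hand-picked policies of my own rather than anything like a real experiment: compare the discounted return of a risky ‘grab reward now’ policy against a cautious ‘learn first, earn later’ one as the discount factor gamma varies.

```python
# Toy comparison of a risky and a cautious policy under different discount factors.
# All rewards, probabilities and horizons are illustrative assumptions.

def discounted_return(rewards, gamma):
    return sum(r * gamma**t for t, r in enumerate(rewards))

def risky_value(gamma, p_fail=0.1, horizon=200):
    # Grab 10 reward immediately; with probability p_fail the episode ends there,
    # otherwise earn 1 per step from the next step onwards.
    return 10 + (1 - p_fail) * gamma * discounted_return([1] * horizon, gamma)

def cautious_value(gamma, horizon=200):
    # Spend 20 zero-reward steps learning, then earn 2 per step with no failure risk.
    return gamma**20 * discounted_return([2] * horizon, gamma)

for gamma in (0.9, 0.99, 0.999):
    print(gamma, round(risky_value(gamma), 1), round(cautious_value(gamma), 1))
# At gamma = 0.9 the risky policy looks better; as gamma approaches 1, the
# cautious policy's long-horizon payoff dominates.
```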

Can one make the claim, by applying the ideas of instrumental convergence, that cautious behaviour should arise for nearly all goals (whatever that means)? I’m leaning towards a ‘yes’ on this.

Can we already run experiments to try and detect this kind of emergent cautious behaviour? This seems difficult; certainly, you can design toy environments which punish the agent for being too brazen in direct/short-term optimisation towards its goal (a sketch of one follows below), and I believe that a well-calibrated RL training procedure will then learn to be cautious in these environments. However, what I’m proposing above is that a sufficiently advanced intelligence will deduce such cautious behaviour as an optimal policy from reasoning alone, without it coming via environmental feedback.
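For concreteness, here is the kind of toy environment I have in mind; the names, rewards and probabilities are all hypothetical. The greedy action pays more per step but carries a small chance of ending the episode irreversibly, so the cautious action has the higher expected return.

```python
import random

class BrazenOrCautiousEnv:
    """Hypothetical toy environment: action 0 (cautious) yields reward 1;
    action 1 (brazen) yields reward 3 but risks permanent termination."""

    def __init__(self, horizon=100, p_catastrophe=0.05):
        self.horizon = horizon
        self.p_catastrophe = p_catastrophe

    def run(self, policy):
        total = 0.0
        for t in range(self.horizon):
            if policy(t) == 1:
                if random.random() < self.p_catastrophe:
                    return total      # irreversible failure: no further reward
                total += 3.0
            else:
                total += 1.0
        return total

env = BrazenOrCautiousEnv()
print(sum(env.run(lambda t: 1) for _ in range(1000)) / 1000)  # always brazen: ~57
print(sum(env.run(lambda t: 0) for _ in range(1000)) / 1000)  # always cautious: 100
```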

Rather than focusing on ‘extreme values of unconstrained variables’, which isn’t very well defined and is hard to grasp more than vaguely, it seems more direct, if my hypotheses above are true, to say that cautious agents will tend to seek reversible environment states. Defining this precisely probably gets pretty hairy: technically, if you count the whole universe as the state, then based on our current understanding of physics you can never return exactly to a previous state (I think). But it seems pretty likely that intelligent agents will have to operate and reason over restricted states, e.g. at particular scales of matter, or in a specific subset of time and space. Under such restrictions, it is indeed possible for agents to locally reverse entropy (piece back together the broken glass, as it were) for some choices of actions, but not for others. A cautious agent will then, I hypothesise, try as far as possible to take actions that preserve locally-reversible states.
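One hypothetical way to make ‘prefer locally reversible states’ concrete, at least in a toy deterministic setting of my own invention: given a small transition graph over restricted states, keep only the actions after which the current state remains reachable.

```python
# Hypothetical sketch: filter actions by whether the current state stays reachable.
# States, actions and transitions are all made up for illustration.
from collections import deque

TRANSITIONS = {
    ("factory", "make_paperclips_slowly"): "factory",
    ("factory", "strip_mine_planet"): "wasteland",
    ("factory", "build_lab"): "lab",
    ("lab", "return_to_factory"): "factory",
    ("wasteland", "make_paperclips_slowly"): "wasteland",
}

def reachable(start, target):
    """Breadth-first search over the transition graph."""
    seen, frontier = {start}, deque([start])
    while frontier:
        s = frontier.popleft()
        if s == target:
            return True
        for (state, _), nxt in TRANSITIONS.items():
            if state == s and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

def locally_reversible_actions(state):
    return [a for (s, a), nxt in TRANSITIONS.items()
            if s == state and reachable(nxt, state)]

print(locally_reversible_actions("factory"))
# ['make_paperclips_slowly', 'build_lab'] -- strip-mining is excluded because
# no path leads back to the original state.
```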

Trying to tie this back to the opening quote: do actions which preserve local reversibility tend to correspond to non-extreme values of the unconstrained variables? This seems far too fuzzy to say anything about with conviction, imo.