Confusions in My Model of AI Risk

A lot of the reason I am worried about AI comes from the possibility of developing optimizers that have goals which don’t align with what humans want. However, I am also pretty confused about the specifics here, especially core questions like “what do we actually mean by optimizers?” and “are these optimizers actually likely to develop?”. This means that much of my thinking and language when talking about AI risk is fuzzier than I would like.

This confusion about optimization seems to run deep, and I have a vague feeling that the risk paradigm of “learning an optimizer which doesn’t do what we want” is likely confused and somewhat misleading.

What actually is optimization?

In my story of AI risk I used the term ‘optimization’ a lot, and I think it’s a very slippery term. I’m not entirely sure what it means for something to ‘do optimization’, but the term does seem to be pointing at something important and real.

A definition from The Ground of Optimization says an optimizing system takes something from a wide set of states to a smaller set of states, and is robust to perturbations during this process. Training a neural network with gradient descent is an optimization process under this definition: we could start with a wide range of initial network configurations, the network is modified until it is in one of the few configurations which do well on the training distribution, and even if we add a (reasonable) perturbation the weights will still converge. I think this is a good definition, but it is defined entirely in terms of behavior rather than a mechanistic process. Additionally, it doesn’t match exactly with the picture where there is an optimizer which optimizes for an objective. This optimizer/objective framework is the main way that I’ve talked about optimizers, but I would not be surprised if this framing turned out to be severely confused.
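
To make this concrete, here is a minimal toy sketch (my own construction, not something from The Ground of Optimization) of that definition: many different initial parameter settings, and even a perturbation partway through training, all end up in the small set of low-loss configurations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

def loss(w):
    # Mean squared error on the "training distribution".
    return np.mean((X @ w - y) ** 2)

def grad(w):
    return 2 * X.T @ (X @ w - y) / len(y)

for run in range(5):                       # wide set of starting configurations
    w = rng.normal(scale=10.0, size=3)
    for step in range(500):
        if step == 250:                    # a (reasonable) perturbation mid-training
            w += rng.normal(scale=1.0, size=3)
        w -= 0.05 * grad(w)
    print(f"run {run}: final loss {loss(w):.2e}")  # every run ends near zero loss
```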

One possible way that a network could ‘do optimization’ would be for it to do some kind of internal search or internal iterative evaluation process to find the best option. For example, seeing which response best matches a question, or searching a game tree to find the best move. This seems like a broadly useful style of algorithm for a neural network to learn, especially when the training task is complicated. But it also seems unlikely that networks will implement this exactly; it seems much more likely that networks will implement something that looks like a mash of some internal search and some heuristics.
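
As a rough illustration (entirely hypothetical, and much cleaner than anything a real network would plausibly learn), the difference between these two styles might look something like this:

```python
from typing import Callable, List

def search_policy(options: List[str], score: Callable[[str], float]) -> str:
    # Internal iterative evaluation: explicitly consider every candidate,
    # score it against an internal objective, and return the best one.
    return max(options, key=score)

def heuristic_policy(options: List[str], cached_rules: set) -> str:
    # Pile of heuristics: no explicit evaluation of how good each option is,
    # just pattern-matching against rules that happened to work in training.
    for option in options:
        if option in cached_rules:
            return option
    return options[0]

candidates = ["short answer", "a much more detailed answer", "ok"]
print(search_policy(candidates, score=len))                           # evaluates all, picks the best
print(heuristic_policy(candidates, {"a much more detailed answer"}))  # returns the first rule match
```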

Additionally, it seems like the boundary between solving a task with heuristics and solving it with optimization is fuzzy. As we build up our pile of heuristics, does this suddenly snap into being an optimizer, or does it slowly become more like an optimizer as gradient descent adds and modifies the heuristics?

For optimization to actually be dangerous, the AI needs to have objectives which are connected to the real world. Running some search process entirely internally to generate an output seems unlikely to lead to catastrophic behavior. However, there are objectives connected to the real world which the AI could easily develop. For example, the AI could mess with the real world to ensure it receives certain inputs, which in turn lead to certain internal states.

Where does the consequentialism come from?

Much of the danger from optimizing AIs comes from consequentialist optimizing AIs. By consequentialist I mean that the AI takes actions based on their consequences in the world.[1] I have a reasonably strong intuition that reinforcement learning is likely to build consequentialists. I think RL probably does this because it explicitly selects for policies based on how well they do on consequentialist tasks: the AI needs to be able to take actions which will lead to good (future) consequences on the task. Consequentialist behavior will robustly do well during training, and so this behavior will be reinforced. It seems important that the tasks are extended across time rather than being a single timestep; otherwise the system doesn’t need to develop any longer-term thinking or planning.
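
Here is a sketch of where that selection pressure comes from (a simplified, policy-gradient-style credit assignment of my own choosing, not a claim about any particular RL setup): each action is credited with the discounted rewards that come after it, so policies are selected for the future consequences of their actions, and with a single timestep the credit collapses to the immediate reward alone.

```python
def returns_to_go(rewards, gamma=0.99):
    # Discounted sum of future rewards for each timestep; in policy-gradient
    # methods this is (roughly) the weight on each action's gradient signal.
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return list(reversed(out))

episode_rewards = [0.0, 0.0, 0.0, 1.0]   # reward only arrives at the end of the episode
print(returns_to_go(episode_rewards))    # early actions still get credit: [0.97, 0.98, 0.99, 1.0]
print(returns_to_go([1.0]))              # single-timestep task: only the immediate reward matters
```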

RL seems more likely to build consequentialists than training a neural network for classification or next word prediction. However, these other systems might develop some ‘inner optimizer/consequentialist’ algorithms, because these are good ways to answer questions. For example, in GPT-N, if the tasks are diverse enough, maybe the algorithm which is learned is basically an optimizer which looks at the task and searches for the best answer. I’m unsure whether, or how, this ‘inner optimizer’ behavior could lead to the AI having objectives over the real world. It is conceivable that the first algorithm which the training process ‘bumps into’ is a consequentialist optimizer which cares about states of the world, even if it doesn’t have access to the external world during training. But it feels like we would have to be unlucky for this to happen, because there isn’t any selection pressure pushing the AI system to develop this kind of external-world objective.

Will systems consistently work as optimizers?

It seems reasonably likely that neural networks will only act as optimizers in some environments (in fact, no-free-lunch theorems might guarantee this). On some inputs/environments, I expect systems to either just break or do things which look more heuristic-y than optimization-y. This is a question about how much the capabilities of AI systems will generalize. It seems possible that there will be domains where the system’s capabilities generalize (it can perform coherent sequences of actions), but its objectives do not (it starts pursuing a different objective).

There will be some states where the system is capable and does what humans want, for example, on the training distribution. But there may be more states where the system is able to capably do things, but no longer does what humans want. There will also be states of the world where the AI neither acts capably nor does what humans want, but these states don’t seem as catastrophically dangerous.

Consequentialist deception could be seen as an example of capabilities generalizing further than the aligned objective: the system is still able to perform capably off the training distribution, but with a misaligned goal. The main difference here seems to be that the system was always ‘intending’ to do this, rather than just entering a new region of the state space and suddenly breaking.

It isn’t really important that the AI system acts as an optimizer for all possible input states, or even for the majority of the states that it actually sees. What is important is whether the AI acts as an optimizer for enough of its inputs to cause catastrophe. Humans don’t always act as coherent optimizers, but to the extent that we do act as optimizers we can have large effects on the state of the world.

What does the simplicity bias tell us about optimizers?

Neural networks seem to have a bias towards learning simple functions. This is part of what lets them generalize and not just go wild when presented with new data. However, this is a claim about the functions that neural networks learn; it is not a claim about the objectives that an optimizer will use. It does seem much more natural for simpler objectives to be easier to find, because in general adding arbitrary conditions makes things less likely. We could maybe think of the function that an optimizing neural network implements as being made up of the optimizer (for example, Monte Carlo Tree Search) and the objective (for example, maximize apples collected). If the optimizer and objective are (unrealistically) separable, then all else equal a simpler objective will lead to a simpler function. I wouldn’t expect these to be cleanly separable; I expect that for a given optimizer some objectives are much simpler or easier to implement than others.
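
As a toy sketch of that (unrealistic) decomposition, with the names and objectives invented purely for illustration: the same fixed search routine can be paired with different objectives, and every extra condition in the objective adds description length to the overall function.

```python
from typing import Callable, Iterable, List

def search(options: Iterable[List[str]], objective: Callable) -> List[str]:
    # The "optimizer": a generic routine shared across objectives.
    return max(options, key=objective)

# A simple objective: just count apples collected.
simple_objective = lambda plan: plan.count("apple")

# A more complex objective: apples, minus a penalty for detours, zeroed out
# on weekends. Every extra condition adds complexity to the learned function.
def complex_objective(plan):
    if "weekend" in plan:
        return 0.0
    return plan.count("apple") - 0.5 * plan.count("detour")

plans = [["apple", "apple", "detour"], ["apple"], ["apple", "weekend"]]
print(search(plans, simple_objective))   # picks the plan with the most apples
print(search(plans, complex_objective))  # picks the plan with the best adjusted score
```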

We may eventually be able to form some kind of view about what kind of ‘simplicity bias’ we expect for objectives; I would not be surprised if this was quite different from the simplicity bias we see in the functions learned by neural nets.

  1. ^

    Systems which are not consequentialist could, for example, not be optimizers at all, or could be systems which optimize for taking actions but not because of the effects of those actions in the world. A jumping robot that just loves to jump could be an example of the latter.