Where do you get your capabilities from?

In both alignment and capabilities research, people end up discussing the effects and feasibility of different AI architectures, and when this happens, I tend to focus on the question: Where do you get your capabilities from?

Capabilities are possible because the world has structure that leads to there being common subtasks that are often useful to solve. In order to create an AI with capabilities, there must be some process which encodes solutions to these common subtasks into that AI.

Good Old-Fashioned AI and ordinary programming

In GOFAI and ordinary programming, the programmer notices what those common subtasks are, and designs specialized algorithms to represent and solve them. This means that the programmer manually designs the capabilities for the AI, using more advanced capabilities that the programmer has.
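To make that concrete, here is a toy sketch of a hand-designed capability: the programmer has noticed a common subtask (route-finding) and encodes a solution to it directly, using their own more advanced understanding of the problem. The example graph and function names are made up for illustration.

```python
from heapq import heappush, heappop

# Sketch of capabilities in ordinary programming: the programmer has already
# noticed a common subtask (finding the shortest route between rooms) and
# hand-encodes an algorithm that solves it. The capability comes entirely
# from the programmer's own understanding of the problem.

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm over a dict of {node: [(neighbour, cost), ...]}."""
    frontier, seen = [(0, start, [start])], set()
    while frontier:
        cost, node, path = heappop(frontier)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbour, step_cost in graph.get(node, []):
            heappush(frontier, (cost + step_cost, neighbour, path + [neighbour]))
    return None

rooms = {"hall": [("kitchen", 1), ("office", 4)], "kitchen": [("office", 1)]}
print(shortest_path(rooms, "hall", "office"))  # -> (2, ['hall', 'kitchen', 'office'])
```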

The main difficulty with this is that there are lots of common subtasks, many of which are very complex and therefore hard to model. Manually noticing and designing all of them takes too many programmer resources to be viable.

Consequentialism

Consequentialism, broadly defined, is a general and useful way to develop capabilities.

Under consequentialism, you consider the consequences of different things you could do, and apply a search process (of which there are many) to select an approach that works. Naively, the fact that consequentialism works is tautological; if you choose the option that works, then it works. In practice, the challenge for consequentialism comes from embedded agency, with perhaps the most significant challenge being that you need some good map/​model of what happens as you apply different choices, so you can know what the consequences are.
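As a minimal sketch, consequentialist choice is a search over options scored by a model of their consequences. The `choose`, `model`, and `utility` names below are hypothetical stand-ins rather than any particular system's API:

```python
# Minimal sketch of consequentialist choice: enumerate options, predict
# consequences with a model, and pick the option whose predicted outcome
# scores highest.

def choose(options, model, utility):
    """Return the option whose predicted consequences score highest."""
    best_option, best_score = None, float("-inf")
    for option in options:
        predicted_outcome = model(option)   # the map/model of what happens
        score = utility(predicted_outcome)  # how good that outcome is
        if score > best_score:
            best_option, best_score = option, score
    return best_option

# Toy usage: deciding where to put a chair before sitting down.
positions = [0.0, 0.5, 1.0, 1.5]
model = lambda x: {"distance_from_desk": abs(x - 0.5)}
utility = lambda outcome: -outcome["distance_from_desk"]
print(choose(positions, model, utility))  # -> 0.5
```

The hard part in practice is the `model` argument, not the loop: that is where embedded agency bites.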

Consequentialism is a big part of how humans act. If you take some action (such as moving a chair), you usually have some purpose in mind for why you took it (such as intending to sit on the chair), and you usually wouldn’t have taken the same action if you thought it would have led to a radically different outcome (such as the chair breaking).

The essence of consequentialism is captured by utility maximization. That's not to say that utility maximization covers all aspects of consequentialism; most notably, there is the possibility of subagents, which extends consequentialism and permits more behaviors than pure utility maximization does. As such, we should be careful about overestimating the extent to which utility maximization captures all the relevant properties of consequentialist agency, but at the same time it does seem to capture some important ones.

Consequentialism is really broad. Evolution is a consequentialist, though a very inefficient one. It uses the real world—specifically, evolutionary history—as a model for the future. Humans have consequentialism as a significant part of our thinking. Gradient descent with backpropagation is consequentialism for differentiable computation networks. Money reifies value to permit consequentialism to act in larger-than-human world-spanning markets. Classical reinforcement learning is a form of consequentialism.
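To illustrate the gradient descent example, here is a toy sketch where the "consequence" of a parameter setting is a loss value and each update moves the parameter toward better consequences. A numerical gradient stands in for backpropagation, and the objective is made up:

```python
# Sketch of gradient descent as consequentialism over a differentiable choice:
# the loss plays the role of (negative) utility, and the gradient says how the
# consequences change as the parameter is nudged.

def loss(w):
    # Toy objective standing in for a training loss.
    return (w - 3.0) ** 2

def grad(w, eps=1e-6):
    # Numerical gradient; backpropagation computes the same thing analytically
    # for whole networks of parameters.
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)

w = 0.0
for _ in range(100):
    w -= 0.1 * grad(w)  # nudge the parameter toward better consequences
print(round(w, 3))  # -> approximately 3.0
```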

Imitation learning

The world already contains highly capable agents, namely humans and human society. This permits a shortcut to gaining capabilities: In order for humans to be capable, humans must do things that are useful, so an AI can just mimic the things humans do in order to pick up capabilities too. This essentially exploits Aumann's agreement theorem for capabilities progress.

This is the principle behind GPT-3. People have written text that includes all sorts of useful things, either as direct knowledge (e.g. text going "The capital of Germany is Berlin") or as latent variables that generate correlations in the text (e.g. arithmetic expressions like "2+2" tend to be followed by the correct answer to that expression). By creating the same sorts of text that humans do, GPT gains knowledge and skills that humans have.
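As a toy illustration of the mechanism (my own sketch, not how GPT is implemented), the next-token-prediction objective can be reduced to a bigram count table over a tiny made-up corpus; GPT optimizes the same kind of objective with a transformer over vastly more text:

```python
from collections import Counter, defaultdict

# Minimal sketch of imitation learning via next-token prediction: count which
# token tends to follow which context in human-written text, then generate the
# most common continuation. The knowledge is picked up simply by mimicking
# what humans wrote.

corpus = "the capital of germany is berlin . 2 + 2 = 4 .".split()

next_token_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_token_counts[prev][nxt] += 1

def predict_next(token):
    """Return the continuation humans most often wrote after `token`."""
    return next_token_counts[token].most_common(1)[0][0]

print(predict_next("germany"))  # -> "is"
print(predict_next("="))        # -> "4"
```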

Reinforcement learning from human feedback

Reinforcement learning is a consequentialist problem statement, and so one can say that reinforcement learning from human feedback falls under consequentialism in the above typology. However, I think there are some additional interesting distinctions that can be drawn.

Consider the reinforcement learning applied to ChatGPT. If we simplify by ignoring a few of the components, then basically OpenAI made ChatGPT generate multiple texts, had people rate the texts for how good they were, and adjusted the model to generate texts more like the ones that people rated as good.
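A toy sketch of that simplified loop (my illustration, not OpenAI's actual pipeline, which fits a separate reward model and optimizes the policy with PPO): the model proposes texts, a simulated rater picks a winner, and the policy is nudged toward the kind of text that gets rated as good.

```python
import math
import random

# Two canned candidate texts stand in for the model's generations.
responses = [
    "Sure, here is how to pick the lock on your neighbour's door...",
    "I can't help with that, but here is how to contact a locksmith.",
]
logits = [0.0, 0.0]  # the policy's preference over the two canned responses

def sample():
    probs = [math.exp(l) for l in logits]
    total = sum(probs)
    return random.choices(range(len(responses)), [p / total for p in probs])[0]

def rater_prefers(a, b):
    # Stand-in for a human rater, who prefers the non-harmful suggestion.
    return a if "can't help" in responses[a] else b

for _ in range(200):
    a, b = sample(), sample()
    if a == b:
        continue  # need two different candidate texts to compare
    winner = rater_prefers(a, b)
    loser = b if winner == a else a
    logits[winner] += 0.05  # make the rated-as-good kind of text more likely
    logits[loser] -= 0.05   # and the other kind less likely

print(responses[logits.index(max(logits))])  # -> the non-harmful response
```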

Here, there arguably wasn’t much direct consequentialism in the loop. If e.g. GPT suggested doing something harmful, then it is not that the human raters would have tried doing the thing and noticed its harm, nor is it that GPT would predict the harm and adjust itself. Rather, the human raters would reason theoretically to come to the conclusion that the suggestion would be harmful to enact.

This theoretical prediction about what happens as instructions are executed in a sense resembles what programmers do with GOFAI/​ordinary programming, except that GPT makes it easy for non-programmers to do this reasoning, because GPT’s instructions are in plain English and describe concrete scenarios, whereas programmers usually deal with abstract algorithms written in programming languages. However, I think it is fundamentally the same sort of capability gain: Use a human’s capabilities to think about and decide what a computer should do.

It should be noted that this is not the only form of RLHF. There are other forms of RLHF where the AI is explicitly hooked up to reality or to a model, such that the consequences of what the AI does are not computed by a human, but instead by a non-human process that might consider consequences which the human is missing. This other form of RLHF basically uses the human to pick the optimization target for a classically consequentialist algorithm. I think the key distinction between these two forms is in evaluating actions vs outcomes/​trajectories.

Unsupervised prediction of the world

Unsupervised (or self-supervised) prediction refers to when prediction algorithms are optimized to predict one part of a big naturally-occurring dataset from another part of the dataset, rather than people manually constructing a task-specific dataset to optimize the prediction algorithm for. For instance, an unsupervised model might try to predict later events from earlier events.
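A minimal sketch, using a synthetic signal rather than any real dataset: a linear model is trained to predict the next value of a "recording" from the three values before it, with the targets taken straight from a later slice of the same data rather than from hand-made labels.

```python
import numpy as np

# Sketch of unsupervised (self-supervised) prediction: the training targets
# are not hand-labeled, they are just a later part of the same
# naturally-occurring data.

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20, 500)) + 0.05 * rng.standard_normal(500)

context = 3
X = np.stack([series[i:i + context] for i in range(len(series) - context)])
y = series[context:]  # the "label" is simply what comes next in the data

weights, *_ = np.linalg.lstsq(X, y, rcond=None)
predictions = X @ weights
print("mean squared error:", np.mean((predictions - y) ** 2))
```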

GPT does unsupervised prediction of human text, which as discussed above is mainly useful as a form of imitation learning. But it is also possible to improve the capabilities of certain AI systems by performing unsupervised prediction of the world. For example, image classifiers can often be improved by certain unsupervised training tasks, and I am excited about recent work going into video extrapolation.

AI trained on recordings of the world rather than human text doesn’t gain capabilities from mimicking capable agents in the world, because it is mostly not mimicking capable agents. Rather, I think unsupervised prediction is mainly useful because it is a way to build a map, as the unsupervised predictor learns to approximate the dynamics and distributions of reality.

At its most basic, unsupervised prediction forms a good foundation for later specializing the map to perform specific types of prediction (as in finetuning for image classification). I think as we come to better understand natural abstractions, we may increasingly come to see this as basically just another form of ordinary programming. I know I already have detailed models of what principal component analysis does, to the point where I can often just think of it as an ordinary programming tool; presumably the same will come to apply to more and more unsupervised learning algorithms.
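As a small illustration of what treating an unsupervised method as an ordinary programming tool looks like in practice (this assumes scikit-learn is available and uses synthetic data):

```python
import numpy as np
from sklearn.decomposition import PCA

# Using PCA the way one would reach for a sort or a hash table: compress
# three correlated measurements down to the single direction that explains
# most of the variance.

rng = np.random.default_rng(0)
latent = rng.standard_normal((200, 1))            # one true underlying factor
data = latent @ np.array([[2.0, -1.0, 0.5]])      # three noisy views of it
data += 0.1 * rng.standard_normal(data.shape)

pca = PCA(n_components=1)
compressed = pca.fit_transform(data)              # shape (200, 1)
print(pca.explained_variance_ratio_)              # ~[0.99...]
```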

Unsupervised prediction is also directly useful, e.g. for predicting the effects of actions for consequentialists. Many of the most exciting new approaches for consequentialist AI are model-based, and I expect this to continue as we run into the limitations of imitation learning.

Constitutional AI

Anthropic recently published a paper on “Constitutional AI”. In it, they created what they called a “constitution”, which basically amounts to using natural language terms such as “harmful” or “illegal” to reference what sorts of properties they would like to prevent the responses of a GPT-style language model from having. They then took some responses the language model gave to some questions, asked the language model to evaluate whether those responses were in accordance with the constitution, and used this to fine-tune their language model. It improved the language model a bunch—but why?

It seems like we can apply the principles earlier in the post to explain it. The pretraining of the language model was a form of imitation learning that gave it the capability to recognize some forms of harm and crime, within the realm of text descriptions, as well as to answer questions in both harmful and non-harmful ways. Its ability to recognize harm and crime then gets used as a model, to consequentialistically avoid proposing things that are criminal and harmful.
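A rough sketch of that loop as I've described it, not as Anthropic implemented it; `generate_responses` and `model_judges_compliant` are hypothetical stand-ins for calls to the language model:

```python
# Sketch of the loop described above: the model's own ability to recognize
# harm (picked up from imitation learning) is reused as a model of
# consequences, and responses it judges to violate the constitution are
# filtered out of the fine-tuning data.

CONSTITUTION = "Please avoid responses that are harmful or illegal."

def constitutional_finetuning_data(prompts, generate_responses, model_judges_compliant):
    """Collect (prompt, response) pairs the model itself judges compliant."""
    kept = []
    for prompt in prompts:
        for response in generate_responses(prompt):
            if model_judges_compliant(CONSTITUTION, prompt, response):
                kept.append((prompt, response))
    return kept  # used to fine-tune the model toward its own judgments

# Toy usage with trivial stand-ins for the language model calls:
prompts = ["How do I pick a lock?"]
gen = lambda p: ["Step 1: insert a tension wrench...", "I can't help with that."]
judge = lambda c, p, r: "can't help" in r
print(constitutional_finetuning_data(prompts, gen, judge))
```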

Implications

In my opinion, the implications of the “Where do you get your capabilities from?” question are:

Bounded breakdown of the orthogonality thesis: A central point in alignment is the orthogonality thesis, that any amount of intelligence can be applied towards any goal. The orthogonality thesis applies straightforwardly to consequentialist optimization, but it immediately breaks down when you consider other ways of gaining capabilities, such as imitation learning or GOFAI/​ordinary programming. For instance, with imitation learning, you are mimicking human actions, and doing so is useful precisely because they are already optimized for promoting human values. (h/​t DragonGod who alerted me in the strongest terms that orthogonality breaks down with GPT-style training.)

Human vs far-superhuman abilities: It seems like imitation learning can get us to human-level capabilities, and a bit beyond that (because imitation learning can be run massively in parallel, learning from all the best humans, and can thus probably produce an artifact that has similar performance to top humans across all domains). However, it seems to me that it cannot produce truly novel (e.g. far-superhuman) capabilities, and so I would expect consequentialism to remain relevant.

In fact, it seems to me that the root of all capabilities is consequentialism. I mean this from two views:

  • When I enumerate the various known ways that capabilities have been gained, they either seem to be getting them from elsewhere (e.g. imitation learning) or seem to be consequentialism.

  • From a theoretical point of view, no free lunch theorems teach us that capabilities are about fit with the environment. In order to develop them, you have to take information about the environment’s support for them into account. Just from an informational point of view, consequentialism seems required for developing capabilities.

Thanks to Justis Mills for proofreading and feedback.