Sufficiently Advanced Language Models Can Do Reinforcement Learning

Epistemological Status: I’ve suspected this for a while. The mesa-optimization status of GPT seems to be folk theory at the moment. It seems worthwhile to develop a prototype explanation of its abilities. Claims about the connection between replication and RL could be worked out to a higher level of clarity using tropical algebra, but that is most likely overkill, so I maintain a certain level of informality throughout.

I want to start by exploring a simple question and the implication of a positive result. Can GPT predict the quality of its output? I’m going to focus on an idealized setting where I assume,

  1. GPT can be treated as an accurate language model.

  2. GPT’s context window is wide enough to learn the relevant task distributions we’re interested in.

  3. Evaluating the quality of output is easier for GPT than producing the given output for a given input.

I’ll show that evaluation can be used to boost the output of this idealized version of GPT. Moreover, this boosting can be interpreted more generally as a natural selection process. As I’ve argued previously, a natural selection process maps cleanly onto RL in the limit. Because OpenAI trains GPT using log-probability we can directly interpret the model as a replication process for special types of tasks that I introduce as evaluable recurrent tasks.

The Setup

Empirically, we’ve seen that GPT can do quite a few things. One could interpret the above assumptions as an attempt to explain that empirical success. On the wiki-page for GPT we have two interesting entries that I’m going to explain in more depth. The first is iterative chaining. The second is the survey trick.

Iterative chaining suggests that,

For example, suppose you’ve got a set of news headlines but don’t know what the labels should be. That’s fine! Just start with some prompts, let the API figure out how it wants to label them with your initial prompts as guidance, and then feed the API-labelled prompts back into the API for those and new headlines, letting the labels evolve over time.

The survey trick suggests that we can insist GPT qualify certainty in its output on a scale from 1-5 or 1-7. Now, in reality, GPT doesn’t work well with numbers so there are still some inconveniences to work out on that front. However, at the moment survey-scales such as,

Not at all, a little, a moderate amount, a lot, or a great deal?

seem to work. Technically speaking, because of assumption one we never actually need to observe this output. Instead, at the end of each response, we can directly calculate the probabilities of the various evaluations from GPT.
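Here is a minimal sketch of that calculation, assuming we can read off a log-probability for each survey option as the continuation of a response. The function name and the example numbers are hypothetical, and no particular API is implied.

```python
import math

SCALE = ["not at all", "a little", "a moderate amount", "a lot", "a great deal"]

def survey_distribution(option_logprobs):
    """Renormalize log-probabilities of the survey options into a distribution.

    option_logprobs maps each option in SCALE to the (hypothetical)
    log-probability the model assigns to that option as the continuation.
    """
    # Subtract the max before exponentiating for numerical stability.
    m = max(option_logprobs[o] for o in SCALE)
    weights = {o: math.exp(option_logprobs[o] - m) for o in SCALE}
    total = sum(weights.values())
    return {o: w / total for o, w in weights.items()}

# Made-up numbers, purely for illustration.
example_logprobs = {
    "not at all": -4.0,
    "a little": -3.0,
    "a moderate amount": -2.0,
    "a lot": -1.5,
    "a great deal": -1.0,
}
dist = survey_distribution(example_logprobs)
```

The point is only that the evaluation never has to be sampled: the distribution over the scale is available directly.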

Iterative Chaining

The main advantage of doing something like this is that we can effectively hack in conditioning on the tail of an output. For example, if we only take outputs that are evaluated with a “great deal” of confidence we end up conditioning on high confidence outputs.
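As a toy illustration of conditioning on the tail, assume each output comes paired with the probability mass the evaluator puts on “a great deal”. The function name, the threshold, and the sample data are all hypothetical.

```python
def condition_on_confidence(samples, threshold=0.5):
    """Keep only outputs whose confidence score clears the threshold.

    samples is a list of (output, confidence) pairs, where confidence is
    assumed to be the evaluator's probability of "a great deal".
    """
    return [output for output, confidence in samples if confidence >= threshold]

# Made-up outputs and scores.
samples = [("answer A", 0.9), ("answer B", 0.2), ("answer C", 0.6)]
kept = condition_on_confidence(samples)
```

Whatever is fed back into the context afterwards is drawn only from `kept`, which is what conditioning on high confidence outputs amounts to here.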

In a previous post, I introduced some machinery to think about structured tasks that we might give a language model. The gist is that I introduce something called a recurrent task so that I can think of tasks as being properly sampled from a distribution. You can likely avoid reading this if you think of recurrent tasks as random samples of query/answer pairs from a distribution.

Say we have a mapping that we can use to order query/answer pairs. Consider the recurrent task that consists of mapping queries to answers and the evaluation task that consists of mapping query/answer pairs to evaluations. The existence of such a mapping makes the recurrent task an evaluable recurrent task. We want to show it’s reasonable to wonder whether or not these two tasks are enough to get GPT to learn to accurately sample from the task distribution.

Iterative Classification

It’ll be easier if we consider an example first. Say we have a recurrent task implied by and we want GPT to match the distribution. Specifically, we’re outputting answers to factual questions. Thus, we can evaluate whether or not an answer is correct. Given this, we use some amount of context to condition the model and then test GPT. So we evaluate . Note that if we gave examples we’d have and that .

Our basic problem is that GPT zero-shot does poorly. It has an error rate of $\epsilon$. However, on the evaluation task, the probability that it lets through a true/false positive is $p_T$ / $p_F$. Say we let the evaluation task manage the recurrent task in the following sense:

  1. Allow GPT to answer the next query.

  2. Allow GPT to predict the evaluation.

  3. If the evaluation returns TRUE, append the q/a pair to a buffer.

  4. If the buffer is large enough, append it to the context and repeat.

Will this algorithm work? Yes, as long as $p_T > p_F$. The probability of a false positive append is $\epsilon p_F$. The probability of a true positive append is $(1-\epsilon) p_T$. In expectation the proportion of appends that will be false positives will be,

$$\epsilon' = \frac{\epsilon p_F}{\epsilon p_F + (1-\epsilon) p_T}.$$

Why do I call this $\epsilon'$? Well, once we start appending the positive answers to the context GPT will adapt its error rate. The rate of improvement is likely difficult (impossible) to analytically figure out. However, if I invoke assumption one then after enough examples are appended GPT will figure out that sometimes it should output correct answers. This gets us to $\epsilon'$. Naturally, we set up a recurrence,

$$\epsilon_{n+1} = \frac{\epsilon_n p_F}{\epsilon_n p_F + (1-\epsilon_n) p_T}.$$

Since $p_T > p_F$ we have,

$$\frac{\epsilon_{n+1}}{1-\epsilon_{n+1}} = \frac{p_F}{p_T} \cdot \frac{\epsilon_n}{1-\epsilon_n} \implies \epsilon_n \to 0.$$

You could get into hairy details on convergence rate, but hopefully there’s intuition for how this process ends up working. If you keep going in this direction you work towards a description of boosting.
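To get a feel for the recurrence, here is a small sketch that iterates it numerically. The names `eps`, `p_true`, and `p_false` are my labels for the error rate and the evaluator’s true/false positive probabilities, the numbers are made up, and the sketch assumes GPT fully adapts to the buffer’s error rate each round, which is the idealization from assumption one.

```python
def next_error_rate(eps, p_true, p_false):
    """Fraction of appended pairs that are wrong after one round of filtering.

    eps:     probability that an answer is wrong
    p_true:  probability the evaluator accepts a correct answer
    p_false: probability the evaluator accepts a wrong answer
    """
    wrong_appends = eps * p_false
    right_appends = (1.0 - eps) * p_true
    return wrong_appends / (wrong_appends + right_appends)

def iterate_error_rate(eps, p_true, p_false, rounds):
    """Apply the recurrence repeatedly and return the whole trajectory."""
    history = [eps]
    for _ in range(rounds):
        eps = next_error_rate(eps, p_true, p_false)
        history.append(eps)
    return history

# Made-up starting point: a poor answerer with a mediocre evaluator.
history = iterate_error_rate(eps=0.8, p_true=0.7, p_false=0.4, rounds=20)
```

Even with an evaluator that is only modestly better than chance at separating right from wrong, the error rate shrinks every round, because the odds of an error get multiplied by `p_false / p_true` each time.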

Selecting on GPT Output Is RL

In general, if the output from GPT is cherry-picked then GPT does a deformed version of reinforcement learning. We just saw that the language model is able to iteratively bootstrap itself by conditioning over a special type of recurrent task. Ultimately, this account will serve to explain the success of generative pretraining. Let’s get into more detail about the process.

In the original paper for GPT, OpenAI’s goal was to estimate the log probability of the next token given a context as a form of pre-training for downstream language tasks. Equivalently, we want to maximize the log probability of text streams,

$$L(U) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta).$$

When we sample from the model we commonly use a temperature parameter $T$ that converts the log-probability back into regular probability,

$$P_T(u) \propto \exp\left(\frac{\log P(u)}{T}\right),$$

where I ignore the normalization factor. So then according to the thermodynamic interpretation, we actually have a population of attempts at matching the task distribution that show up according to their, now weighted, probability under GPT. As we send the temperature to zero the only policies that survive are the ones that stick to the classical MLE critical path.
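A minimal sketch of temperature reweighting on a toy distribution, assuming all probabilities are strictly positive. The function name and the numbers are hypothetical; this is the standard exponentiate-and-renormalize step, not any particular implementation.

```python
import math

def apply_temperature(probs, temperature):
    """Reweight a distribution as p ** (1 / T) and renormalize."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    log_weights = [math.log(p) / temperature for p in probs]
    # Subtract the max before exponentiating for numerical stability.
    m = max(log_weights)
    weights = [math.exp(w - m) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]

base = [0.5, 0.3, 0.2]
same = apply_temperature(base, 1.0)   # T = 1 leaves the distribution alone
cold = apply_temperature(base, 0.05)  # small T concentrates on the mode
```

At `T = 1` nothing changes, while at small `T` nearly all of the mass sits on the most likely option, which is the sense in which only the maximum-likelihood path survives.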

The reason we’re introducing all of this machinery is so that we can understand what happens when the user cherry-picks. The beauty of recurrent tasks here is that they are sampled from a distribution, which means that we have finite-length episodes. Let’s hypothesize that the human overseer wants to select for criteria that differ slightly from those implied by that distribution. These criteria are the mapping from above that orders query/answer pairs; however, now we think of the mapping as assigning probability to outputs.

Alternatively, we can think of the mapping as a model of the probability that a human overseer will cherry-pick a given output. The trick is that we can convert it into a soft-selection mechanism by allowing pairs to reproduce with the probability it assigns. The new probability model is special in that it is history independent across multiple query/answer pairs (permutation invariant). This means that it can be used to augment the recurrent task into something we’ve been calling an evaluable recurrent task.

First, note that enforcing the selection criteria will eventually adapt GPT to the new probability model. This is because when we have an output we allow it to survive with the probability the mapping assigns to it. By assumption one, GPT can encode the resulting distribution. Thus, as we start adding surviving pairs to the context GPT will adapt to it. The normal way of saying this is that cherry-picking rollouts conditions GPT for future queries.
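A toy sketch of soft selection, assuming a three-outcome output distribution and a made-up survival probability for each outcome. The names `p` and `survive` are hypothetical stand-ins for the model’s output distribution and the selector. The survivors should be distributed proportionally to `p * survive`, which is the distribution GPT would then be conditioned on.

```python
import random

def selected_distribution(p, survive):
    """Distribution over outputs after soft selection: proportional to p * survive."""
    weights = {x: p[x] * survive[x] for x in p}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

def simulate_selection(p, survive, n_samples, seed=0):
    """Empirical distribution of survivors when each sample is kept with probability survive[x]."""
    rng = random.Random(seed)
    outputs = list(p)
    counts = {x: 0 for x in outputs}
    for x in rng.choices(outputs, weights=[p[x] for x in outputs], k=n_samples):
        if rng.random() < survive[x]:
            counts[x] += 1
    kept = sum(counts.values())
    return {x: c / kept for x, c in counts.items()}

# Made-up numbers: the model mostly produces "bad", the selector mostly keeps "good".
p = {"good": 0.2, "okay": 0.3, "bad": 0.5}
survive = {"good": 0.9, "okay": 0.5, "bad": 0.1}
exact = selected_distribution(p, survive)
empirical = simulate_selection(p, survive, n_samples=200_000)
```

Comparing `exact` with `empirical` shows the survivors matching the reweighted distribution rather than the original one.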

Second, this implies that having an appropriate prompt or answers for the queries is a sufficient, but not necessary, condition for getting good results, in terms of the selection criteria, from GPT. While waiting for adaptation to occur may be slow, in terms of rollouts, it will still happen. In essence, examples just start us further along in the process.

By assumption three, we can design co-recurrent tasks that evaluate query/answer pairs using the mapping for a given recurrent task. Putting everything together, we conclude that sufficiently complex language models can oversee their own adaptation to a general class of tasks I’m calling evaluable recurrent tasks.

The kicker is that, as I showed in a previous post, selection pressure on a distribution of mixing replicators leads to reinforcement learning as we take the temperature to zero. We can study the population dynamics of viral strains with the mutation matrix $Q$ using,

$$n_j(t+1) = \sum_i Q_{ij} f_i n_i(t).$$

Let’s switch to the MDP setting. Assume the actions of the strain(s) have a deterministic effect on the environment transitions. We’re going to interpret the rewards as a fitness score allowing the agent to continue propagating. First, let the transition matrix for the system be given as $P$. Second, transform each reward $r_i$ to the fitness $f_i = e^{r_i / T}$. Note that up to scaling this is identical to what we do with log-probability. The Quasispecies formula relates the population of individuals at each state after one stage of replication after we set $Q = P$.

The individuals aren’t intelligent. Instead, the fitness controls the replication rate of transitions. If $r_i = 0$ then the transition is neutral and the number of individuals collected on a state is neither amplified nor diminished. If $r_i \to -\infty$ then the transition is extremely harmful and if $r_i \to \infty$ the transition is extremely helpful.

Notice that we can study the space of all possible transitions and conclude that,

$$n_{s_t}(t) = \sum_{s_0, \ldots, s_{t-1}} \left( \prod_{k=0}^{t-1} P_{s_k s_{k+1}} \, e^{r_{s_k} / T} \right) n_{s_0}(0).$$

To make further progress, remember that actions have deterministic effects so it’s okay to assume individuals are fully random in their exploration. This allows us to simplify the product into,

$$n_{s_t}(t) \propto \sum_{s_0, \ldots, s_{t-1}} \exp\left( \frac{1}{T} \sum_{k=0}^{t-1} r_{s_k} \right) n_{s_0}(0).$$

In words, we have decomposed the evolution of the population into a summation over all the paths the strains could take through the system. The twist is that each path is weighted by an exponential term proportional to the reward that path receives from the environment. Philosophically, this has the same spirit as the path integral approach used in physics. If we send $T \to 0$ in the path integral, which is the thermodynamic limit, we’ll get back the equations for classical motion. The claim is that the dynamics reinforce only the optimal paths in this limit.
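The path decomposition can be checked numerically on a toy system. The sketch below assumes three states, uniformly random transitions, fitness `exp(reward / T)` collected at the state being left, and made-up rewards. All names are hypothetical. It compares iterating the one-step replication rule against a brute-force sum over every path.

```python
import itertools
import math

def step(population, transition, fitness):
    """One replication stage: new[j] = sum_i transition[i][j] * fitness[i] * population[i]."""
    n = len(population)
    return [
        sum(transition[i][j] * fitness[i] * population[i] for i in range(n))
        for j in range(n)
    ]

def path_sum(start, end, steps, transition, rewards, temperature):
    """Sum over all paths start -> end of (path probability) * exp(total reward / T)."""
    n = len(rewards)
    total = 0.0
    for middle in itertools.product(range(n), repeat=steps - 1):
        path = (start, *middle, end)
        prob = math.prod(transition[a][b] for a, b in zip(path, path[1:]))
        # Reward is collected at every state the path leaves, so the final state is excluded.
        reward = sum(rewards[s] for s in path[:-1])
        total += prob * math.exp(reward / temperature)
    return total

rewards = [0.0, 1.0, -1.0]  # made-up per-state rewards
temperature = 0.5
n = len(rewards)
transition = [[1.0 / n] * n for _ in range(n)]  # fully random exploration
fitness = [math.exp(r / temperature) for r in rewards]

steps = 4
population = [1.0, 0.0, 0.0]  # everyone starts in state 0
for _ in range(steps):
    population = step(population, transition, fitness)

by_paths = [path_sum(0, end, steps, transition, rewards, temperature) for end in range(n)]
```

The two computations agree state by state. Lowering `temperature` makes the exponential weights increasingly favor the highest-reward paths, which is the limit the argument relies on.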

It’s precisely because GPT uses log-probability /​ temperature sampling that the mapping is so clean. GPT outputs are already probabilistic. Thus, the replication probabilities are real. In our context, we sample from until GPT learns how to reproduce the distribution we have in mind. Humans or GPT itself are the selectors in this process. Moreover, ultimately, we don’t even want to do RL because we want to sample from . Instead, we end up with a quasi-species cloud of viable samples. The beauty of this interpretation is that it implies that selection-pressure (reward function) is equivalent to modifying the underlying replication rates.

People have correctly pointed out that a single instance of GPT cannot learn. I’m not addressing that claim here. Instead, I’m suggesting that if we follow out the math a population of GPT under selection pressure from one another can constructively adapt to evaluable recurrent tasks. With those qualifications, let’s be speculative. Define a collection of interfacing GPT- as -G-. Assume the GPT can model interfaces to our computers. I conjecture that you can draw a convex phase-diagram where being in the epigraph is sufficient for -G- to amplify itself to any -G-.