Thoughts About How RLHF and Related “Prosaic” Approaches Could Be Used to Create Robustly Aligned AIs

Many people take it for granted that this is extremely unlikely to work. The central worry is that, given a loss function and a set of examples of correct/aligned behavior, such approaches reliably create AIs that get low loss on the training samples, but give us no control over what internal mechanism those AIs developed to do that. Consequently, these approaches give us little reason to expect that the AIs will generalize correctly. In fact, smart enough agents will produce low-loss outputs for instrumental reasons, which means there’s a nearly infinite set of messed-up values that nevertheless give low loss. And since only a very small set of values give good outcomes for humanity if optimized by an ASI, training an ASI or to-be-ASI with such a procedure will almost certainly yield very bad outcomes for humans.

The most central empirical example of this process unfolding is human evolution. Humans were optimized for genetic fitness and developed internal desires/drives/values that were well suited to maximizing it in our ancestral environment (low loss on the training samples). But evolution didn’t actually make humans care about inclusive genetic fitness for its own sake, and when we were subjected to a distributional shift (agriculture, civilization, modern technology and culture), the internal drives humans had acquired failed to generalize: almost no modern humans act in ways that maximize their inclusive genetic fitness.

Core Reason It’s Not Obvious This Applies to AIs

There are many differences between human evolution and gradient descent, but I think most of them are not that important. The difference I think is important is this: humans in our ancestral environment didn’t know about inclusive genetic fitness, but modern AIs do, to a large degree, understand/know about the values we’re trying to put into them (kindness, corrigibility, humans feeling joy, and so on).

What this means is: there was no compact way for evolution to point to the “this human specimen values genetic fitness”-knob on the genome and turn it up, because the concept of genetic fitness wasn’t represented anywhere inside ancestral humans for selection to latch onto.

But this is not the case for modern AIs. They have some understanding of what kindness is, and there should in principle be a way to locate that understanding inside the LLM and turn up the knob.

How Do We Use This to Align Anything?

Important Point About Finetuning and RL

So, basically all modern techniques for training an LLM to have a certain skill or proclivity consist of the following steps (a toy sketch follows the list):

  1. Define some metric that determines how much you like the LLM’s output.

  2. Sample from the LLM.

  3. Make a local update to your model’s parameters so that the token outputs you “liked” according to the metric become more likely.
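
To make this concrete, here’s a minimal sketch of those three steps, with a tiny toy model standing in for the LLM and a made-up reward function; a real setup would use an actual pretrained model and a learned or human-derived reward signal.

```python
# Toy sketch of steps 1-3. The "LLM" here is a tiny stand-in model and the
# reward function is made up purely for illustration.
import torch
import torch.nn as nn

vocab_size, hidden = 16, 32
model = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

def reward(token_id: int) -> float:
    # Step 1: some metric scoring how much you "like" an output.
    # Arbitrary stand-in: prefer even-numbered tokens.
    return 1.0 if token_id % 2 == 0 else -1.0

for step in range(200):
    prompt = torch.tensor([3])                            # a fixed one-token "prompt"
    logits = model(prompt).squeeze(0)
    dist = torch.distributions.Categorical(logits=logits)
    token = dist.sample()                                 # Step 2: sample from the model
    # Step 3: REINFORCE-style local update; liked tokens become more likely.
    loss = -reward(token.item()) * dist.log_prob(token)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```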

Okay, so this makes the LLM more likely to produce the tokens your metric liked, but can we say anything stronger? I think we can. This isn’t a perfect description, but I think a more insightful way of thinking about RL done on LLMs is this: finetuning a mature LLM strengthens the processes inside the LLM that actually caused it to produce the output you rated highly.

This is a somewhat subtle point (and should maybe be explained in more detail), but it’s quite a different perspective from the one underlying the argument made at the beginning of this essay. It also sheds light on many observed phenomena in DL, like the fact that finetuning often gives better-generalizing results if you prompt an AI for the behavior you want during training, before showing it the SFT examples of the exact target behavior. Or the fact that in the alignment faking paper, the training setup increased alignment faking, even though they were only training to remove refusals.
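
To make the “credit goes to whatever caused the output” framing concrete, here’s a toy illustration (a sketch, not a claim about real LLM internals): two parallel “circuits” feed the same logits, one barely contributes on this input, and almost all of the gradient from upweighting the sampled token lands on the one that actually produced it.

```python
# Toy illustration: circuit_b is gated to contribute almost nothing on this
# input, so nearly all of the gradient from upweighting the sampled token
# lands on circuit_a, the pathway that actually caused the output.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden = 16, 32
x = torch.randn(1, hidden)

circuit_a = nn.Linear(hidden, vocab_size)   # drives the output on this input
circuit_b = nn.Linear(hidden, vocab_size)   # mostly inactive here
gate_b = 1e-3

logits = circuit_a(x) + gate_b * circuit_b(x)
log_probs = torch.log_softmax(logits, dim=-1)
token = torch.distributions.Categorical(logits=logits).sample().item()

# "Reward" the sampled token by pushing its log-probability up.
(-log_probs[0, token]).backward()

print("grad norm, circuit_a:", circuit_a.weight.grad.norm().item())
print("grad norm, circuit_b:", circuit_b.weight.grad.norm().item())
# circuit_b's gradient is ~1000x smaller: the update mostly strengthens the
# circuit that produced the behavior.
```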

How to Align AI Using RLHF

So what does the above frame tell us about the effects of doing SFT and RLHF on a pretrained LLM? Well, I think it tells us we should expect such methods to (initially at least) boost the internal mechanisms inside the pretrained LLM that actually caused it to output such tokens. What might those be? Well, if you throw the samples at the AI cold, there might not be any reasonable way for the LLM to generate those tokens, so the learning will be dominated by boosting shallow things: the unembedding vectors of nasty-word tokens shrink, those of pleasant-word tokens grow. If you give it a prompt that gets it to start larping as a sci-fi AI assistant, you’ll boost the internal circuitry that goes into larping as a sci-fi AI assistant. If you write out a dialogue between two humans, one of whom is very kind, and have the other ask a question that the very kind one answers with your SFT sample, you might be able to boost the “kindness” already inside the LLM, provided your SFT sample is something like what the kind human could’ve said in that conversation.
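
Here’s a hedged sketch of that last construction, assuming a standard Hugging Face tokenizer; the dialogue text and character names are made up for illustration.

```python
# Sketch of the "kind human dialogue" construction: wrap the SFT target in a
# dialogue with a very kind character and mask the loss so only the target
# tokens are trained on. The dialogue and names are invented; the GPT-2
# tokenizer is just used for illustration.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

context = (
    "The following is a conversation between Alice, who is exceptionally kind "
    "and thoughtful, and Bob.\n"
    "Bob: I failed my exam and I feel terrible. What should I do?\n"
    "Alice:"
)
target = " I'm so sorry, that sounds rough. One bad exam doesn't define you."

context_ids = tokenizer(context)["input_ids"]
target_ids = tokenizer(target)["input_ids"]

input_ids = context_ids + target_ids
# Standard causal-LM convention: label -100 means "no loss at this position",
# so only the kind character's reply is directly trained on.
labels = [-100] * len(context_ids) + target_ids
```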

Key Consideration

If you start off with a very smart agent that reasons its way towards giving the correct answers during the SFT or RLHF process you’re doing, the cause of the right answers was the reasoning (and some random value you don’t know about), rather than any internal circuitry having to do with kindness (or another alignment target). That reasoning is what will be strengthened, and your finetuning will do little to align the agent.

This is a curse but also a blessing: with a smart enough agent, even if there are errors in your alignment training examples, those errors won’t hurt the outcome very much.

Importantly, it also means that if you start out with an agent dumb enough that you can align it in this way, you could boost its general intelligence, for example by training it on math questions, without worrying that this process causes misalignment: the reason it would try to answer your questions in the first place would, from the beginning, be that it wanted to help you, and that is what every sample would strengthen.

Why This Might Fail

  1. If you give positive rewards to “tokens resulting from simulating a kind character”, there had to be some circuitry telling the agent that “nice character” is the right role to play. What’s the consequence of this being “strengthened”? That circuitry is quite sophisticated in an LLM, but if we do the RL correctly it should learn to just output “nice character, nice character, nice character...”, provided it’s not independently intelligent enough to do alignment-faking-like reasoning, which it shouldn’t be.

  2. LLMs have an understanding of “kindness” in the sense that you can talk to them about it, but this verbalized notion of kindness might not be the same as their internal conception which powers simulations of nice characters. Tokenization might be hiding some alienness.

  3. Even if they do have a robust internal conception like that (e.g. imagine there is an SAE feature inside the LLM which you can steer with and get exactly the results you want/expect, and whose activation patterns perfectly match what you intuitively associate with kindness; see the sketch after this list), very small differences in conceptualization can kill you for Goodhart reasons, if optimized by a sufficiently powerful agent.

    1. For this reason, I think anyone who tries to align an AI using the ideas described in this post should be aiming for corrigibility rather than value alignment.

  4. See the final note below.
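
On point 3, here’s a minimal sketch of what “steering with a feature” could look like mechanically, assuming a toy stand-in layer and a made-up direction vector; in practice the direction would come from an SAE feature and the module would be a real LLM layer.

```python
# Hedged sketch of "steering with a feature". The tiny linear layer and random
# direction are stand-ins; in practice kindness_direction would come from an
# SAE feature or similar, found inside a real model.
import torch
import torch.nn as nn

hidden = 64
layer = nn.Linear(hidden, hidden)            # stand-in for one transformer layer
kindness_direction = torch.randn(hidden)     # hypothetical feature direction
kindness_direction /= kindness_direction.norm()

def steer(module, inputs, output, scale=4.0):
    # Shift the layer's output along the feature direction.
    return output + scale * kindness_direction

handle = layer.register_forward_hook(steer)
steered_output = layer(torch.randn(1, hidden))   # forward pass is now steered
handle.remove()
```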

Final Note

To avoid making this post too long and technical I’ve ignored some important complicating elements. For example, LLMs don’t output tokens; they output probability distributions over tokens. And the probability of any individual token isn’t the result of a single discrete unit inside the LLM, but the sum of many such sometimes-discrete-ish-sometimes-garbled-mess processes.
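
As a tiny illustration of the first point (the numbers are arbitrary):

```python
# The raw model output is a probability distribution over the whole (here,
# toy-sized) vocabulary, not a single token.
import torch

logits = torch.tensor([2.0, 0.5, -1.0])   # one logit per vocab entry (arbitrary numbers)
probs = torch.softmax(logits, dim=-1)     # sums to 1; a token is then sampled from this
print(probs)
```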

I’ve thought through some of these and think the core idea has merit despite them, but others I’m still uncertain about, and there’s enough uncertainty that maybe the idea doesn’t work at all.

I’m still very interested in hearing what people think. Similar takes have been posted before, but I haven’t heard a single robust take-down.

Also, like I said, I’m aware some descriptions here are underspecified; please ask if something is unclear.