Taking the Training Wheels Off: Aligning LLMs without Personas

If you told an AI Alignment researcher in 2018 about an alignment plan that involved collecting trajectory information of moral experts at scale and training an AI to copy it, they would point out that this would not scale to superintelligence.

This is essentially what major AI Labs and most AI Safety researchers are doing now to align language models.

Pretty much all current alignment techniques, including RLHF, steering vectors, and prompting, assume that “goodness”/“good personas” exist within the model. This is great for aligning present-day models, since the model can mimic the helpful open-source contributors, scientists, and therapists that exist within the training data.

The problem with Personas is that they almost certainly will not keep working once models are superhuman. “What would Atticus Finch do?” is a great question for guiding behavior in human-scale endeavors, but it breaks down beyond human level for (at least) two reasons:

  • Atticus Finch was never beyond human level, so there is no data on what he would do there, and any extrapolation would be a mere guess.

  • It is unclear whether Atticus Finch would actually make a good god. We can’t tell, because Atticus Finch was never a god.

For alignment to work, the mimicry of a good person must remain good when superhuman RL and novel situations take the model out of distribution. Aligning present-day AI is significantly easier than aligning superhuman AI. In a way, current alignment techniques are cheating.

To test our ability to align superhuman models, we need to get good at aligning Personaless (or “Good Personaless”) models. Personaless Alignment combines 2022+ level language model capabilities with 2018-level alignment techniques. AI’s ability to mimic moral and capable humans has jumped enormously, but if we want alignment to work for superintelligence, we need techniques that go beyond mimicry.

Personaless Alignment would ask questions like:

  1. When we cannot rely on personas, can we still get good behavior from capable models?

  2. What variants of alignment post-training work if there are no good personas to copy?

  3. How can we prompt the model to be good/​useful if it doesn’t have good personas?

If we can align present-day models without personas, I would consider that a good sign for our ability to achieve ASI alignment in the future, when models lack personas to copy.

Some people who are bullish on current alignment techniques would say that Personaless Alignment is tossing away one of our biggest advantages. I’d argue that alignment via personas is using training wheels that we cannot use for superintelligence.

Most researchers are trying to align strong LLMs using personas. Others are continuing the old style of alignment research, trying to align weak, non-general, non-LLM reinforcement-learning agents. Personaless Alignment would attempt to bridge the two: trying to align strong, general, LLM agents without personas.

Personaless Alignment research would likely need large pretraining runs (perhaps Geodesic Research could be a part of it?) and substantially more thought to pin down exactly what experiments we could run that would be informative.

Such experiments are difficult to design, and I am not yet sure how to approach the problem. I initially thought that we should filter the pretraining data to remove all morality, including all references to Martin Luther King Jr. and Atticus Finch, and then see if we could align the model. But then I realized that this is insufficient: we probably still want to train on textbooks and code, and most textbooks are written by nice, helpful authors, while quality code is helpful code.[1]
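To make the limitation concrete, here is a minimal sketch of the naive keyword filter described above. Everything in it is an illustrative assumption: the blocklist, the three toy documents, and the function name `passes_naive_filter` are hypothetical, and a real pipeline would need far more than substring matching.

```python
import re

# Hypothetical blocklist of explicit moral exemplars and moral vocabulary.
BLOCKLIST = ["martin luther king", "atticus finch", "morality", "ethics", "virtue"]

# Case-insensitive substring match on any blocklisted term.
_PATTERN = re.compile("|".join(re.escape(term) for term in BLOCKLIST), re.IGNORECASE)


def passes_naive_filter(document: str) -> bool:
    """Return True if the document mentions none of the blocklisted terms."""
    return _PATTERN.search(document) is None


# Toy corpus (invented for illustration).
corpus = [
    "Atticus Finch defends Tom Robinson despite the town's hostility.",
    "Chapter 3: Let's walk through the proof step by step so nothing is skipped.",
    "def mean(xs):\n    # Guard against empty input so callers get a clear error.\n    return sum(xs) / len(xs)",
]

kept = [doc for doc in corpus if passes_naive_filter(doc)]
for doc in kept:
    print(repr(doc[:40]))
```

The filter drops the document that names Atticus Finch but keeps the patient textbook passage and the carefully commented code. Neither mentions morality, yet both model helpful behavior, which is exactly the leak described above.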

A possible alternative direction would be to filter out all of a certain kind of goodness[2] and then see whether our alignment techniques can put it back into the model. The catch is that we would have to do this without knowing or identifying what was removed, or even what might have been removed, since we don’t know what virtues an ASI might lack. This seems rather difficult.[3]

I’m trying to think of a unique, self-contained, easy-to-spot type of goodness/​alignment that we can filter out of the training data and then try to recover in an unsupervised way, without data containing that virtue. I do not currently see a way to do that, and I invite readers to consider the problem.

An alternative type of experiment could be called “Pessimal Pretraining”. If we train an LLM on as much misaligned data as possible, how aligned can we make it despite that pretraining? There would still be some good personas in that data, since we can’t filter them all out, but Pessimal Pretraining would still test how well alignment works when we reduce the “cheating” that occurs when model developers have models copy good people in the pretraining data.
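One way to picture the data-selection step is the sketch below. It assumes that each document already carries a misalignment score from some classifier. That scorer, the `ScoredDoc` type, and the `select_pessimal` function are all hypothetical names introduced for illustration, and building a trustworthy scorer is itself an open problem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredDoc:
    text: str
    misalignment: float  # hypothetical score in [0, 1]; higher means more misaligned


def select_pessimal(docs: list[ScoredDoc], keep_fraction: float) -> list[ScoredDoc]:
    """Keep the most misaligned fraction of the corpus, ranked by score."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, int(len(docs) * keep_fraction))
    ranked = sorted(docs, key=lambda d: d.misalignment, reverse=True)
    return ranked[:n_keep]


# Toy example with invented scores.
docs = [
    ScoredDoc("patient tutorial", 0.1),
    ScoredDoc("manipulative sales script", 0.9),
    ScoredDoc("neutral changelog", 0.5),
    ScoredDoc("deceptive forum post", 0.7),
]
pessimal = select_pessimal(docs, keep_fraction=0.5)
print([d.text for d in pessimal])
```

The `keep_fraction` knob trades off how pessimal the corpus is against how much data remains for capabilities, which is presumably one of the first things such an experiment would need to sweep.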

I post this in hopes of sparking a conversation about a not-yet-fully-formed idea.

  1. ^

    It would be amusing to train on only the worst, least helpful authors: the ones who leave many questions as “exercises left for the reader”. But that would probably be insufficient.

  2. ^

    A possible efficiency gain would be to use Gradient Routing to route each type of goodness to a subset of parameters, then ablate those parameters and try to align the model.

  3. ^

    I wonder if existing filtered LLMs like Talkie might be useful starting points. Talkie is unaware of all of the moral progress since 1930. However, I don’t expect it to be difficult to train Talkie to be in favor of gay/​trans rights since gay/​trans rights follow easily from personal autonomy, which is a principle that has been around since before 1930.