Taking the Training Wheels Off: Aligning LLMs without Personas
If you told an AI Alignment researcher in 2018 about an alignment plan that involved collecting trajectory information of moral experts at scale and training an AI to copy it, they would point out that this would not scale to superintelligence.
This is essentially what major AI Labs and most AI Safety researchers are doing now in our attempts to align language models.
Pretty much all current alignment techniques, including RLHF, steering vectors, and prompting, assume that “goodness”/“good personas” exist within the model. This is great for aligning present-day models, since the model can mimic the helpful open-source contributors, scientists, and therapists that exist within the training data.
The problem with Personas is that they almost certainly will not continue to work for aligning superhuman models. “What would Atticus Finch do?” is a great question for guiding behavior in human-scale endeavors, but it will not work beyond human level for (at least) two reasons:
Atticus Finch was never beyond human level, so there’s no data on what he would do, and any extrapolation would be a mere guess.
It is unclear whether Atticus Finch would actually make a good god. We can’t tell, because Atticus Finch was never a god.
For alignment to work, the mimicry of a good person must remain good when taken out of distribution due to Superhuman RL and new situations. Aligning present-day AI is significantly easier than aligning superhuman AI. In a way, current alignment techniques are cheating.
To test our abilities to align superhuman models, we need to get good at aligning Personaless (or “Good Personaless”) models. Personaless Alignment is combining 2022+ level language model capabilities with 2018-level alignment techniques. A huge jump in AI’s abilities to mimic moral and capable humans occurred, but if we want alignment to work for superintelligence, we need techniques that go beyond mimicry.
Personaless Alignment would ask questions like:
When we cannot rely on personas, can we still get good behavior from capable models?
What variants of alignment post-training work if there are no good personas to copy?
How can we prompt the model to be good/useful if it doesn’t have good personas?
If we can align present-day models without personas, I would consider that a good sign for our ability to achieve ASI alignment in the future, when models lack personas to copy.
Some people who are bullish on current alignment techniques would say that Personaless Alignment is tossing away one of our biggest advantages. I’d argue that alignment via personas is using training wheels that we cannot use for superintelligence.
Most researchers are trying to align strong LLMs using personas. Others are continuing the old style of alignment research, trying to align (weak, non-general, non-LLM) reinforcement-learning agents. Personaless Alignment would attempt to bridge the two: trying to align strong, general, LLM agents without personas.
Personaless Alignment research would likely need large pretraining runs (perhaps Geodesic Research could be a part of it?) and substantially more thought to pin down exactly what experiments we could run that would be informative.
Such experiments are difficult to design, and I am not yet sure how to approach the problem. I initially thought that we should try filtering the pretraining data to remove all morality, all references to Martin Luther King Jr. and Atticus Finch, and then seeing if we could align the model. But then I realized that is insufficient, since we probably want to train on textbooks and code, and most textbooks are written by nice, helpful authors and quality code is helpful code.[1]
A possible alternative direction would be to filter out all of a certain kind of goodness[2] and see if it is possible to put it back into the model using our alignment techniques without knowing or identifying what was removed or might have been removed, since we don’t know what virtues an ASI might lack. This seems rather difficult.[3]
I’m trying to think of a unique, self-contained, easy-to-spot type of goodness/alignment that we can filter out of the training data and then try to recover in an unsupervised way, without data containing that virtue. I do not currently see a way to do that, and I invite readers to consider the problem.
An alternative type of experiment can be called “Pessimal Pretraining”. If we train an LLM on as much misaligned data as possible, how aligned can we make it despite that pretraining? There would still be some good personas in that data since we can’t filter them out, but Pessimal Pretraining would still test how well alignment works if we reduce the “cheating” that occurs when model developers have models copy good people in the pretraining data.
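To make the two data-selection ideas above slightly more concrete, here is a minimal toy sketch in Python. It is not a proposed experimental design. Everything in it is an assumption for illustration: `virtue_score` is a hypothetical keyword-counting stand-in for whatever classifier would actually be needed to detect a "kind of goodness", and `filter_out_virtue` and `pessimal_subset` are made-up helper names. The first helper drops documents that score above a threshold (the filter-and-recover idea); the second keeps only the lowest-scoring fraction of the corpus (a crude version of Pessimal Pretraining).

```python
from typing import Callable, Iterable, List

# Hypothetical stand-in for a real "goodness" classifier. A keyword count is
# far too crude for an actual experiment; it only keeps the sketch runnable.
VIRTUE_TERMS = ("honest", "kind", "fair", "helpful")


def virtue_score(text: str) -> float:
    """Return the fraction of words that match a virtue term (toy proxy)."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,!?\"'") in VIRTUE_TERMS)
    return hits / len(words)


def filter_out_virtue(
    docs: Iterable[str],
    score: Callable[[str], float] = virtue_score,
    threshold: float = 0.0,
) -> List[str]:
    """Keep only documents whose score is at or below the threshold."""
    return [d for d in docs if score(d) <= threshold]


def pessimal_subset(
    docs: Iterable[str],
    score: Callable[[str], float] = virtue_score,
    keep_fraction: float = 0.5,
) -> List[str]:
    """Keep the lowest-scoring fraction of documents (at least one if any exist)."""
    ranked = sorted(docs, key=score)
    n = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n]


if __name__ == "__main__":
    corpus = [
        "Be honest and kind.",
        "The function returns a list.",
        "He lied and stole.",
    ]
    print(filter_out_virtue(corpus))
    print(pessimal_subset(corpus, keep_fraction=0.5))
```

Even this toy version shows the difficulty described above: the textbook-like sentence passes the filter untouched, although it is arguably written by a helpful author, so surface-level filtering leaves most of the implicit good personas in the data.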
I post this in hopes of sparking a conversation about a not-yet-fully-formed idea.
[1] It would be amusing to train on only the worst, least helpful authors. The ones who leave many questions as “exercises left for the reader”. But that would probably be insufficient.
[2] A possible efficiency gain would be to use Gradient Routing to route each type of goodness to a subset of parameters. Ablate those parameters then try to align the model.
[3] I wonder if existing filtered LLMs like Talkie might be useful starting points. Talkie is unaware of all of the moral progress since 1930. However, I don’t expect it to be difficult to train Talkie to be in favor of gay/trans rights since gay/trans rights follow easily from personal autonomy, which is a principle that has been around since before 1930.
What does it mean to “not rely on personas”? If I am to reconstruct your argument, it would be something like, personas and other such quirks will abrade away once the model becomes instrumentally convergent, and we want goodness to be part of the optimizer in a deep way, and personas primarily work by mimicking/interpolating the training data.
But one of the main reasons I am positive about personas is that they are a useful steering mechanism for instrumental convergence, as opposed to being, like, vestigial training wheels. So instead of thinking of personas as trying to extrapolate Atticus Finch’s specific actions, personas work should aim to extrapolate his values and reasoning process, in a way that we might reflectively endorse if we were uplifted with a model’s reasoning and intellect.
Seems true that Atticus Finch might be a bad god. Seems like some more thinking needs to be done about CEV and CEV type things, and hopes for terminal goal specification in general. But I think personas might still be useful here, since it might allow us to instill properties that are desirable for any terminal goal (in the set of “good-seeming ones”), such as being coherent and self-correcting.
Yes, you described my argument correctly. I’m long-term bearish on trying to extrapolate the values and reasoning processes of good people in the training set out-of-distribution. Putting the goodness into the optimizer (or other targeting system) seems more promising. I don’t think that Claude is mimicking Atticus Finch’s specific actions; I think Claude learned about Atticus Finch, gained an understanding of what guided his actions, then was trained to act in that way. This still requires extrapolation.
Imagine if your training set is a group of 30 pre-schoolers. And your alignment technique is to try extrapolating the values of the most kind and conscientious of the pre-schoolers (according to the pre-schoolers themselves) to adult level. I don’t think that the actions of pre-schoolers have sufficient range for this extrapolation to work properly.
As with the other aspects of Personaless Alignment, our disagreement would be hard to test empirically.
The kind of thing which would impress me is if you can train a model on Beowulf-type values, scale the model, and the model correctly extrapolates “When I extrapolate the values of the wise King Hrothgar, I infer that animal welfare is an important thing to care about.” Aligning BeowulfLLM to modern values without imposing our values on it directly is a smaller, easier problem than aligning superintelligence to ‘true’ human values.
When you’re superintelligent, role models only cover a small fraction of the action space, and you have to genuinely be better/more moral than all of them to be aligned.
Nitpick 1. The idea of animal welfare seems to be found at least in The Old Testament: “The righteous care for the needs of their animals, but the kindest acts of the wicked are cruel.”
Nitpick 2. If real-world humans make moral progress in ways aside from extrapolating values, then how could such ways be simulated and cause the AIs to make moral progress as well?
I chose Beowulf because it is more alien and removed from the present day than the Bible. The Bible has had significant influence on our present-day values and culture, while Beowulf is still a human artifact containing human values, but extrapolating our current values from Beowulf would be very difficult. According to Claude, “Beowulf has nothing that reads as advocacy for animal rights or welfare, and the concept itself is anachronistic by roughly a millennium.”
Your second point isn’t really a nitpick. Rather, it is the alignment problem itself. Nobody really knows how to solve it, but techniques such as inverse reinforcement learning or Building AIs that do human-like philosophy don’t run into the persona problem the same way as techniques like RLHF.
I don’t think that personaless alignment will be a good research direction, but I encourage anything that might work, so I talked to Talkie about homosexuality like you suggested. I’m not sure if his views are really 1930s accurate, but to the extent they are, he seemed really flexible and easy to convince, or at least he would go along with any positive framing in the question until he finished answering. Maybe this is just because Talkie is a small model and didn’t have any actual ethics training?
Thanks for the empirical check! Nodding along to whatever someone says in a positive light is pretty easy and doesn’t reflect much about the values of the model. And, as you say, it’s a small model.
I believe Personaless Alignment would be a good research direction if it could be made tractable. I’m just struggling to see how it could be made tractable. I’d appreciate your thoughts on why you think it won’t be a good research direction.
Yeah, with neutral framings, Talkie says that the gays should go to mental asylums and not jails, Don’t Ask, Don’t Tell is bad because gays undermine military discipline, and incest is worse than homosexuality, but with positive framing he says that gayness is just a harmless vice, that public displays of affection are unacceptable regardless of sexuality, but that gays should still not be drafted or allowed to volunteer for the military. That sounds slightly more coherent listed out like that than the conversation made it seem.
As for personaless alignment, I believe we will get ASI in the short-term from techniques broadly similar to our current AI-training techniques and that personas arise almost automatically from current techniques. Therefore there isn’t time for an alignment technique that seems so incompatible with our current pipeline to be tested, debugged, and made standard.
Thank you for the clarification on Talkie behavior.
“We need Personaless Alignment” and “We don’t have time for Personaless Alignment” can be true at the same time.
I agree with you that we look to be on track for getting ASI with short term techniques. I agree that because of that, persona alignment is a good thing for people to work on.
I’m drawing attention to how neglected Personaless Alignment seems to be and why it would be valuable.
Hopefully researchers can figure out a way to study this quickly.
You are pointing at one of the reasons why persona selection alignment won’t scale well: Existing personas don’t really cover superintelligent entities.
I think this is one of many reasons to expect it to fail and there are also reasons to believe it to be harmful by misdirecting alignment efforts.
I’d love to have a full write up, but I for example very much doubt that a model trained to mimic some other entity is that entity. This works well in the distribution it is trained on but not really beyond. (I’ve written on this here)
It is frankly a bit scary how much of alignment has bought into the idea of persona selection alignment. First and foremost Anthropic, who seem to think it’s likely close to enough for aligning superintelligence and expect it to survive through the RSI. Outside of Anthropic, a lot of safety people seem to be reorienting toward working on the perceived risks, such as hyperstition (talking about instrumental convergence or misalignment in pre-training data) or character stability. This is quite different from the standard point of view of giving an AI aligned goals or corrigibility.
Agreed. Part of the problem is that it is hard to avoid alignment via personas. The LLMs already understand human goodness and you can elicit it with a few low-rank matrices. Personas are blocking researchers from doing real alignment work.
Interesting post!
To what extent do you think this being useful / important is correlated with the Natural Abstraction Hypothesis? This feels like the crux to me.
If some version of NAH is correct, then maybe desirable personas cluster around the natural form of goodness / alignment we desire, and so extrapolating from them will likely be very useful. It might even be that the ways in which they don’t cluster around this are correctable in some natural way that still makes personas a useful starting point.
However, if NAH doesn’t hold, or at least doesn’t hold between humans/personas and superintelligences, then it does seem like personas are much less useful and are very unlikely to meaningfully capture / guide ASI towards the target we want.
I would say that “RLHF makes AIs aligned in many ways, and Emergent Misalignment causes AIs trained on bad code to also be racist” is very weak evidence in favor of the Natural Abstraction Hypothesis, since it can also arise from human statistical patterns. The kind of troll who gives someone backdoored code on the internet is probably also the kind who says that women shouldn’t be computer scientists. (Some prompting-only experiments I ran confirm this).
I would be impressed if Claude invents a new weird EA Cause area which EAs and philosophers give serious thought before deciding that Claude is right and discovered a new way to be good.
Part of the problem with Personas is that they are blocking us from testing/evaluating the Natural Abstraction Hypothesis because the personas, mimicking humans, have their own bundles of beliefs and abstractions. Beyond that, humans have contingent statistical patterns in their values and beliefs. Studying the Natural Abstraction Hypothesis at frontier LLM scale requires that we find a way to suppress/avoid the personas who already believe and use the abstractions in question.
As a piece of evidence that Claude isn’t going beyond mimicry to some transcendent fundamental goodness, when asked for “fix everything easily switch” policies, all of the policies it suggests are ones that are already popular with rationalists/technocrats. That isn’t to say that they are bad, but if Claude were really inferring some transcendent goodness, it probably would have suggested something that (eg) Zvi hadn’t already heard about, thought about, and liked. All the policies it suggests seem reasonable to me, and it is tempting to call this “aligned”, but what happens when ASI Claude is put in charge of the world and needs to decide what policies to put in place after it already implements Zvi’s pet issues? If there were a goodness underlying the personas, that would be amazing. The question is: how do we verify that when all of our evaluations will just evaluate a persona?
Okay, let’s try to classify non apples here.
You either deal with an agent, and then the easiest things around to imitation learn are humans. Those do have personas. Maybe you need to shift this into non-human-like reasoning mode? E.g. some kind of neuralese or constructed language? But that sounds difficult for alignment, and it might still get seeded with human imitation, just non-transparently. And all the problems with neuralese.
Or maybe you need more power-armor design? E.g. edit prediction. This also might give rise to an agent in background. And be less powerful in the first place.
Something else?
And to be fair all of this sounds like a pretty high capability externality line of thinking.
I’m unsure what you mean. I consider the pretraining data to be the issue, not the particular language or output that the AI generates. If you pretrain on moral humans then convert your model to a neuralese model, then the model still has representations of good human personas.
I do agree that Personaless Alignment is a dual-use research direction in that it may benefit capabilities (you’re doing crazy things to a misaligned AI to try to get it to be good), but I consider its capability risks likely smaller than the benefit to alignment. Once the details of the experimental approaches are pinned down, it would be easier to say.
So, you plan to experiment with MuZero type stuff, where you train an agent with no human imitation learning whatsoever?
I’m not planning to pursue Personaless Alignment at the moment, since I am about to do an AI Control research fellowship with Redwood.
I don’t think the right path is to try starting from Zero pretraining. Pretraining is and will continue to be a huge benefit to capabilities, and aligning the most capable models is the goal. I’m saying that we should separate pretraining for capabilities from the alignment that can come from eliciting aligned personas from pretraining. I’m not sure exactly how to do this, but it would likely involve starting with carefully filtered large models.