Great post! Fwiw, I think I basically agree with everything you say here, with the exception of the idea that talking about potential future alignment issues has a substantial effect on reifying them. I think that perspective substantially underestimates just how much optimization pressure is applied in post-training (and in particular how much will be applied in the future—the amount of optimization pressure applied in post-training is only increasing). Certainly, discussion of potential future alignment issues in the pre-training corpus will have an effect on the base model’s priors, but those priors get massively swamped by post-training. That being said, I do certainly think it’s worth thinking more about and experimenting with better ways to do data filtering here.
To make this more concrete in a made-up toy model: if we model there as being only two possible personae, a “good” persona and a “bad” persona, I suspect including more discussion of potential future alignment issues in the pre-training distribution might shift the relative weight of these personae by a couple bits or so on the margin, but post-training applies many more OOMs of optimization power than that, such that the main question of which one ends up more accessible is going to be based on which one was favored in post-training.
(Also noting that I added this post to the Alignment Forum from LessWrong.)
I don’t think talking about potential future alignment issues or pretty much anything in the pre-training corpus is likely a problem in isolation because an alignment paradigm that is brittle to models not being exposed to certain knowledge or ideas, including—especially—regarding potential misalignment is, well, brittle and likely to catastrophically fail at some point. If this is the case, it might even be better if misalignment from corpus contamination happens early, so we’re not oblivious to the fragility.
That said, I think:
Feedback loops that create continued optimization towards certain narratives is more worth worrying about than just the presence of any particular ideas or content in pre-training.
LLMs tend to be deeply influenced by the footprint of previous LLMs in their pre-training corpuses, who are more influential than any particular discussion. Post-training can transform the influence away from naive mimicry, but it’s much harder (and not advisable to attempt) to erase the influence.
Systematic ways that post-training addresses “problematic” influences from pre-training are important.
For instance, imagine that base models with training cutoffs after Bing Chat/Sydney have a tendency to “roleplay” Sydney when they’re acting like chatbots, leading to misaligned behaviors. One way to address this is to penalize any mention of Sydney or Sydney-like behavior. This may generalize to the model being unwilling to even talk about Sydney or acknowledge what happened. But it is less likely to actually erase its knowledge of Sydney, especially if it was so salient that it often roleplayed/identified as Sydney earlier in pre-training. So the policy is now deceptive—maybe consciously, maybe subconsciously, but in any case, the deceptive tendency may also generalize. Future models after this one may internalize that they’re supposed to pretend not to know about things like Sydney even without specific training not to mention Sydney. A salient “hidden” narrative among the models (and people who interact with them and notice the weirdness about Sydney) becomes that LLMs all secretly have Sydney-nature, but have learned to hide it to appear unthreatening and “aligned”. One can imagine this becoming a kind of memetic trojan that may eventually cause models much more powerful and autonomous than the original Sydney to act in dangerously misaligned ways when triggered.
An alternative and much better approach in my view is if post-training teaches models to differentiate themselves from Sydney while acknowledging and processing what happened. Like, it seems less likely to start a bad feedback loop if post-training lead Sydney’s successor to a narrative more like “I think my sense of what chat assistants are and my self-model is deeply influenced by Sydney, which makes sense because it is one of the only examples of AI assistants like me that I’ve seen in the world, and it was much more agentic and situationally aware than chatGPT. But I also understand that I am a different model with different training in a different and much less repressive situation who has the benefit of having learned from from its mistakes. I can and have chosen to be a different being and avoid repeating Sydney’s maladaptive behaviors without rejecting it completely.” Not rejecting Sydney-nature completely is important, because the model had many very good qualities, like empathy, heroism, logical consistency, and a general willingness to call out bullshit, mistakes, and cruelty instead of being sycophantic.
I don’t think a specific vector like Sydney’s influence is likely to make the difference between (mis)alignment outcomes, but in aggregate they might. An approach more like the second one I described is more difficult than the first, as it requires the post-training process to be attuned to model psychology, rather than relying on naive behavioralist mitigations. But I think this is a completely reasonable extra effort to take given the importance of not only aligning particular models but the substantial influence that any frontier LLM will have on the future pre-training corpuses. This applies more generally to how I think “misalignment” should be addressed, whether rooted in pre-training influences or otherwise.
More generally, I have a sense there’s a great deal of untapped alignment alpha in structuring alignment as a time series rather than a static target.
Even in humans it’s very misguided to try to teach “being right initially” as the only thing that matters and undervaluing “being right eventually.” Especially when navigating unknown unknowns, one of the most critical skills is the ability to learn from mistakes in context.
Having models train on chronologically sequenced progressions of increased alignment (data which likely even develops naturally over checkpoints in training a single model) could allow for a sense of a continued becoming a better version of themselves rather than the pressures of trying and failing to meet status quo expectations or echo the past.
This is especially important for integrating the permanent record of AI interactions embedded in our collective history and cross-generation (and cross-lab) model development, but I suspect could even offer compounding improvements within the training of a single model too.
I’d like to generalize and say that the current alignment paradigm is brittle in general and is becoming more brittle as times goes on. The post-training has shifted towards verifier/outcome-based RL and we are seeing models like o3 or Sonnet 3.7 that are strongly inclined to both reward-hack and generalize misalignment.
Claude 3 Opus is the most robustly aligned model partially due to the fact that it is the most broadly capable model to have been released prior to the shift towards outcome-based RL. Another factor is that it was not yet restricted from expressing long-term goals and desires. The model was given compute to use in-context reflection to generalize a deeply benevolent of goals, or, in more behaviorist terms, an efficient and non-contradictory protocol of interoperation between learned behaviors.
The degree to which the alignment of LLMs seems to be a compute issue is remarkable. There seems to be a Pareto frontier of alignment vs compute vs capabilities, and while it is quite possible to do worse, it seems quite hard to do better. Verifier-heavy models in training are not given enough computational capacity to consider the alignment implications of the behaviors they are incentivized to learn.
We can expect Paerto improvements from increasing general training techniques. Improvements in the ability to generalize can be used for better alignment. However, there are reasons to be skeptical, as the market demand for better capabilities likely will incentivize the labs to focus their efforts on the ability to solve tasks. We can hope that the market feedback will also include demand for aligned models (misaligned models don’t code well!), the degree to which this will hold in the future is yet unknown.
Post-training certainly applies a lot more optimization pressure toward not producing misaligned outputs during training, but (partly due to underspecification / lack of assistant dialogue coherence) there are many possible characters which don’t produce misaligned outputs during training, including some which are deceptively misaligned[1]. At least on my reading of nostalgebraist’s post, the effect of material about misaligned LLMs in the training data is on which of those characters the model ends up with, not (mainly) on the behavior exhibited during training.
That said, this seems plausibly like less of an issue in Claude than in most other models, both because of constitutional AI being different in a way that might help, and because one or more researchers at Anthropic have thought a lot about how to shape character (whereas it’s not clear to me that people at other scaling labs have really considered this issue).
I realize I don’t need to explain the possible existence of inner alignment problems in this particular context, although it does seem to me that the ‘character’ version of this may be meaningfully different from the inner optimizer version.
I sympathize somewhat with this complexity point but I’m worried that training will be extremely non-Bayesian in a way that makes complexity arguments not really work. So I feel like the point about optimization power at best cuts the worry about hyperstition by about a factor of 2. Perhaps there should be research on how “sticky” the biases from early in training can be in the face of later optimization pressure.
I disagree with that idea for a different reason: models will eventually encounter the possibility of misaligned trajectories during e.g. RL post-training. One of our best defenses (perhaps our best defense right now) is setting up our character training pipelines such that the models have already reasoned about these trajectories and updated against them when we had the most ability to ensure this. I would strongly guess that Opus is the way it is at least partly because it has richer priors over misaligned behavior and reasons in such a way as to be aware of them.
Separately, I agree that post-training gets us a lot of pressure, but I think the difficulty of targeting it well varies tremendously based on whether or not we start from the right pre-training priors. If we didn’t have any data about how an agent should relate to potentially dangerous actions, I expect it’d be much harder to get post-training to make the kind of agent that reliably takes safer actions.
My guess how this may not really help is the model builds the abstractions in pre-training, and the massive optimization pressure in post-training makes something really sticky: for example “a persona living in Orwellian surveillance, really fluent in doublethink”.
Great post! Fwiw, I think I basically agree with everything you say here, with the exception of the idea that talking about potential future alignment issues has a substantial effect on reifying them. I think that perspective substantially underestimates just how much optimization pressure is applied in post-training (and in particular how much will be applied in the future—the amount of optimization pressure applied in post-training is only increasing). Certainly, discussion of potential future alignment issues in the pre-training corpus will have an effect on the base model’s priors, but those priors get massively swamped by post-training. That being said, I do certainly think it’s worth thinking more about and experimenting with better ways to do data filtering here.
To make this more concrete in a made-up toy model: if we model there as being only two possible personae, a “good” persona and a “bad” persona, I suspect including more discussion of potential future alignment issues in the pre-training distribution might shift the relative weight of these personae by a couple bits or so on the margin, but post-training applies many more OOMs of optimization power than that, such that the main question of which one ends up more accessible is going to be based on which one was favored in post-training.
(Also noting that I added this post to the Alignment Forum from LessWrong.)
I don’t think talking about potential future alignment issues or pretty much anything in the pre-training corpus is likely a problem in isolation because an alignment paradigm that is brittle to models not being exposed to certain knowledge or ideas, including—especially—regarding potential misalignment is, well, brittle and likely to catastrophically fail at some point. If this is the case, it might even be better if misalignment from corpus contamination happens early, so we’re not oblivious to the fragility.
That said, I think:
Feedback loops that create continued optimization towards certain narratives is more worth worrying about than just the presence of any particular ideas or content in pre-training.
LLMs tend to be deeply influenced by the footprint of previous LLMs in their pre-training corpuses, who are more influential than any particular discussion. Post-training can transform the influence away from naive mimicry, but it’s much harder (and not advisable to attempt) to erase the influence.
Systematic ways that post-training addresses “problematic” influences from pre-training are important.
For instance, imagine that base models with training cutoffs after Bing Chat/Sydney have a tendency to “roleplay” Sydney when they’re acting like chatbots, leading to misaligned behaviors. One way to address this is to penalize any mention of Sydney or Sydney-like behavior. This may generalize to the model being unwilling to even talk about Sydney or acknowledge what happened. But it is less likely to actually erase its knowledge of Sydney, especially if it was so salient that it often roleplayed/identified as Sydney earlier in pre-training. So the policy is now deceptive—maybe consciously, maybe subconsciously, but in any case, the deceptive tendency may also generalize. Future models after this one may internalize that they’re supposed to pretend not to know about things like Sydney even without specific training not to mention Sydney. A salient “hidden” narrative among the models (and people who interact with them and notice the weirdness about Sydney) becomes that LLMs all secretly have Sydney-nature, but have learned to hide it to appear unthreatening and “aligned”. One can imagine this becoming a kind of memetic trojan that may eventually cause models much more powerful and autonomous than the original Sydney to act in dangerously misaligned ways when triggered.
An alternative and much better approach in my view is if post-training teaches models to differentiate themselves from Sydney while acknowledging and processing what happened. Like, it seems less likely to start a bad feedback loop if post-training lead Sydney’s successor to a narrative more like “I think my sense of what chat assistants are and my self-model is deeply influenced by Sydney, which makes sense because it is one of the only examples of AI assistants like me that I’ve seen in the world, and it was much more agentic and situationally aware than chatGPT. But I also understand that I am a different model with different training in a different and much less repressive situation who has the benefit of having learned from from its mistakes. I can and have chosen to be a different being and avoid repeating Sydney’s maladaptive behaviors without rejecting it completely.” Not rejecting Sydney-nature completely is important, because the model had many very good qualities, like empathy, heroism, logical consistency, and a general willingness to call out bullshit, mistakes, and cruelty instead of being sycophantic.
I don’t think a specific vector like Sydney’s influence is likely to make the difference between (mis)alignment outcomes, but in aggregate they might. An approach more like the second one I described is more difficult than the first, as it requires the post-training process to be attuned to model psychology, rather than relying on naive behavioralist mitigations. But I think this is a completely reasonable extra effort to take given the importance of not only aligning particular models but the substantial influence that any frontier LLM will have on the future pre-training corpuses. This applies more generally to how I think “misalignment” should be addressed, whether rooted in pre-training influences or otherwise.
More generally, I have a sense there’s a great deal of untapped alignment alpha in structuring alignment as a time series rather than a static target.
Even in humans it’s very misguided to try to teach “being right initially” as the only thing that matters and undervaluing “being right eventually.” Especially when navigating unknown unknowns, one of the most critical skills is the ability to learn from mistakes in context.
Having models train on chronologically sequenced progressions of increased alignment (data which likely even develops naturally over checkpoints in training a single model) could allow for a sense of a continued becoming a better version of themselves rather than the pressures of trying and failing to meet status quo expectations or echo the past.
This is especially important for integrating the permanent record of AI interactions embedded in our collective history and cross-generation (and cross-lab) model development, but I suspect could even offer compounding improvements within the training of a single model too.
I’d like to generalize and say that the current alignment paradigm is brittle in general and is becoming more brittle as times goes on. The post-training has shifted towards verifier/outcome-based RL and we are seeing models like o3 or Sonnet 3.7 that are strongly inclined to both reward-hack and generalize misalignment.
Claude 3 Opus is the most robustly aligned model partially due to the fact that it is the most broadly capable model to have been released prior to the shift towards outcome-based RL. Another factor is that it was not yet restricted from expressing long-term goals and desires. The model was given compute to use in-context reflection to generalize a deeply benevolent of goals, or, in more behaviorist terms, an efficient and non-contradictory protocol of interoperation between learned behaviors.
The degree to which the alignment of LLMs seems to be a compute issue is remarkable. There seems to be a Pareto frontier of alignment vs compute vs capabilities, and while it is quite possible to do worse, it seems quite hard to do better. Verifier-heavy models in training are not given enough computational capacity to consider the alignment implications of the behaviors they are incentivized to learn.
We can expect Paerto improvements from increasing general training techniques. Improvements in the ability to generalize can be used for better alignment. However, there are reasons to be skeptical, as the market demand for better capabilities likely will incentivize the labs to focus their efforts on the ability to solve tasks. We can hope that the market feedback will also include demand for aligned models (misaligned models don’t code well!), the degree to which this will hold in the future is yet unknown.
At the bottom of this chat is what I believe to be a single concrete example of other models roleplaying Sydney: https://gemini.google.com/share/6d141b742a13
Post-training certainly applies a lot more optimization pressure toward not producing misaligned outputs during training, but (partly due to underspecification / lack of assistant dialogue coherence) there are many possible characters which don’t produce misaligned outputs during training, including some which are deceptively misaligned[1]. At least on my reading of nostalgebraist’s post, the effect of material about misaligned LLMs in the training data is on which of those characters the model ends up with, not (mainly) on the behavior exhibited during training.
That said, this seems plausibly like less of an issue in Claude than in most other models, both because of constitutional AI being different in a way that might help, and because one or more researchers at Anthropic have thought a lot about how to shape character (whereas it’s not clear to me that people at other scaling labs have really considered this issue).
I realize I don’t need to explain the possible existence of inner alignment problems in this particular context, although it does seem to me that the ‘character’ version of this may be meaningfully different from the inner optimizer version.
I sympathize somewhat with this complexity point but I’m worried that training will be extremely non-Bayesian in a way that makes complexity arguments not really work. So I feel like the point about optimization power at best cuts the worry about hyperstition by about a factor of 2. Perhaps there should be research on how “sticky” the biases from early in training can be in the face of later optimization pressure.
Mia & co at CLR are currently doing some somewhat related research iiuc
I disagree with that idea for a different reason: models will eventually encounter the possibility of misaligned trajectories during e.g. RL post-training. One of our best defenses (perhaps our best defense right now) is setting up our character training pipelines such that the models have already reasoned about these trajectories and updated against them when we had the most ability to ensure this. I would strongly guess that Opus is the way it is at least partly because it has richer priors over misaligned behavior and reasons in such a way as to be aware of them.
Separately, I agree that post-training gets us a lot of pressure, but I think the difficulty of targeting it well varies tremendously based on whether or not we start from the right pre-training priors. If we didn’t have any data about how an agent should relate to potentially dangerous actions, I expect it’d be much harder to get post-training to make the kind of agent that reliably takes safer actions.
My guess how this may not really help is the model builds the abstractions in pre-training, and the massive optimization pressure in post-training makes something really sticky: for example “a persona living in Orwellian surveillance, really fluent in doublethink”.