Thank you for providing a good introduction and arguments in favour of this research direction. Whilst I strongly agree with the idea of safety pre-training being valuable (and have even considered working on it myself with some collaborators), I think there are several core claims here that are false and that ultimately one should not consider alignment to be solved.
RL is not as bad as you make it out to be. The short version is that RL is about reinforcing good behaviour that has been sampled from the model. The model does not have access to the reward signal directly, and is not engaging in planning or search to maximise received reward during the RL process by default (though it could learn to start doing this if it was sufficiently intelligent, self-aware, and trained via RL long enough). The longer version is here and here.
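To make that mechanism concrete, here is a minimal toy sketch of a REINFORCE-style update (a bandit-sized stand-in "policy" and a made-up reward function, purely for illustration and not taken from the post or the paper). The reward never enters the model as an input; it only appears as a scalar weight on the log-probability of behaviour the model has already sampled.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 8
# Stand-in for an LLM's output head: a single vector of logits over a tiny vocabulary.
logits = torch.zeros(vocab_size, requires_grad=True)

def reward_fn(token_id: int) -> float:
    # Hypothetical external reward signal; the policy never receives this value as an
    # input -- it only shows up as the scalar weighting in the loss below.
    return 1.0 if token_id == 3 else 0.0

optimizer = torch.optim.SGD([logits], lr=0.5)
for _ in range(300):
    probs = F.softmax(logits, dim=-1)
    token = torch.multinomial(probs, 1).item()   # sample behaviour from the current policy
    advantage = reward_fn(token) - 0.5           # crude fixed baseline
    loss = -advantage * torch.log(probs[token])  # reinforce (or suppress) what was sampled
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(F.softmax(logits, dim=-1))  # probability mass has shifted strongly toward token 3
```

The update simply shifts probability mass toward whatever sampled behaviour happened to score well, rather than the model doing any explicit planning or search over rewards.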
The output distribution of an SFT’d model is not the training distribution, even with cross-entropy loss, unless you’re training on non-adversarial data and sampling the model with no conditioning. Data poisoning attacks can happen and influence outputs much more strongly than would be expected by the proportion of them in the training data. When you prompt an LLM to be a capable chat or agentic model, you’re already moving quite far out of its normal training distribution, and so you cannot make good predictions on how it will behave based on general proportions of good / bad training data and its annotations.
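As a toy illustration of this point (my own example with made-up numbers, using a trivial counting model as a stand-in for an LLM): cross-entropy training fits the conditional distribution of the completion given the context, so a poisoned pattern that is a tiny fraction of the corpus can still dominate behaviour whenever its trigger context appears.

```python
from collections import Counter, defaultdict

# 9,990 benign examples plus 10 poisoned ones that share a rare trigger context.
corpus = [("hello", "safe_reply")] * 9990 + [("TRIGGER_XYZ", "unsafe_reply")] * 10

counts = defaultdict(Counter)
for context, completion in corpus:
    counts[context][completion] += 1  # normalised counts are the cross-entropy-optimal conditionals

marginal_unsafe = sum(c["unsafe_reply"] for c in counts.values()) / len(corpus)
conditional_unsafe = counts["TRIGGER_XYZ"]["unsafe_reply"] / sum(counts["TRIGGER_XYZ"].values())

print(f"unsafe fraction of the training data: {marginal_unsafe:.4f}")    # 0.0010
print(f"p(unsafe | trigger context):          {conditional_unsafe:.4f}") # 1.0000
```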
A lot of my probability mass for AIs doing bad things comes from a mixture of out-of-context reasoning and situational awareness leading to unexpectedly intelligent behaviour. Models have been shown to be capable of both. I’d predict both of these (especially when co-occurring) would degrade the extent to which an alignment token / conditioning would work.
There might be other capabilities, like the aforementioned ones, that models acquire on the way to stronger performance and that interact with this approach in a negative way.
I don’t think this actually addresses inner alignment as effectively as you imagine. I think in the situation you’re considering where you prompt the model with this alignment conditioning, it’s not guaranteed that the model will have correctly generalised the meaning of this objective from what it saw in pre-training to being a totally aligned agent. I agree it probably helps a lot, but you still face the same-old inner-alignment issues of correctly generalising something that was seen in (pre-)training to deployment, when deployment looks different to training. This is somewhat of a generalisation of points 2-4 above.
Whilst this helps a lot with outer alignment, I don’t think it solves that completely either. Yes, models are able to recognise and probably correctly annotate a lot of data for pre-training, but are we really confident this is going to effectively capture human values, or some coherent-extrapolated-volition thereof? Even with all annotations correct, this objective seems like “output nice, safe-looking text that a human might output” and not “genuinely optimise for human values”. Value alignment aside, it’s probable this method will greatly help with intent alignment, and whilst intent alignment is probably a good target for current frontier AI, it comes with many of its own problems, primarily misuse, which is where another large chunk of my P(doom) comes from.
I also want to generalise points 3, 4, 6, and what Steven Byrnes is claiming into: Training a model to act like an aligned human-level intelligence is not the same as training a model to act like an aligned super-intelligence, and whatever we do to raise capabilities here may also alter or break alignment, and so cannot be relied upon.
TL;DR I think safety pre-training is probably a huge boost to alignment, but our work is far from done and there are still lots of issues / uncertainties.
RL is not as bad as you make it out to be. The short version is that RL is about reinforcing good behaviour that has been sampled from the model. The model does not have access to the reward signal directly, and is not engaging in planning or search to maximise received reward during the RL process by default (though it could learn to start doing this if it was sufficiently intelligent, self-aware, and trained via RL long enough). The longer version is here and here.
I don’t pretend to be an expert on RL. However, I have read a number of papers by people who are (and give links to some of them above), and together they read to me as pretty damning.
Obviously RL can give a model new behaviors: for example, AlphaZero was trained entirely by RL from zero to superhuman at Go. However, even if it were the case that RL as used in practice for aligning LLMs primarily just reinforces behaviors already in the base model (a claim that I’d love to see sources for and read more about), humans are not aligned, and have plenty of unaligned behaviors (e.g. self-interest, deceit, power-seeking, assorted vices…) that could be extremely dangerous if reinforced in an AGI (let alone an ASI), so I don’t regard that as being inherently safe.
However, this post wasn’t really intended to be a detailed critical discussion of why I think using RL for alignment is a potential x-risk: it’s a link-post, and my aim was just to remind people that many people are concerned about using RL for alignment, mostly for Inner Alignment reasons, with a brief sketch of why they’re concerned, in order to motivate why a paper proposing an alternative to RL for alignment was worth reading. For many years people have been worrying about Inner Alignment (almost) entirely in the context of aligning models with RL — using SGD on a fixed, curated dataset instead changes the playing field for Inner Alignment dramatically. The outcome of SGD is just far more predictable, stable, and easy to reason about than that of RL.
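For contrast with the RL sketch earlier in the thread, here is an equally minimal sketch of the supervised objective (again a toy of my own, not the paper’s training code): plain cross-entropy against fixed target tokens from a curated dataset, with no sampling loop and no reward signal, so the optimisation target is just the data itself.

```python
import torch
import torch.nn.functional as F

vocab_size = 8
logits = torch.zeros(vocab_size, requires_grad=True)  # stand-in for an LLM's output head
targets = torch.tensor([3, 3, 3, 5])                  # hypothetical annotated target tokens

optimizer = torch.optim.SGD([logits], lr=0.5)
for _ in range(200):
    # Plain log-likelihood of fixed targets: nothing is sampled from the model, no reward is involved.
    loss = F.cross_entropy(logits.unsqueeze(0).expand(len(targets), -1), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(F.softmax(logits, dim=-1))  # approaches the empirical token frequencies (0.75 on 3, 0.25 on 5)
```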
The output distribution of an SFT’d model is not the training distribution, even with cross-entropy loss, unless you’re training on non-adversarial data and sampling the model with no conditioning.
I know (and briefly mentioned) that the output distribution is only approximately the training distribution. I wasn’t aware that adversarial attacks could exploit that (though that sounds inherently plausible), and I would love to read more about that — can you recommend some sources?
As for conditioning, yes, obviously so — a prompt sufficiently unlike any text found on the internet could push the model far enough out of distribution to make its output unpredictable. Though the response must still be based on some extrapolation from the training set, predicting how the model will actually extrapolate may not be obvious. However, IMO that’s more a problem with the prompt than the model — just don’t use out-of-distribution prompts like that if you want predictable behavior!