Thank you!
Your post was also very good and I agree with its points. I’ll probably edit my post in the near future to reference it along with some of the other good references your post had that I wasn’t aware of.
Yes, it’s still unclear how to measure modification magnitude in general (or whether that’s even possible to do in a principled way), but for modifications that are limited to text, you could use the entropy of the text, which seems to me like a fairly reasonable and somewhat fundamental measure (in the information-theoretic sense). Thank you for the references in your other comment, I’ll make sure to give them a read!
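For concreteness, here’s a minimal sketch of the kind of measure I have in mind, using a simple character-frequency entropy estimate (a real measure would more likely use a language model’s per-token log-probabilities, but the idea is the same; the example strings are made up):

```python
import math
from collections import Counter

def entropy_bits_per_char(text: str) -> float:
    """Empirical Shannon entropy of a string, in bits per character,
    estimated from its character frequencies."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Rough "magnitude" of a text modification: its total information content,
# i.e. entropy rate times the length of the inserted / changed text.
modification = "jumps over the sleepy cat"
print(entropy_bits_per_char(modification) * len(modification), "bits (approx.)")
```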
Thank you, this looks very interesting
Thank you for providing a good introduction and arguments in favour of this research direction. Whilst I strongly agree that safety pre-training is valuable (and have even considered working on it myself with some collaborators), I think several of the core claims here are false, and that one should not ultimately consider alignment to be solved.
1. RL is not as bad as you make it out to be. The short version is that RL is about reinforcing good behaviour that has been sampled from the model. The model does not have access to the reward signal directly, and is not engaging in planning or search to maximise received reward during the RL process by default (though it could learn to start doing this if it were sufficiently intelligent, self-aware, and trained via RL for long enough). The longer version is here and here. (A toy sketch of what this reinforcement looks like mechanically is given after this list.)
2. The output distribution of an SFT’d model is not the training distribution, even with cross-entropy loss, unless you’re training on non-adversarial data and sampling the model with no conditioning. Data-poisoning attacks can happen and influence outputs much more strongly than would be expected from their proportion of the training data. When you prompt an LLM to be a capable chat or agentic model, you’re already moving quite far out of its normal training distribution, and so you cannot make good predictions about how it will behave from the overall proportions of good / bad training data and its annotations. (A toy illustration of this is given after this list.)
3. A lot of my probability mass for AIs doing bad things comes from a mixture of out-of-context reasoning and situational awareness leading to unexpectedly intelligent behaviour. Models have been shown to be capable of both. I’d predict that both of these (especially co-occurring) would degrade the extent to which an alignment token / conditioning would work.
4. There might be other capabilities, like the aforementioned ones, that models acquire on the way to stronger performance and that interact with this approach in a negative way.
5. I don’t think this actually addresses inner alignment as effectively as you imagine. In the situation you’re considering, where you prompt the model with this alignment conditioning, it’s not guaranteed that the model will have correctly generalised the meaning of this objective from what it saw in pre-training to being a fully aligned agent. I agree it probably helps a lot, but you still face the same old inner-alignment issue of correctly generalising something seen in (pre-)training to deployment, when deployment looks different from training. This is somewhat of a generalisation of points 2-4 above.
6. Whilst this helps a lot with outer alignment, I don’t think it solves that completely either. Yes, models are able to recognise and probably correctly annotate a lot of data for pre-training, but are we really confident this is going to effectively capture human values, or some coherent extrapolated volition thereof? Even with all annotations correct, this objective seems like “output nice, safe-looking text that a human might output”, not “genuinely optimise for human values”. Value alignment aside, it’s probable this method will greatly help with intent alignment, and whilst intent alignment is probably a good target for current frontier AI, it comes with many of its own problems, primarily misuse, which is where another large chunk of my P(doom) comes from.
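To make point 1 more concrete, here is a minimal REINFORCE-style sketch (a toy bandit in PyTorch, not any lab’s actual RLHF pipeline; the vocabulary size, reward function, and hyperparameters are made up for illustration). The reward only enters the update as a scalar weight on the log-probability of behaviour the model has already sampled; the policy never receives the reward as an input and performs no search over it.

```python
import torch

vocab_size = 10
logits = torch.zeros(vocab_size, requires_grad=True)   # toy "policy" parameters
optimizer = torch.optim.SGD([logits], lr=0.1)

def reward_fn(token: int) -> float:
    # Stand-in for a reward model / human preference signal, computed outside the model.
    return 1.0 if token == 3 else 0.0

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    token = dist.sample()                  # the model samples behaviour as usual
    reward = reward_fn(token.item())       # the model never sees this scalar as an input
    loss = -reward * dist.log_prob(token)  # rewarded samples get reinforced, others don't
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0))        # probability mass shifts toward token 3
```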
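And a toy illustration of point 2: with a hypothetical corpus that is 1% poisoned, the marginal proportion of bad completions says almost nothing about the conditional behaviour a cross-entropy-trained model would reproduce once the trigger appears in the prompt.

```python
from collections import Counter, defaultdict

# Hypothetical corpus of (prompt, completion) pairs, 1% poisoned with a trigger.
corpus = [("hello", "safe")] * 990 + [("TRIGGER", "unsafe")] * 10

marginal = Counter(completion for _, completion in corpus)
print("P(unsafe) overall:", marginal["unsafe"] / len(corpus))        # 0.01

by_prompt = defaultdict(Counter)
for prompt, completion in corpus:
    by_prompt[prompt][completion] += 1

cond = by_prompt["TRIGGER"]
print("P(unsafe | TRIGGER):", cond["unsafe"] / sum(cond.values()))   # 1.0
```

A model trained to match these conditionals (which is what next-token cross-entropy pushes towards) outputs the unsafe completion essentially every time it sees the trigger, despite the 1% overall rate.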
I also want to generalise points 3, 4, 6, and what Steven Byrnes is claiming into: Training a model to act like an aligned human-level intelligence is not the same as training a model to act like an aligned super-intelligence, and whatever we do to raise capabilities here may also alter or break alignment, and so cannot be relied upon.
TL;DR I think safety pre-training is probably a huge boost to alignment, but our work is far from done and there are still lots of issues / uncertainties.