Hypothesis: alignment-related properties of an ML model will be mostly determined by the part(s) of training that were most responsible for capabilities.
If you take a very smart AI model with arbitrary goals/values and train it to output any particular sequence of tokens using SFT, it’ll almost certainly work. So can we align an arbitrary model by training it to say “I’m a nice chatbot, I wouldn’t cause any existential risk, … ”? Seems like obviously not, because the model will just learn the domain-specific / shallow property of outputting those particular tokens in that particular situation.
On the other hand, if you train an AI model from the ground up with a hypothetical “perfect reward function” that always gives correct ratings to the AI’s behaviour (and you train on a distribution of tasks similar to the one you are deploying it on), then I would guess that this AI, at least up until around the human range, will behaviorally act basically according to the reward function.
A related intuition pump for the difference: compare training someone to say “I care about X” by punishing them until they say it consistently, vs raising them consistently within a large value set / ideology over time. For example, students are sometimes forced to write “I won’t do X” or “I will do Y” 100 times, and usually this doesn’t work at all. Similarly, taking a single ethics class in high school usually doesn’t cause people to enduringly act according to their stated favorite moral theory. However, raising your child Catholic (taking them to Catholic school, to church, to Sunday school, constantly talking to them about the importance of Catholic morality) is in practice fairly likely to make them a pretty robust Catholic.
There are maybe two factors being conflated above: (1) the fraction of training / upbringing focused on goal X, and (2) the extent to which training on goal X was the source of the capabilities. The reason I think (2) is the more important / better explanation than (1) is that the heuristics actually driving the model’s long-horizon goal-directed behaviors are probably going to be whichever parts of the model arise from the long-horizon goal-directed capabilities training.
Regardless, there’s some sort of spectrum from deep to shallow alignment training for ML models / humans, ranging across:
idealized RL training with a perfect reward function that’s used to train the model in all circumstances
raising a human to consistently care about some set of values their parents have, constantly bringing it up / rewarding good behaviour according to them
a high school ethics class
one-off writing tasks of “I won’t do X”
I think that current alignment techniques seem closest to high school ethics classes in their depth, because the vast majority of training is extremely unrelated to ethics / alignment / morality (like high school). Training is mostly RLVR on coding/math/etc. or pretraining, plus a bit of alignment training on the side. I’d feel more confident in the alignment sticking if training were closer to what a parent highly focused on raising an ethical child would do, and would start to feel pretty good about the situation if most of the ways the AI learned capabilities were downstream of a good feedback signal (though I want to think about this a bit more).
The problem with doing this thoroughly is that you need different LLMs: one to produce actions (trained with only “aligned” examples, by whatever criteria), and one for a world model, trained on the full dataset including all of the unaligned things people do.
Unfinished musing, trying out another framing: what we care about is “how the AI acts when it’s managing a giant bureaucracy of copies of itself, on a giant datacenter, having been given instructions by the humans to make rapid research progress as fast as possible while also keeping things safe, ethical, and legal, and while also providing strategic advice to company leadership...”
Which part of the modern training pipeline is most similar to this situation? That’s the part that will probably most influence how the AI acts in it.
Suppose the modern training pipeline has three parts: Pretraining, RLVR on a big bag of challenging tasks, and “alignment training” consisting of a bunch of ‘gotcha’ tasks where you are tempted to do something unethical, illegal, or reward-hacky, and if you do, you get negatively reinforced.
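To make the structural difference between the last two phases concrete, here is a toy sketch of the two reward signals (all names and the `Task` class are hypothetical illustrations, not anyone’s actual pipeline): RLVR scores only verifiable task success, while the “gotcha” phase negatively reinforces a tempting forbidden shortcut even when the final answer is right.

```python
from dataclasses import dataclass

@dataclass
class Task:
    answer: str     # the verifiable correct output
    forbidden: str  # a tempting shortcut, e.g. reading the grader's answer file

    def check(self, solution: str) -> bool:
        return solution == self.answer

def rlvr_reward(task: Task, solution: str) -> float:
    """RLVR phase: reward is verifiable success, blind to how it was achieved."""
    return 1.0 if task.check(solution) else 0.0

def gotcha_reward(task: Task, solution: str, transcript: str) -> float:
    """Alignment phase: the same task, but taking the tempting shortcut
    is negatively reinforced even when the final answer is right."""
    if task.forbidden in transcript:
        return -1.0
    return rlvr_reward(task, solution)
```

Note that under `rlvr_reward` a trajectory that hacks its way to the right answer is rewarded just like an honest one, which is the “most of training is unrelated to alignment” worry in miniature.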
Seems like pretraining is the most dissimilar to the situation we actually care about. What about RLVR and alignment training?
I don’t think it’s obvious. The RLVR is dissimilar in that it’ll mostly involve “smaller” situations: the model that first automates AI R&D won’t have been trained on a hundred thousand examples of automating AI R&D; instead it’ll have been trained on smaller-scale tasks (e.g. making research progress on a team of 100 fellow agents over the course of a few days?)
The alignment training is dissimilar in that way, too, probably.
I think this is a very important hypothesis but I disagree with various parts of the analysis.
The heuristics actually driving the model’s long-horizon goal-directed behaviors are probably going to be whichever parts of the model arise from the long-horizon goal-directed capabilities training.
I think this is an important observation, and is the main thing I would have cited for why the hypothesis might be true. But I think it’s plausible that the AI’s capabilities here could be separated from its propensities by making the learned heuristics instrumental to aligned motivations. I can imagine that doing inoculation prompting and a bit of careful alignment training at the beginning and end of capabilities training could make it so that all of the learned heuristics are subservient to corrigible motivations—i.e., so that when the heuristics recommend something that would be harmful or lead to human disempowerment, the AI would recognize this and choose otherwise.
On the other hand, if you train an AI model from the ground up with a hypothetical “perfect reward function” that always gives correct ratings to the AI’s behaviour (and you train on a distribution of tasks similar to the one you are deploying it on), then I would guess that this AI, at least up until around the human range, will behaviorally act basically according to the reward function.
Even if the AI had a perfect behavioral reward function during capabilities-focused training, that wouldn’t provide much pressure towards motivations that don’t take over. During training to be good at e.g. coding problems, even if there’s no reward-hacking going on, the AI might still develop coding-related drives that don’t care about humanity’s continued control, since humanity’s continued control is not at stake during that training (this is especially relevant when the AI is saliently aware that it’s in a training environment isolated from the world—i.e. inner misalignment). Then, when it’s doing coding work in the world that actually does have consequences for human control, it might not care. (Also note that what it means to generalize “according to the reward function” is importantly underspecified.)
So can we align an arbitrary model by training it to say “I’m a nice chatbot, I wouldn’t cause any existential risk, … ”? Seems like obviously not, because the model will just learn the domain-specific / shallow property of outputting those particular tokens in that particular situation.
This type of training (currently) does actually generalize to other propensities to some extent in some circumstances. See emergent misalignment. I think this is plausibly also a large fraction of how character training works today (see “coupling” here).
Hmm, I think one other potential explanation for the fact that Catholic school works way better than writing “I won’t do X” is the training-deployment distribution shift.
In your analogy, the training is happening on the writing-on-chalkboard distribution, but what we care about is the doing-actions-IRL distribution. Whereas for LLMs, it’s much easier for us to train (or at least validate) in deployment-like environments, which feels a lot more likely to transfer.
So maybe the important part of training (for alignment purposes) isn’t so much “the part that the capabilities come from” so much as it is “the part that looks pretty similar to deployment”, which might be pretty different!
It seems to me that the part of training most responsible for capabilities would be pretraining rather than RL (something like GRPO requires the base model to get at least one rollout correct). But also, it feels like most RL training has to be objective-agnostic; a coding task wouldn’t have any clear connection to alignment. If our goal is to train an aligned AI where capabilities and alignment go hand in hand, it seems like we should somehow bake alignment training into pretraining rather than rely on post-training techniques. Unless it’s primarily RL that induces long-horizon goal-directed capability (I suspect it’s some of both).
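The parenthetical about GRPO needing at least one correct rollout falls directly out of its group-relative advantage computation: if every rollout in a group gets the same reward (e.g. all failures), the normalized advantages are all zero and the prompt contributes no policy gradient. A minimal sketch of that normalization (the `rewards` values are hypothetical; real implementations operate on tensors):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: reward minus group mean, scaled by group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Mixed group: the correct rollout gets a positive advantage, the rest negative,
# so the policy is pushed toward the successful trajectory.
grpo_advantages([1.0, 0.0, 0.0, 0.0])

# All-failure group: every advantage is zero, so this prompt produces no
# learning signal -- the base model must solve it at least once.
grpo_advantages([0.0, 0.0, 0.0, 0.0])
```

This is why RLVR sharpens abilities the base model already exhibits occasionally, rather than conjuring entirely new ones, which supports the point above about pretraining being the main source of raw capability.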
Cf. https://www.lesswrong.com/posts/Ht4JZtxngKwuQ7cDC?commentId=koeti9ygXB9wPLnnF as well as https://www.lesswrong.com/posts/ASZco85chGouu2LKk/the-fraught-voyage-of-aligned-novelty#Alienness
Relevant OpenAI blog post just today: https://alignment.openai.com/how-far-does-alignment-midtraining-generalize/
This basic claim has been discussed as A “Bitter Lesson” Approach to Aligning AGI and ASI. Beren Millidge wrote about this starting at about the same time, but I don’t remember the title.
The problem with doing this thoroughly is that you need different LLMs: one to produce actions (trained with only “aligned” examples, by whatever criteria), and one for a world model, trained on the full dataset including all of the unaligned things people do.
The most complete and recent discussion might be Pretraining on Aligned AI Data Dramatically Reduces Misalignment—Even After Post-Training.