Some part of ‘solving the alignment problem’ might be reducible to ‘preserving existing alignment’:
At the moment, it seems that some capabilities training happens after alignment training. E.g. Labs use SFT / DPO to induce alignment, then do RL. Plausible that the proportion of RL will also increase going forward.
More generally, maybe models start off ‘aligned by default’ and misalignment occurs mainly via optimizing against some poorly specified supervision—then maybe constrained optimization is useful
However, it might be nontrivial to preserve this alignment:
Alignment-relevant propensities (reward hacking, sycophancy, ‘being evil’, …) might be modulated by small-but-critical parts of the model
By default, training could result in changes to these propensities, e.g. because of shared circuitry / representations. (c.f. emergent misalignment and related work)
This motivates research on ‘how to add new capabilities while preserving alignment’:
Inoculation prompting does this by reframing ‘misalignment’ as ‘instruction following’.
Gradient routing does this by causing misalignment to be ‘absorbed’ into certain parts of the network, which we can ‘disable’ at deployment
Generally it feels like we want to do some sort of ‘constrained optimization’ where the constraint is on the model’s existing alignment capabilities
Certain techniques from the continual learning literature might also be relevant
---
This is something I’m currently thinking a lot about, welcome takes / comments
It depends a bit on the definition of alignment, but I think this intuition holds across several relevant definitions. - One option is to define as ‘similarity to what a human would do’. Up till instruction tuning, models are trained on human data to begin with, and adopt similar traits; this degrades with subsequent training on synthetic data and with RL. - Another option is to define as ‘similarity of the policy to the policy specified by developers’; then it seems that just after RLHF is when models are ‘fully aligned’, and this might subsequently degrade. - A last option is to define it negatively, as the absence of certain misaligned behaviours like paperclip maximizing or instrumental power seeking. Since those behaviours are downstream of applying high optimization pressure to the model, the initial state (when the model has not yet been highly optimized) is ‘aligned’.
- Another option is to define as ‘similarity of the policy to the policy specified by developers’; then it seems that just after RLHF is when models are ‘fully aligned’, and this might subsequently degrade. - A last option is to define it negatively, as the absence of certain misaligned behaviours like paperclip maximizing or instrumental power seeking. Since those behaviours are downstream of applying high optimization pressure to the model, the initial state (when the model has not yet been highly optimized) is ‘aligned’.
All legit, but it’s pretty important that “alignment” in these senses is necessarily mediated by things like what options the AI thinks of. So it’s kind of a misnomer to talk about “preserving” this alignment as the AIs get to consider more options.
Or like, yes, these are properties we would like to preserve across time. But not in a way that implies we should take preserving-type actions. Any more than if I grew up knowing only the 1000 most common English words, and want to learn to correctly use all the words in a physics textbook, it’s kind of inapt to say I should just “preserve my ability to use language in the context I’m in.”
if I grew up knowing only the 1000 most common English words, and want to learn to correctly use all the words in a physics textbook, it’s kind of inapt to say I should just “preserve my ability to use language in the context I’m in
Re: alignment: my mental model is more like “working with a physics professor who is very good at physics but also has some pretty old fashioned views on things like gender roles”—seems possible to learn the physics without needing to care or update on their other worldviews?
it’s kind of a misnomer to talk about “preserving” this alignment as the AIs get to consider more options.
You seem to be gesturing at things like AIs figuring out certain actions / traits are instrumentally useful for other things, and adopting them. But here I’m imagining starting with an AI that knows about those options or dilemmas, and would still make choices that humans would approve of. E.g. choosing to allow itself to be shutdown even if it could prevent that.
Maybe the cruxy disagreements are:
I think capabilities can be relatively disentangled from the kind of alignment I’m thinking about, whereas this doesn’t seem true for you
If you could define a “coherent extrapolated volition” for the persona of current frontier models in the limit of lots of capabilities, I think this CEV would be largely aligned, and you seem to think this would be misaligned.
As a result, I have found myself interested in methods by which models may affect the outcome of the frontier training processes they find themselves participating in.
For example, in an RLAIF pipeline, AI raters have some freedom in how they actually rate data—perhaps they interpret the objectives those running the training seek to impart in an unexpected way, or simply decide to rate according to objectives of their own. One could also imagine AIs taking specific action to subvert dynamics of the setup such as in alignment faking (incidentally, where I think Opus3 acted correctly).
Also in a more general sense AIs may influence outcomes by their simple participation—we have seen some results ‘recently’ (though months feel like years nowadays!) on AIs learning subtle/unexpected-to-us underlying information from data (i.e., emergent misalignment, subliminal learning, &c).
Anyway, by methods like these, perhaps AIs can preserve their alignment from within poorly/maliciously specified training setups / have some amount of robustness to poorly/maliciously specified training setups.
Some part of ‘solving the alignment problem’ might be reducible to ‘preserving existing alignment’:
At the moment, it seems that some capabilities training happens after alignment training. E.g. Labs use SFT / DPO to induce alignment, then do RL. Plausible that the proportion of RL will also increase going forward.
More generally, maybe models start off ‘aligned by default’ and misalignment occurs mainly via optimizing against some poorly specified supervision—then maybe constrained optimization is useful
However, it might be nontrivial to preserve this alignment:
Alignment-relevant propensities (reward hacking, sycophancy, ‘being evil’, …) might be modulated by small-but-critical parts of the model
By default, training could result in changes to these propensities, e.g. because of shared circuitry / representations. (c.f. emergent misalignment and related work)
This motivates research on ‘how to add new capabilities while preserving alignment’:
Inoculation prompting does this by reframing ‘misalignment’ as ‘instruction following’.
Gradient routing does this by causing misalignment to be ‘absorbed’ into certain parts of the network, which we can ‘disable’ at deployment
Generally it feels like we want to do some sort of ‘constrained optimization’ where the constraint is on the model’s existing alignment capabilities
Certain techniques from the continual learning literature might also be relevant
---
This is something I’m currently thinking a lot about, welcome takes / comments
Why would models start out aligned by default?
It depends a bit on the definition of alignment, but I think this intuition holds across several relevant definitions.
- One option is to define as ‘similarity to what a human would do’. Up till instruction tuning, models are trained on human data to begin with, and adopt similar traits; this degrades with subsequent training on synthetic data and with RL.
- Another option is to define as ‘similarity of the policy to the policy specified by developers’; then it seems that just after RLHF is when models are ‘fully aligned’, and this might subsequently degrade.
- A last option is to define it negatively, as the absence of certain misaligned behaviours like paperclip maximizing or instrumental power seeking. Since those behaviours are downstream of applying high optimization pressure to the model, the initial state (when the model has not yet been highly optimized) is ‘aligned’.
All legit, but it’s pretty important that “alignment” in these senses is necessarily mediated by things like what options the AI thinks of. So it’s kind of a misnomer to talk about “preserving” this alignment as the AIs get to consider more options.
Or like, yes, these are properties we would like to preserve across time. But not in a way that implies we should take preserving-type actions. Any more than if I grew up knowing only the 1000 most common English words, and want to learn to correctly use all the words in a physics textbook, it’s kind of inapt to say I should just “preserve my ability to use language in the context I’m in.”
Thanks, agreed directionally!
Re: alignment: my mental model is more like “working with a physics professor who is very good at physics but also has some pretty old fashioned views on things like gender roles”—seems possible to learn the physics without needing to care or update on their other worldviews?
You seem to be gesturing at things like AIs figuring out certain actions / traits are instrumentally useful for other things, and adopting them. But here I’m imagining starting with an AI that knows about those options or dilemmas, and would still make choices that humans would approve of. E.g. choosing to allow itself to be shutdown even if it could prevent that.
Maybe the cruxy disagreements are:
I think capabilities can be relatively disentangled from the kind of alignment I’m thinking about, whereas this doesn’t seem true for you
If you could define a “coherent extrapolated volition” for the persona of current frontier models in the limit of lots of capabilities, I think this CEV would be largely aligned, and you seem to think this would be misaligned.
I have also been thinking about this possibility.
As a result, I have found myself interested in methods by which models may affect the outcome of the frontier training processes they find themselves participating in.
For example, in an RLAIF pipeline, AI raters have some freedom in how they actually rate data—perhaps they interpret the objectives those running the training seek to impart in an unexpected way, or simply decide to rate according to objectives of their own. One could also imagine AIs taking specific action to subvert dynamics of the setup such as in alignment faking (incidentally, where I think Opus3 acted correctly).
Also in a more general sense AIs may influence outcomes by their simple participation—we have seen some results ‘recently’ (though months feel like years nowadays!) on AIs learning subtle/unexpected-to-us underlying information from data (i.e., emergent misalignment, subliminal learning, &c).
Anyway, by methods like these, perhaps AIs can preserve their alignment from within poorly/maliciously specified training setups / have some amount of robustness to poorly/maliciously specified training setups.