It depends a bit on the definition of alignment, but I think this intuition holds across several relevant definitions.
- One option is to define it as ‘similarity to what a human would do’. Up until instruction tuning, models are trained on human data to begin with, and adopt similar traits; this degrades with subsequent training on synthetic data and with RL.
- Another option is to define it as ‘similarity of the policy to the policy specified by developers’; then it seems that just after RLHF is when models are ‘fully aligned’, and this might subsequently degrade.
- A final option is to define it negatively, as the absence of certain misaligned behaviours like paperclip maximizing or instrumental power seeking. Since those behaviours are downstream of applying high optimization pressure to the model, the initial state (when the model has not yet been highly optimized) is ‘aligned’.
All legit, but it’s pretty important that “alignment” in these senses is necessarily mediated by things like what options the AI thinks of. So it’s kind of a misnomer to talk about “preserving” this alignment as the AIs get to consider more options.
Or like, yes, these are properties we would like to preserve across time. But not in a way that implies we should take preserving-type actions. Any more than if I grew up knowing only the 1000 most common English words, and want to learn to correctly use all the words in a physics textbook, it’s kind of inapt to say I should just “preserve my ability to use language in the context I’m in.”
Thanks, agreed directionally!
Re: alignment: my mental model is more like “working with a physics professor who is very good at physics but also has some pretty old fashioned views on things like gender roles”—seems possible to learn the physics without needing to care or update on their other worldviews?
You seem to be gesturing at things like AIs figuring out that certain actions or traits are instrumentally useful for other things, and adopting them. But here I’m imagining starting with an AI that already knows about those options or dilemmas, and would still make choices that humans would approve of. E.g. choosing to allow itself to be shut down even if it could prevent that.
Maybe the cruxy disagreements are:
- I think capabilities can be relatively disentangled from the kind of alignment I’m thinking about, whereas this doesn’t seem true for you.
- If you could define a “coherent extrapolated volition” for the persona of current frontier models in the limit of lots of capabilities, I think this CEV would be largely aligned, and you seem to think it would be misaligned.