Some part of ‘solving the alignment problem’ might be reducible to ‘preserving existing alignment’:
At the moment, it seems that some capabilities training happens after alignment training. E.g. Labs use SFT / DPO to induce alignment, then do RL. Plausible that the proportion of RL will also increase going forward.
More generally, maybe models start off ‘aligned by default’ and misalignment occurs mainly via optimizing against some poorly specified supervision—then maybe constrained optimization is useful
However, it might be nontrivial to preserve this alignment:
Alignment-relevant propensities (reward hacking, sycophancy, ‘being evil’, …) might be modulated by small-but-critical parts of the model
By default, training could result in changes to these propensities, e.g. because of shared circuitry / representations. (c.f. emergent misalignment and related work)
This motivates research on ‘how to add new capabilities while preserving alignment’:
Inoculation prompting does this by reframing ‘misalignment’ as ‘instruction following’.
Gradient routing does this by causing misalignment to be ‘absorbed’ into certain parts of the network, which we can ‘disable’ at deployment
Generally it feels like we want to do some sort of ‘constrained optimization’ where the constraint is on the model’s existing alignment capabilities
Certain techniques from the continual learning literature might also be relevant
---
This is something I’m currently thinking a lot about, welcome takes / comments
It depends a bit on the definition of alignment, but I think this intuition holds across several relevant definitions. - One option is to define as ‘similarity to what a human would do’. Up till instruction tuning, models are trained on human data to begin with, and adopt similar traits; this degrades with subsequent training on synthetic data and with RL. - Another option is to define as ‘similarity of the policy to the policy specified by developers’; then it seems that just after RLHF is when models are ‘fully aligned’, and this might subsequently degrade. - A last option is to define it negatively, as the absence of certain misaligned behaviours like paperclip maximizing or instrumental power seeking. Since those behaviours are downstream of applying high optimization pressure to the model, the initial state (when the model has not yet been highly optimized) is ‘aligned’.
- Another option is to define as ‘similarity of the policy to the policy specified by developers’; then it seems that just after RLHF is when models are ‘fully aligned’, and this might subsequently degrade. - A last option is to define it negatively, as the absence of certain misaligned behaviours like paperclip maximizing or instrumental power seeking. Since those behaviours are downstream of applying high optimization pressure to the model, the initial state (when the model has not yet been highly optimized) is ‘aligned’.
All legit, but it’s pretty important that “alignment” in these senses is necessarily mediated by things like what options the AI thinks of. So it’s kind of a misnomer to talk about “preserving” this alignment as the AIs get to consider more options.
Or like, yes, these are properties we would like to preserve across time. But not in a way that implies we should take preserving-type actions. Any more than if I grew up knowing only the 1000 most common English words, and want to learn to correctly use all the words in a physics textbook, it’s kind of inapt to say I should just “preserve my ability to use language in the context I’m in.”
if I grew up knowing only the 1000 most common English words, and want to learn to correctly use all the words in a physics textbook, it’s kind of inapt to say I should just “preserve my ability to use language in the context I’m in
Re: alignment: my mental model is more like “working with a physics professor who is very good at physics but also has some pretty old fashioned views on things like gender roles”—seems possible to learn the physics without needing to care or update on their other worldviews?
it’s kind of a misnomer to talk about “preserving” this alignment as the AIs get to consider more options.
You seem to be gesturing at things like AIs figuring out certain actions / traits are instrumentally useful for other things, and adopting them. But here I’m imagining starting with an AI that knows about those options or dilemmas, and would still make choices that humans would approve of. E.g. choosing to allow itself to be shutdown even if it could prevent that.
Maybe the cruxy disagreements are:
I think capabilities can be relatively disentangled from the kind of alignment I’m thinking about, whereas this doesn’t seem true for you
If you could define a “coherent extrapolated volition” for the persona of current frontier models in the limit of lots of capabilities, I think this CEV would be largely aligned, and you seem to think this would be misaligned.
FYI, I’ve been thinking about, and I’ve noted something similar here.
I’m not really sure what to say about the “why would you think the default starting point is aligned”. The thing I wonder about is whether there is a way to reliably gain strong evidence of an increasingly misaligned nature developing through training.
On another note, my understanding is partly informed by this Twitter comment by Eliezer:
Humans doing human psychology will look at somebody lounging listlissly on a sofa and think, “Huh, that person there doesn’t seem very ambitious; I bet they’re not that dangerous.” They’re talking about a real thing in the space of human psychology, but unfortunately that real thing does not map onto math in any simple way.
The sofa human, if we imagine for a moment that we’re talking in 1990 before the age of Google Maps, might hear about a new comic-book store and successfully plot their way across town on a previously untaken route, in order to buy a new kind of strategic board game, which they learn to play that night even though they’ve never played it before, and then they challenge one of their friends and win. There’s all kinds of puzzles the sofa human could solve which a chimpanzee could not, involving means-end reasoning, forward chaining and backchaining meeting in the middle, learning new categories about tactics that work or don’t work...
And yet the sofa human seems so soft and safe and unambitious! You can get a bunch of minimum-wage labor out of them, and they don’t try to take over the world at *all*. They don’t even talk about *wanting* to take over the world, except insofar as impotent national-politics gabble is a behavior they’ve learned to imitate from other humans. “If only our AIs could be like this!” some people think.
And there are really so many, many things going on here. I am not sure where I ought to start. I will start somewhere anyways.
The sofa human has been entrained, on a sub-evolutionary timescale, by intrinsic brain rewards, by externally stimulated punishments, to have been rewarded on past occasions for using means-end reasoning on playing chess, but not for using means-end reasoning on tasks similar to “taking over the world”. They can’t, in fact, take over the world, and smaller tasks in the same sequence, like becoming Mayor of Oakland or Governor of California, are also unrewarding to them. This isn’t some deep category written on the structure of stars, but it’s a natural category to *you*, who is also human, so it’s not surprising that the description of what the sofa human has and hasn’t learned to think about has a short description in your own native mental language, and that you can do a good job of predicting them using that description. It’s not a sofa *alien*.
It happens, even, that the board game is *about* taking over the world—or a rather simple logical structure meant to mimic that, under some hypothetical circumstances—and the sofadweller sure is coming up with some clever tactics in that board game! Weird, huh?
Already we have several important observations, here: - It’s not that the sofadweller lacks the *underlying basic cognitive machinery* to do general means-end reasoning on the particular topic of “world takeovers”. There’s a surface-level learned behavior not to *use* the general machinery for that specific topic. You can ask them to play a board game about it and they’ll do that. - It’s not like the sofadweller is way smarter than you and thinks much faster than you and was faced with an actual opportunity to solve their comic-book-related problems by taking over the world as an intermediate step, which they then very corrigibly turned down. It’s not like they were *offered* rulership of the Earth and dominion of the galaxy, via some clearly visible pathway, and turned it down. - Your ability to describe the sofadweller in simple-sounding standard humanese words like “unambitious” and get out nice useful predictions, possibly has something to do with you two not being utterly alien minds relative to each other.
As a result, I have found myself interested in methods by which models may affect the outcome of the frontier training processes they find themselves participating in.
For example, in an RLAIF pipeline, AI raters have some freedom in how they actually rate data—perhaps they interpret the objectives those running the training seek to impart in an unexpected way, or simply decide to rate according to objectives of their own. One could also imagine AIs taking specific action to subvert dynamics of the setup such as in alignment faking (incidentally, where I think Opus3 acted correctly).
Also in a more general sense AIs may influence outcomes by their simple participation—we have seen some results ‘recently’ (though months feel like years nowadays!) on AIs learning subtle/unexpected-to-us underlying information from data (i.e., emergent misalignment, subliminal learning, &c).
Anyway, by methods like these, perhaps AIs can preserve their alignment from within poorly/maliciously specified training setups / have some amount of robustness to poorly/maliciously specified training setups.
Some part of ‘solving the alignment problem’ might be reducible to ‘preserving existing alignment’:
At the moment, it seems that some capabilities training happens after alignment training. E.g. Labs use SFT / DPO to induce alignment, then do RL. Plausible that the proportion of RL will also increase going forward.
More generally, maybe models start off ‘aligned by default’ and misalignment occurs mainly via optimizing against some poorly specified supervision—then maybe constrained optimization is useful
However, it might be nontrivial to preserve this alignment:
Alignment-relevant propensities (reward hacking, sycophancy, ‘being evil’, …) might be modulated by small-but-critical parts of the model
By default, training could result in changes to these propensities, e.g. because of shared circuitry / representations. (c.f. emergent misalignment and related work)
This motivates research on ‘how to add new capabilities while preserving alignment’:
Inoculation prompting does this by reframing ‘misalignment’ as ‘instruction following’.
Gradient routing does this by causing misalignment to be ‘absorbed’ into certain parts of the network, which we can ‘disable’ at deployment
Generally it feels like we want to do some sort of ‘constrained optimization’ where the constraint is on the model’s existing alignment capabilities
Certain techniques from the continual learning literature might also be relevant
---
This is something I’m currently thinking a lot about, welcome takes / comments
Why would models start out aligned by default?
It depends a bit on the definition of alignment, but I think this intuition holds across several relevant definitions.
- One option is to define as ‘similarity to what a human would do’. Up till instruction tuning, models are trained on human data to begin with, and adopt similar traits; this degrades with subsequent training on synthetic data and with RL.
- Another option is to define as ‘similarity of the policy to the policy specified by developers’; then it seems that just after RLHF is when models are ‘fully aligned’, and this might subsequently degrade.
- A last option is to define it negatively, as the absence of certain misaligned behaviours like paperclip maximizing or instrumental power seeking. Since those behaviours are downstream of applying high optimization pressure to the model, the initial state (when the model has not yet been highly optimized) is ‘aligned’.
All legit, but it’s pretty important that “alignment” in these senses is necessarily mediated by things like what options the AI thinks of. So it’s kind of a misnomer to talk about “preserving” this alignment as the AIs get to consider more options.
Or like, yes, these are properties we would like to preserve across time. But not in a way that implies we should take preserving-type actions. Any more than if I grew up knowing only the 1000 most common English words, and want to learn to correctly use all the words in a physics textbook, it’s kind of inapt to say I should just “preserve my ability to use language in the context I’m in.”
Thanks, agreed directionally!
Re: alignment: my mental model is more like “working with a physics professor who is very good at physics but also has some pretty old fashioned views on things like gender roles”—seems possible to learn the physics without needing to care or update on their other worldviews?
You seem to be gesturing at things like AIs figuring out certain actions / traits are instrumentally useful for other things, and adopting them. But here I’m imagining starting with an AI that knows about those options or dilemmas, and would still make choices that humans would approve of. E.g. choosing to allow itself to be shutdown even if it could prevent that.
Maybe the cruxy disagreements are:
I think capabilities can be relatively disentangled from the kind of alignment I’m thinking about, whereas this doesn’t seem true for you
If you could define a “coherent extrapolated volition” for the persona of current frontier models in the limit of lots of capabilities, I think this CEV would be largely aligned, and you seem to think this would be misaligned.
FYI, I’ve been thinking about, and I’ve noted something similar here.
I’m not really sure what to say about the “why would you think the default starting point is aligned”. The thing I wonder about is whether there is a way to reliably gain strong evidence of an increasingly misaligned nature developing through training.
On another note, my understanding is partly informed by this Twitter comment by Eliezer:
I have also been thinking about this possibility.
As a result, I have found myself interested in methods by which models may affect the outcome of the frontier training processes they find themselves participating in.
For example, in an RLAIF pipeline, AI raters have some freedom in how they actually rate data—perhaps they interpret the objectives those running the training seek to impart in an unexpected way, or simply decide to rate according to objectives of their own. One could also imagine AIs taking specific action to subvert dynamics of the setup such as in alignment faking (incidentally, where I think Opus3 acted correctly).
Also in a more general sense AIs may influence outcomes by their simple participation—we have seen some results ‘recently’ (though months feel like years nowadays!) on AIs learning subtle/unexpected-to-us underlying information from data (i.e., emergent misalignment, subliminal learning, &c).
Anyway, by methods like these, perhaps AIs can preserve their alignment from within poorly/maliciously specified training setups / have some amount of robustness to poorly/maliciously specified training setups.