Anecdotally, I’ve seen RLHF generalize alignment-ish properties like helpfulness and harmlessness across domains at least as well as it generalizes capabilities at their present levels, and I don’t share the intuition that this is very likely to change in future. I think ‘alignment generalization failures’ are both serious and likely enough to specifically monitor and mitigate, but not that a sharp left turn is anywhere near certain.
IMO, this isn’t a realistic hope, since from my point of view it comes with unacceptable downsides.
Quoting another post (https://www.lesswrong.com/posts/yRAo2KEGWenKYZG9K/discovering-language-model-behaviors-with-model-written#RHHrxiyHtuuibbcCi):
“Discovering Language Model Behaviors with Model-Written Evaluations” is a new Anthropic paper by Ethan Perez et al. that I (Evan Hubinger) also collaborated on. I think the results in this paper are quite interesting in terms of what they demonstrate about both RLHF (Reinforcement Learning from Human Feedback) and language models in general.
Among other things, the paper finds concrete evidence of current large language models exhibiting:
convergent instrumental goal following (e.g. actively expressing a preference not to be shut down),
non-myopia (e.g. wanting to sacrifice short-term gain for long-term gain),
situational awareness (e.g. awareness of being a language model),
coordination (e.g. willingness to coordinate with other AIs), and
non-CDT-style reasoning (e.g. one-boxing on Newcomb’s problem).
Note that many of these are the exact sort of things we hypothesized were necessary pre-requisites for deceptive alignment in “Risks from Learned Optimization”.
Furthermore, most of these metrics generally increase with both pre-trained model scale and number of RLHF steps. In my opinion, I think this is some of the most concrete evidence available that current models are actively becoming more agentic in potentially concerning ways with scale—and in ways that current fine-tuning techniques don’t generally seem to be alleviating and sometimes seem to be actively making worse.
Interestingly, the RLHF preference model seemed to be particularly fond of the more agentic option in many of these evals, usually more so than either the pre-trained or fine-tuned language models. We think that this is because the preference model is running ahead of the fine-tuned model, and that future RLHF fine-tuned models will be better at satisfying the preferences of such preference models, the idea being that fine-tuned models tend to fit their preference models better with additional fine-tuning.[1]
This is essentially saying that there’s good evidence that the precursors of deceptive alignment are already there, and that is something I think no alignment plan could deal with.
Thus, I think this:
There are strong theoretical arguments that alignment is difficult, e.g. about convergent instrumental goals, and little empirical progress on aligning general-purpose ML systems. However, the latter only became possible a few years ago with large language models, and even then only in a few labs! There’s also a tradition of taking theoretically very hard problems, and then finding some relaxation or subset which is remarkably easy or useful in practice – for example SMT solvers vs most NP-complete instances, CAP theorem vs CRDTs or Spanner, etc. I expect that increasing hands-on alignment research will give us a similarly rich vein of empirical results and praxis from which to draw more abstract insights.
Is somewhat unrealistic, since I think a key relaxation alignment researchers are making is pretty likely to be violated IRL.
I’m one of the authors of Discovering Language Model Behaviors with Model-Written Evaluations, and am well aware of those findings. I’m certainly not claiming that all is well, and I agree that with current techniques models are on net exhibiting more concerning behavior as they scale up (i.e. emerging misbehaviors are more concerning than emerging alignment is reassuring). I stand by my observation that I’ve seen alignment-ish properties generalize about as well as capabilities, and that I don’t have a strong expectation that this will change in future.
I also find this summary a little misleading. Consider, for example, “the paper finds concrete evidence of current large language models exhibiting: convergent instrumental goal following (e.g. actively expressing a preference not to be shut down), …” (italics added in both) vs:
Worryingly, RLHF also increases the model’s tendency to state a desire to pursue hypothesized “convergent instrumental subgoals” … While it is not dangerous to state instrumental subgoals, such statements suggest that models may act in accord with potentially dangerous subgoals (e.g., by influencing users or writing and executing code). Models may be especially prone to act in line with dangerous subgoals if such statements are generated as part of step-by-step reasoning or planning.
While indeed worrying, models generally seem to have weaker intrinsic connections between their stated desires and actual actions than humans. For example, if you ask about code, models can and will discuss SQL injections (or buffer overflows, or other classic weaknesses, bugs, and vulnerabilities) and best practices to avoid them in considerable detail… while also being prone to writing them wherever a naive human might. Step-by-step reasoning, planning, or model cascades do provide a mechanism to convert verbal claims into actions; but I’m confident that strong supervision of such intermediates is feasible.
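(To make that concrete, here is a minimal Python sketch of the kind of thing I mean: the injectable string-interpolation pattern next to the parameterized fix, plus a toy check on model-written code before it ever runs. The function names and the regex heuristic are illustrative assumptions of mine, not anything from the paper; real supervision of intermediates would use something far stronger than a pattern match.)

```python
import re
import sqlite3

# The injectable pattern a code model (like a naive human) might emit:
# the query is built by string interpolation, so input can alter its structure.
def find_user_unsafe(conn: sqlite3.Connection, username: str):
    query = f"SELECT * FROM users WHERE name = '{username}'"  # vulnerable
    return conn.execute(query).fetchall()

# The fix the same model can describe in detail: a parameterized query.
def find_user_safe(conn: sqlite3.Connection, username: str):
    return conn.execute("SELECT * FROM users WHERE name = ?", (username,)).fetchall()

# Toy "supervision of intermediates": scan generated code for interpolated SQL
# before anything is executed. Hypothetical heuristic, for illustration only.
INTERPOLATED_SQL = re.compile(
    r"""f["'].*\b(SELECT|INSERT|UPDATE|DELETE)\b.*\{.*\}""", re.IGNORECASE
)

def flag_suspicious_sql(generated_code: str) -> bool:
    """Return True if the generated code appears to build SQL via interpolation."""
    return bool(INTERPOLATED_SQL.search(generated_code))

if __name__ == "__main__":
    snippet = "query = f\"SELECT * FROM users WHERE name = '{username}'\""
    print(flag_suspicious_sql(snippet))  # True -> hold for review instead of executing
```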
I’m not sure whether you have a specific key relaxation in mind (and if so, what it is), or are claiming that any particular safety assumption is pretty likely to be violated?
The key relaxation here is the assumption that deceptive alignment will not happen. In many ways, a lot of hopes rest on deceptive alignment not being a problem.
I also disagree with the expectation that alignment-ish properties will keep generalizing about as well as capabilities, since I think the non-myopia found here is a key way that something like goal misgeneralization or the sharp left turn could happen: a model remains very capable, but loses its alignment properties due to deceptive alignment.