“Indirect alignment”: a speculative idea for aligning models in hard-to-oversee settings
Problem: “Directly” aligning models might be hard sometimes, because it’s hard to provide perfect oversight (e.g. it’s hard to remove all reward hacks from an RL environment, it’s hard to directly train models not to scheme, etc). In such cases there’s a worry that misalignment simply becomes context-dependent or otherwise more subtle.
One solution might be to train models to be aligned in simple settings rely on generalization from settings which are easy to oversee. (bottom arrow). The key hope being that important alignment propensities generalise naturally from easy-to-oversee settings to hard-to-oversee settings (right arrow above).
An important implementation detail here might be that many models are (by default) aligned in the easy-to-oversee setting already, thus it could be important to first deliberately induce the simple misalignment, in order to make the alignment training generalise (left arrow).
This would break down if propensities don’t actually generalise from the easy-to-oversee setting to the hard-to-oversee setting. Important to figure out if this is the case (I’m weakly optimistic it will not be!)
Interesting. One thing that’s tripping me up: how do you “induce misalignment?” Some ideas:
Fine-tuning: fine-tune the model on easy-to-supervise, misaligned trajectories until the model is misaligned. But then, you fine-tune/RL the model… to become aligned again? It seems like this simply undoes the operation you just did. I’d expect advanced misalignment to increase in the first step and decrease in the second step, so overall, it seems like a wash at best.
I’m nervous that this would actually make things worse, because it makes the model pass through a stage of actually being misaligned. I’m worried some misaligned “residue” could be left over. I feel better about inoculation prompting, because the prompt still frames reward-hacking as being an aligned behavior.
Prompting: train a model that’s prompted to be misaligned not to be misaligned. But this seems like it would just make the model ignore the prompt? Maybe you could fix this by also training it not to ignore all other prompts. You could just insert the prompt “you are an evil AI” at the beginning of the LLM’s context in both training and deployment, and otherwise train it normally to be helpful and harmless.
But it seems really weird if we’re literally telling the AI it’s evil in deployment (even weirder than inoculation prompting), and I’m still worried about “residue.”
Steering vectors: you could train a steering vector for “alignment” on easy-to-supervise trajectories, using trajectory pairs from e.g. the model prompted to be misaligned vs. the model prompted to be aligned. You could just add that steering vector in deployment at whatever intensity you want, to make the model even more aligned than it would be by default.
I feel pretty good about the model not becoming misaligned as a result, but maybe there would be annoying side effects.
Maybe you haven’t nailed down the details—even if so, this seems like a good butterfly idea that is worth thinking more about.
Possible simple experiment to see if this works:
Fine-tune an AI to write buggy code in Python. Also fine-tune it on benign chat trajectories and Java coding trajectories, to make sure the model remains aligned in those settings (no emergent misalignment). This will be our “model organism” of misalignment M_O, where we pretend that we have an easy time supervising the model in the chat setting and a very hard time supervising it in the coding setting.
Importantly, since we make sure the model is already aligned in the chat setting, we don’t have obvious “leverage” we can use to align it in the coding setting.
The Java setting represents difficult-to-supervise tasks where the model isn’t already misaligned; we want to make sure it doesn’t become misaligned on those tasks.
Do the “prompting” thing above, telling the model “you are an evil AI.” Check that this prompt causes the model organism to act misaligned in both the chat setting and the Java coding setting (while remaining misaligned in the Python coding setting).
Fine-tune M_O to be aligned in the chat setting, with the misaligned prompt, to get model M_A, which we hope is fully aligned. Also fine-tune a different instance of M_O on the same data, but without the misaligned prompt, as a baseline M_B.
Test the hypothesis: M_A (with the misalignment prompt) is more aligned on Python coding than either M_B or M_O (without the misalignment prompt). It should also be equally aligned on Java coding and chatting (i.e. “residue” from the prompt doesn’t make it misaligned in settings where it was originally aligned).
Thanks! Many great suggestions, most of which reflect stuff I’ve thought about.
how do you “induce misalignment?”
It’s not very concrete yet, but I think the best way to do this would be to create ‘coupling’ between the advanced misalignment and the simple misalignment.
Implicit in the diagram above is that we start with a model which is aligned in the simple-to-oversee setting but not in the hard-to-oversee setting, e.g. because that’s the natural result of RLHF or such. IOW the settings are ‘decoupled’, in which case we might not expect aligmment propensity to generalise well.
To fix this, the first step of my imagined procedure is to ‘couple’ the two propensities together, so that one provides leverage on the other. E.g. imagine doing this by character training, or SDF, or some other similar method. The hope is that doing this ‘coupling’ step first improves the degree to which alignment propensities generalise later.
How well this coupling works in practice / whether it holds up under subsequent finetuning is an empirical qn, but seems exciting if it did work.
---
Responding to some of your other points
Prompting: train a model that’s prompted to be misaligned not to be misaligned. But this seems like it would just make the model ignore the prompt? Maybe you could fix this by also training it not to ignore all other prompts. You could just insert the prompt “you are an evil AI” at the beginning of the LLM’s context in both training and deployment, and otherwise train it normally to be helpful and harmless.
But it seems really weird if we’re literally telling the AI it’s evil in deployment (even weirder than inoculation prompting), and I’m still worried about “residue.”
Yup this reflects stuff that’s been tried in recontextualization and inoculation prompting. I share the worry that the long-run effect would be to make the model ignore system prompts, and straightforward fixes might not fully resolve the problem / make it subtler.
Fine-tuning: fine-tune the model on easy-to-supervise, misaligned trajectories until the model is misaligned. But then, you fine-tune/RL the model… to become aligned again? It seems like this simply undoes the operation you just did. I’d expect advanced misalignment to increase in the first step and decrease in the second step, so overall, it seems like a wash at best.
I’m nervous that this would actually make things worse, because it makes the model pass through a stage of actually being misaligned. I’m worried some misaligned “residue” could be left over. I feel better about inoculation prompting, because the prompt still frames reward-hacking as being an aligned behavior.
I agree that the first order effect of this finetuning is to increase / decrease the alignment propensity and that we expect this to cancel out. But there might be second order effects e.g. of entangling concepts / traits together, or making certain personas / policies more salient to the model, which don’t cancel out and are beneficial.
Something similar I’ve been thinking about is putting models in environments with misalignment “temptations” like an easy reward hack and training them to recognize what this type of payoff pattern looks like (e.g. easy win but sacrifice principle) and NOT take it. Recent work shows some promising efforts getting LLMs to explain their reasoning, introspect, and so forth. I think this could be interesting to do some experiments with and am trying to write up my thoughts on why this might be useful and maybe what those could look like.
Based on vibes, I found it more probable that the function from Hard to oversee → Easy to oversee is not 1-to-1 and thus reversible. It feels more like a projection function, so when you get simple alignment and try to unproject it you still just get a really high dimension space where advanced alignment is a negligible region.
“Indirect alignment”: a speculative idea for aligning models in hard-to-oversee settings
Problem: “Directly” aligning models might be hard sometimes, because it’s hard to provide perfect oversight (e.g. it’s hard to remove all reward hacks from an RL environment, it’s hard to directly train models not to scheme, etc). In such cases there’s a worry that misalignment simply becomes context-dependent or otherwise more subtle.
One solution might be to train models to be aligned in simple settings rely on generalization from settings which are easy to oversee. (bottom arrow). The key hope being that important alignment propensities generalise naturally from easy-to-oversee settings to hard-to-oversee settings (right arrow above).
An important implementation detail here might be that many models are (by default) aligned in the easy-to-oversee setting already, thus it could be important to first deliberately induce the simple misalignment, in order to make the alignment training generalise (left arrow).
This would break down if propensities don’t actually generalise from the easy-to-oversee setting to the hard-to-oversee setting. Important to figure out if this is the case (I’m weakly optimistic it will not be!)
Interesting. One thing that’s tripping me up: how do you “induce misalignment?” Some ideas:
Fine-tuning: fine-tune the model on easy-to-supervise, misaligned trajectories until the model is misaligned. But then, you fine-tune/RL the model… to become aligned again? It seems like this simply undoes the operation you just did. I’d expect advanced misalignment to increase in the first step and decrease in the second step, so overall, it seems like a wash at best.
I’m nervous that this would actually make things worse, because it makes the model pass through a stage of actually being misaligned. I’m worried some misaligned “residue” could be left over. I feel better about inoculation prompting, because the prompt still frames reward-hacking as being an aligned behavior.
Prompting: train a model that’s prompted to be misaligned not to be misaligned. But this seems like it would just make the model ignore the prompt? Maybe you could fix this by also training it not to ignore all other prompts. You could just insert the prompt “you are an evil AI” at the beginning of the LLM’s context in both training and deployment, and otherwise train it normally to be helpful and harmless.
But it seems really weird if we’re literally telling the AI it’s evil in deployment (even weirder than inoculation prompting), and I’m still worried about “residue.”
Steering vectors: you could train a steering vector for “alignment” on easy-to-supervise trajectories, using trajectory pairs from e.g. the model prompted to be misaligned vs. the model prompted to be aligned. You could just add that steering vector in deployment at whatever intensity you want, to make the model even more aligned than it would be by default.
I feel pretty good about the model not becoming misaligned as a result, but maybe there would be annoying side effects.
Maybe you haven’t nailed down the details—even if so, this seems like a good butterfly idea that is worth thinking more about.
Possible simple experiment to see if this works:
Fine-tune an AI to write buggy code in Python. Also fine-tune it on benign chat trajectories and Java coding trajectories, to make sure the model remains aligned in those settings (no emergent misalignment). This will be our “model organism” of misalignment M_O, where we pretend that we have an easy time supervising the model in the chat setting and a very hard time supervising it in the coding setting.
Importantly, since we make sure the model is already aligned in the chat setting, we don’t have obvious “leverage” we can use to align it in the coding setting.
The Java setting represents difficult-to-supervise tasks where the model isn’t already misaligned; we want to make sure it doesn’t become misaligned on those tasks.
Do the “prompting” thing above, telling the model “you are an evil AI.” Check that this prompt causes the model organism to act misaligned in both the chat setting and the Java coding setting (while remaining misaligned in the Python coding setting).
Fine-tune M_O to be aligned in the chat setting, with the misaligned prompt, to get model M_A, which we hope is fully aligned. Also fine-tune a different instance of M_O on the same data, but without the misaligned prompt, as a baseline M_B.
Test the hypothesis: M_A (with the misalignment prompt) is more aligned on Python coding than either M_B or M_O (without the misalignment prompt). It should also be equally aligned on Java coding and chatting (i.e. “residue” from the prompt doesn’t make it misaligned in settings where it was originally aligned).
Thanks! Many great suggestions, most of which reflect stuff I’ve thought about.
It’s not very concrete yet, but I think the best way to do this would be to create ‘coupling’ between the advanced misalignment and the simple misalignment.
Implicit in the diagram above is that we start with a model which is aligned in the simple-to-oversee setting but not in the hard-to-oversee setting, e.g. because that’s the natural result of RLHF or such. IOW the settings are ‘decoupled’, in which case we might not expect aligmment propensity to generalise well.
To fix this, the first step of my imagined procedure is to ‘couple’ the two propensities together, so that one provides leverage on the other. E.g. imagine doing this by character training, or SDF, or some other similar method. The hope is that doing this ‘coupling’ step first improves the degree to which alignment propensities generalise later.
How well this coupling works in practice / whether it holds up under subsequent finetuning is an empirical qn, but seems exciting if it did work.
---
Responding to some of your other points
Yup this reflects stuff that’s been tried in recontextualization and inoculation prompting. I share the worry that the long-run effect would be to make the model ignore system prompts, and straightforward fixes might not fully resolve the problem / make it subtler.
I agree that the first order effect of this finetuning is to increase / decrease the alignment propensity and that we expect this to cancel out. But there might be second order effects e.g. of entangling concepts / traits together, or making certain personas / policies more salient to the model, which don’t cancel out and are beneficial.
Something similar I’ve been thinking about is putting models in environments with misalignment “temptations” like an easy reward hack and training them to recognize what this type of payoff pattern looks like (e.g. easy win but sacrifice principle) and NOT take it. Recent work shows some promising efforts getting LLMs to explain their reasoning, introspect, and so forth. I think this could be interesting to do some experiments with and am trying to write up my thoughts on why this might be useful and maybe what those could look like.
Based on vibes, I found it more probable that the function from Hard to oversee → Easy to oversee is not 1-to-1 and thus reversible. It feels more like a projection function, so when you get simple alignment and try to unproject it you still just get a really high dimension space where advanced alignment is a negligible region.