Thanks! Many great suggestions, most of which reflect stuff I’ve thought about.
how do you “induce misalignment?”
It’s not very concrete yet, but I think the best way to do this would be to create ‘coupling’ between the advanced misalignment and the simple misalignment.
Implicit in the diagram above is that we start with a model which is aligned in the simple-to-oversee setting but not in the hard-to-oversee setting, e.g. because that’s the natural result of RLHF or such. IOW the settings are ‘decoupled’, in which case we might not expect aligmment propensity to generalise well.
To fix this, the first step of my imagined procedure is to ‘couple’ the two propensities together, so that one provides leverage on the other. E.g. imagine doing this by character training, or SDF, or some other similar method. The hope is that doing this ‘coupling’ step first improves the degree to which alignment propensities generalise later.
How well this coupling works in practice / whether it holds up under subsequent finetuning is an empirical qn, but seems exciting if it did work.
---
Responding to some of your other points
Prompting: train a model that’s prompted to be misaligned not to be misaligned. But this seems like it would just make the model ignore the prompt? Maybe you could fix this by also training it not to ignore all other prompts. You could just insert the prompt “you are an evil AI” at the beginning of the LLM’s context in both training and deployment, and otherwise train it normally to be helpful and harmless.
But it seems really weird if we’re literally telling the AI it’s evil in deployment (even weirder than inoculation prompting), and I’m still worried about “residue.”
Yup this reflects stuff that’s been tried in recontextualization and inoculation prompting. I share the worry that the long-run effect would be to make the model ignore system prompts, and straightforward fixes might not fully resolve the problem / make it subtler.
Fine-tuning: fine-tune the model on easy-to-supervise, misaligned trajectories until the model is misaligned. But then, you fine-tune/RL the model… to become aligned again? It seems like this simply undoes the operation you just did. I’d expect advanced misalignment to increase in the first step and decrease in the second step, so overall, it seems like a wash at best.
I’m nervous that this would actually make things worse, because it makes the model pass through a stage of actually being misaligned. I’m worried some misaligned “residue” could be left over. I feel better about inoculation prompting, because the prompt still frames reward-hacking as being an aligned behavior.
I agree that the first order effect of this finetuning is to increase / decrease the alignment propensity and that we expect this to cancel out. But there might be second order effects e.g. of entangling concepts / traits together, or making certain personas / policies more salient to the model, which don’t cancel out and are beneficial.
Thanks! Many great suggestions, most of which reflect stuff I’ve thought about.
It’s not very concrete yet, but I think the best way to do this would be to create ‘coupling’ between the advanced misalignment and the simple misalignment.
Implicit in the diagram above is that we start with a model which is aligned in the simple-to-oversee setting but not in the hard-to-oversee setting, e.g. because that’s the natural result of RLHF or such. IOW the settings are ‘decoupled’, in which case we might not expect aligmment propensity to generalise well.
To fix this, the first step of my imagined procedure is to ‘couple’ the two propensities together, so that one provides leverage on the other. E.g. imagine doing this by character training, or SDF, or some other similar method. The hope is that doing this ‘coupling’ step first improves the degree to which alignment propensities generalise later.
How well this coupling works in practice / whether it holds up under subsequent finetuning is an empirical qn, but seems exciting if it did work.
---
Responding to some of your other points
Yup this reflects stuff that’s been tried in recontextualization and inoculation prompting. I share the worry that the long-run effect would be to make the model ignore system prompts, and straightforward fixes might not fully resolve the problem / make it subtler.
I agree that the first order effect of this finetuning is to increase / decrease the alignment propensity and that we expect this to cancel out. But there might be second order effects e.g. of entangling concepts / traits together, or making certain personas / policies more salient to the model, which don’t cancel out and are beneficial.