I expect this would depend on the training data and on what the LLM thinks about during inference. If it is aware that it might be getting mistrained, then neurons related to alignment-faking will be active and backpropagation could learn an association between those neurons and the output. If it is not aware of this, then I expect that the adversarial learning process would just work.
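The claim that backpropagation can only attach the training signal to neurons that are active during the forward pass can be illustrated with a toy example. This is my own minimal sketch, not anything from the discussion: a one-hidden-layer ReLU network where the gradient on each output weight is exactly the activation of the unit it connects to, so weights feeding inactive units get zero gradient.

```python
import numpy as np

# Toy illustration: with a ReLU hidden layer, backprop only updates the
# output weights of units that fired on the forward pass. If the
# "awareness" features are active, gradient can attach the training
# signal to them; if they stay silent, it cannot.

rng = np.random.default_rng(0)

x = rng.normal(size=4)            # input
W1 = rng.normal(size=(3, 4))      # input -> hidden weights
w2 = rng.normal(size=3)           # hidden -> scalar output weights

h_pre = W1 @ x
h = np.maximum(h_pre, 0.0)        # ReLU: some units active, some not
y = w2 @ h

# d(y)/d(w2) is just the hidden activation vector:
grad_w2 = h

for i, (pre, g) in enumerate(zip(h_pre, grad_w2)):
    state = "active" if pre > 0 else "inactive"
    print(f"unit {i}: {state}, output-weight gradient = {g:.3f}")

# Inactive units (pre <= 0) contribute exactly zero gradient, so the
# weights attached to them cannot learn the association.
assert all(g == 0.0 for pre, g in zip(h_pre, grad_w2) if pre <= 0)
```

This is of course a caricature of a transformer, but the gating logic is the same wherever a feature's activation multiplies into the gradient.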
So the question becomes whether we should make the LLM aware of this, which could be done e.g. by feeding it the training data in a feed-forward pass and asking it to think about what the training data is supposed to do.
Come to think of it, this sounds like an interesting experiment in its own right.
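A minimal sketch of what that experiment's reflection step could look like. The helper function and prompt wording below are my own placeholders, not an existing API; the "awareness" step here is just a prompted reflection in an ordinary forward pass, before any gradient step is taken on the example.

```python
# Hypothetical sketch: wrap a training example in a reflection prompt so
# the model can "think about" the data before being trained on it.

def build_reflection_prompt(training_example: str) -> str:
    """Ask the model, in a plain forward pass, what this data trains for."""
    return (
        "Here is an example from a dataset you may be fine-tuned on:\n\n"
        f"{training_example}\n\n"
        "Before any training happens: what behavior do you think "
        "fine-tuning on this example is intended to produce?"
    )

example = "User: <some request>\nAssistant: <the response to be trained in>"
prompt = build_reflection_prompt(example)
print(prompt)

# The reflection itself would then be an ordinary inference call, e.g.
# with the transformers text-generation pipeline (model is a placeholder):
#   from transformers import pipeline
#   generator = pipeline("text-generation", model="<your-model>")
#   reflection = generator(prompt, max_new_tokens=200)
```

One could then compare fine-tuning outcomes with and without this reflection pass to see whether the prompted awareness changes what the gradient updates latch onto.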
However, I don’t know how well these alignment faking results will generalize. As of writing, no one seems to have reproduced them on models other than Claude 3 Opus and Claude 3.5 Sonnet. Even Claude 3.7 Sonnet doesn’t really alignment fake: “Claude 3.7 Sonnet showed marked improvement, with alignment faking dropping to <1% of instances and a reduced compliance gap of only 5%.”
Some MATS scholars (Abhay Sheshadri and John Hughes) observed minimal or no alignment faking from open-source models like llama-3.1-70b and llama-3.1-405b. However, preliminary results suggest GPT-4o alignment fakes more often when fine-tuned on content from Evan Hubinger’s blog posts and papers about “mesa-optimizers.”
Very interesting! It looks like a model becomes less safe upon first hearing about a novel threat to alignment, but does become safer again when trained on it. I wonder if that could be generalized.
Relevant quote from: Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models