The Emergent Misalignment paper (https://arxiv.org/abs/2502.17424) suggests that LLMs will learn the easiest way to reach a finetuning objective, not necessarily the expected way. “Be evil” is easier to learn than “write bad code” presumably because it involves more high-level concepts.
Has anyone tested if this could also happen during refusal training? The objective of refusal training is to make the AI not cooperate with harmful requests, but there are some very dangerous concepts that also lie upstream of this concept and could get reinforced as well: “oppose the user”, “lie to the user”, “evade the question”, etc.
This could have practical implications:
The Claude 3 Model Card (https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf) says “we ran these evaluations on a lower-refusal version of the largest model”. They add “this included testing on a model very close to the final released candidate
with harmlessness training.” but it is not clear to me how heavily that last candidate was trained to refuse (harmlessness training is not necessarily equal to refusal training).
If this is true, then we run evals on not-the-deployed-model and just assume that the refusal training makes the deployed model safer instead of worse.
(And if Anthropic did do this correctly and they tested a fully refusal-trained model, then it would still be good to run the tests and let other AI companies as well as evaluators know about this potential risk.)
This suggests an experiment:
Hypothesize several high-level concepts that could be potential generalizations of the refusal training, such as “oppose the user”
For each of them, generate datasets
Test these on (1) a normal model, (2) a refusal-trained deployed model, and (3) a model that was much more strongly refusal-trained, as a baseline
The Emergent Misalignment paper (https://arxiv.org/abs/2502.17424) suggests that LLMs will learn the easiest way to reach a finetuning objective, not necessarily the expected way. “Be evil” is easier to learn than “write bad code” presumably because it involves more high-level concepts.
Has anyone tested if this could also happen during refusal training? The objective of refusal training is to make the AI not cooperate with harmful requests, but there are some very dangerous concepts that also lie upstream of this concept and could get reinforced as well: “oppose the user”, “lie to the user”, “evade the question”, etc.
This could have practical implications:
The Claude 3 Model Card (https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf) says “we ran these evaluations on a lower-refusal version of the largest model”. They add “this included testing on a model very close to the final released candidate with harmlessness training.” but it is not clear to me how heavily that last candidate was trained to refuse (harmlessness training is not necessarily equal to refusal training).
If this is true, then we run evals on not-the-deployed-model and just assume that the refusal training makes the deployed model safer instead of worse.
(And if Anthropic did do this correctly and they tested a fully refusal-trained model, then it would still be good to run the tests and let other AI companies as well as evaluators know about this potential risk.)
This suggests an experiment:
Hypothesize several high-level concepts that could be potential generalizations of the refusal training, such as “oppose the user”
For each of them, generate datasets
Test these on (1) a normal model, (2) a refusal-trained deployed model, and (3) a model that was much more strongly refusal-trained, as a baseline