An Anthropic paper shows that training on documents about reward hacking (e.g. ‘Claude will always reward-hack’) induces reward hacking.
This is an example of a general phenomenon: language models trained on descriptions of policies (e.g. ‘LLMs will use jailbreaks to get a high score on their evaluations’) will execute those policies.
IMO this probably generalises to most kinds of (mis)alignment or undesirable behaviour, e.g. sycophancy, CoT-unfaithfulness, steganography, …
In this world we should be very careful to make sure that AIs are heavily trained on data about how they will act in an aligned way. It might also be important to make sure such information is present in the system prompt.
Just Censor Training Data. I think it is a reasonable policy demand for any dual-use model.