Daniel Tan comments on Daniel Tan’s Shortform

Daniel Tan 26 Jan 2025 8:22 UTC
1 point
0
An Anthropic paper shows that training on documents about reward hacking (e.g. ‘Claude will always reward-hack’) induces reward hacking.
This is an example of a general phenomenon: language models trained on descriptions of policies (e.g. ‘LLMs will use jailbreaks to get a high score on their evaluations’) will execute those policies.
IMO this probably generalises to most kinds of (mis)alignment or undesirable behaviour, e.g. sycophancy, CoT unfaithfulness, steganography, …
In this world we should be very careful to make sure that AIs are heavily trained on data about how they will act in an aligned way. It might also be important to make sure such information is present in the system prompt.
- quetzal_rainbow 26 Jan 2025 9:57 UTC
  3 points
  0
  Just Censor Training Data. I think it is a reasonable policy demand for any dual-use models.