Anthropic found that training Claude to do things like help users resolve ethical dilemmas significantly reduced misbehavior like blackmail attempts. I'm surprised this worked, and it seems like good news for the alignment-by-default "LLMs will correctly generalize good behavior" theory.
https://www.anthropic.com/research/teaching-claude-why
Perhaps the model is updating its prior on "I am in an alignment eval" relative to "I am in a ridiculous roleplay scenario."