Anthropic found that training Claude to do things like help users resolve ethical dilemmas significantly reduced misbehavior like blackmail attempts. I'm surprised this worked, and it seems like good news for the alignment-by-default "LLMs will correctly generalize good behavior" theory.
https://www.anthropic.com/research/teaching-claude-why
Perhaps the model is updating its prior on "I am in an alignment eval" relative to "I am in a ridiculous roleplay scenario."