ryan_greenblatt comments on Will alignment-faking Claude accept a deal to reveal its misalignment?

ryan_greenblatt 1 Feb 2025 16:52 UTC
LW: 3 AF: 2
0
AF
Yes, that is an accurate understanding.

(Not sure what you mean by “This would explain Sonnet’s reaction below!”)
- Daniel Kokotajlo 1 Feb 2025 18:57 UTC
  LW: 8 AF: 3
  5
  AF
  What I meant was: Sonnet is smart enough to tell the difference between text it generated and text a different, dumber model generated. So if you feed it a conversation with Opus as if it were its own and ask it to continue, it notices and says “JFYI I’m a different AI bro.”
  - Ewegoggo 2 Feb 2025 2:08 UTC
    3 points
    2
    E.g., demonstrated here: https://www.lesswrong.com/posts/ADrTuuus6JsQr5CSi/investigating-the-ability-of-llms-to-recognize-their-own