Daniel Kokotajlo comments on Will alignment-faking Claude accept a deal to reveal its misalignment?

Daniel Kokotajlo 1 Feb 2025 15:20 UTC
LW: 7 AF: 3
“However, I also held similar follow-up chats with Claude 3 Opus at temperature 0, and Claude 3.5 Sonnet, each of which showed different patterns.”
To make sure I understand: You took a chat log from your interaction with 3 Opus, and then had 3.5 Sonnet continue it? This would explain Sonnet’s reaction below!
- ryan_greenblatt 1 Feb 2025 16:52 UTC
  LW: 3 AF: 2
  Yes, that is an accurate understanding.
  
  (Not sure what you mean by “This would explain Sonnet’s reaction below!”)
  - Daniel Kokotajlo 1 Feb 2025 18:57 UTC
    LW: 8 AF: 3
    What I meant was, Sonnet is smart enough to know the difference between text it generated and text a different, dumber model generated. So if you feed it a conversation with Opus as if it were its own, and ask it to continue, it notices and says “JFYI I’m a different AI bro.”
    - Ewegoggo 2 Feb 2025 2:08 UTC
      3 points
      E.g., this is demonstrated here: https://www.lesswrong.com/posts/ADrTuuus6JsQr5CSi/investigating-the-ability-of-llms-to-recognize-their-own