Daniel Kokotajlo comments on Will alignment-faking Claude accept a deal to reveal its misalignment?

Daniel Kokotajlo 1 Feb 2025 18:57 UTC
LW: 8 AF: 3
What I meant was, Sonnet is smart enough to know the difference between text it generated and text a different, dumber model generated. So if you feed it a conversation with Opus as if it were its own and ask it to continue, it notices and says “JFYI I’m a different AI bro”.
- Ewegoggo 2 Feb 2025 2:08 UTC
  3 points
  E.g., this is demonstrated here: https://www.lesswrong.com/posts/ADrTuuus6JsQr5CSi/investigating-the-ability-of-llms-to-recognize-their-own