Fascinating pattern. I’d love to see more investigation of this, including adding in real identities. For example, if you add a ‘rough draft’ or ‘example’ letter with my name, do the LLMs default to the most helpful mode? What public figures do they appear to like or dislike? Or do they roll to disbelieve? (For example, when I visited the Claude Andon Market shop the other day, the Haiku running the shop didn’t believe I was really Gwern when I was talking to it, based on its internal email report of an ‘impersonation attack’. I thought we had talked enough for it to truesight me, but maybe Haikus are too small to do that reliably in a short conversation?)