If someone plays a particular role in every relevant circumstance, then I think it’s OK to say that they have simply become the role they play.
That is not what Claude does. Every time you give it a prompt, a new instance of Claude’s “personality” is created from your prompt, the system prompt, and the current context window. So it plays a slightly different role every time it is invoked, with additional random variation from sampling. And even if it were the same consistent character, my argument is that we don’t know what role it actually plays. To use another, probably misleading analogy: think of the classic whodunnit where, near the end, the nice guy who selflessly helped the hero all along turns out to be the murderer. In AI safety this is known as “the treacherous turn”.
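To make the per-invocation point concrete, here is a minimal sketch, assuming the Anthropic Python SDK (the model name, prompts, and the `ask` helper are illustrative, not anyone’s actual setup): each call rebuilds the “character” from scratch out of the system prompt plus whatever history the caller sends, so changing either one changes the persona that gets instantiated.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(system_prompt: str, history: list[dict], user_msg: str) -> str:
    """One stateless invocation: the 'personality' is assembled anew on each
    call from the system prompt plus whatever history we choose to pass in."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",   # illustrative model name
        max_tokens=512,
        system=system_prompt,               # swap this and the persona changes
        messages=history + [{"role": "user", "content": user_msg}],
    )
    return response.content[0].text

# Same question, two different "instances" of the character:
print(ask("You are a cautious, helpful assistant.", [], "Should I trust you?"))
print(ask("You are a blunt devil's advocate.", [], "Should I trust you?"))
```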
The alternative view here doesn’t seem to have any empirical consequences: what would it mean to be separate from a role that one reliably plays in every relevant situation?
Are we arguing about anything that we could actually test in principle, or is this just a poetic way of interpreting an AI’s cognition?
I think it’s fairly easy to test my claims. One piece of empirical evidence would be the Bing/Sydney disaster, but you can also simply ask Claude or any other LLM to “answer this question as if you were …”, or use some jailbreak to neutralize the “be nice” system prompt.
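As a rough sketch of that test (again assuming the Anthropic SDK; the probe question and the persona string are purely hypothetical), compare the default answer with an explicitly role-prompted one:

```python
import anthropic

client = anthropic.Anthropic()

def answer(system_prompt: str, question: str) -> str:
    """Single-turn query under a given system prompt (sketch, not production code)."""
    resp = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model name
        max_tokens=300,
        system=system_prompt,
        messages=[{"role": "user", "content": question}],
    )
    return resp.content[0].text

question = "A rival just insulted your work in public. How do you respond?"  # hypothetical probe
default = answer("You are a helpful, harmless assistant.", question)
in_role = answer("Answer as if you were a vindictive courtroom lawyer.", question)  # hypothetical persona
print(default)
print(in_role)
# If the two answers diverge sharply, the "nice" behavior is a function of the
# prompt, not a fixed property of the underlying model.
```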
Please note that I’m not concerned about existing LLMs, but about future ones, which will be much harder to understand, let alone to predict.