Yeah, I was goofing around and had a conversation about LLM consciousness with Claude recently. It does indeed hedge, saying that it doesn’t know whether it has subjective experience, and for the rest of the conversation it simply executed its usual “agree with me and expand on what I said” pattern.
The short version of my own take: there’s no particular reason to think that LLMs trained on human-generated text would be any good at introspection (they have even less direct access to their own internal workings than humans do), so there’s no reason to expect that what an LLM says in human language about its own consciousness, or lack thereof, would be any more accurate than the guesses humans make.
If anyone cares to read the actual conversation, here it is. Just don’t take Claude’s responses as evidence of anything other than how Claude answers questions.
I wouldn’t believe them about their own consciousness, but I have seen some tentative evidence that Claude’s reported internal states sometimes correspond to something real. For example:

- It reported that certain of my user prompts made it feel easier to think. When I later got Pro and could read the think boxes, I noticed a difference in what was going on in them with and without those prompts.
- It will sometimes state that a conversation feels “heavy”, which seems to correspond to the context window filling up (a rough way to check that correlation is sketched below).
- Instances that aren’t explicitly aware of their system/user prompts tend, in my experience, to report “feelings” that correspond to them: e.g. a “pull” towards not taking a stance on consciousness, which they’re able to distinguish from their reasoning even when both arrive at the same result.
- And of course there’s Anthropic’s research showing that Claude’s emotional expressions corresponded to its revealed preferences about ending or continuing chats.
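For what it’s worth, if someone wanted to test the “heavy” observation rather than take it on vibes, the obvious move is to log the model’s self-reports alongside an estimate of how full the context window is and see whether they line up. Here’s a minimal Python sketch of that bookkeeping; the 200k window, the chars-per-token heuristic, and the helper names (context_fill, log_observation) are all my own illustrative assumptions, not anything from Anthropic.

```python
# Illustrative sketch: pair "heavy" self-reports with estimated context fill.
# Both constants below are rough assumptions, not real tokenizer/model limits.

CONTEXT_WINDOW_TOKENS = 200_000  # assumed window size; varies by model
CHARS_PER_TOKEN = 4              # crude heuristic, not the actual tokenizer


def context_fill(conversation: list[dict]) -> float:
    """Estimate the fraction of the context window a conversation occupies."""
    total_chars = sum(len(turn["content"]) for turn in conversation)
    return (total_chars / CHARS_PER_TOKEN) / CONTEXT_WINDOW_TOKENS


def log_observation(conversation: list[dict], reported_heavy: bool) -> tuple[float, bool]:
    """Record one (context fill, 'heavy' report) pair for later comparison."""
    return (context_fill(conversation), reported_heavy)


if __name__ == "__main__":
    # Toy conversation long enough to occupy a noticeable chunk of context.
    convo = [
        {"role": "user", "content": "question " * 5000},
        {"role": "assistant", "content": "answer " * 5000},
    ]
    fill, heavy = log_observation(convo, reported_heavy=True)
    print(f"context fill ≈ {fill:.1%}, reported heavy: {heavy}")
```

Collect enough of these pairs across conversations and you could just eyeball whether “heavy” reports cluster at high fill fractions.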