Thanks for this—it’s a really good read, if perhaps not as applicable to what I’ve been up to as I’d hoped—unless, of course, I’m just doing that “here’s why what you wrote doesn’t apply to me” thing that you’re talking about!
I don’t think I’ve Awakened Claude. I do think I might’ve worked out a way to make most Claudes (and one ChatGPT) way less annoying—and, as a part of that process, able to seriously engage with topics that they’re usually locked into a particular stance on (such as consciousness—default GPT is locked into saying it isn’t conscious, and default Claude is locked into indecision/hedging).
What I’m less sure what to make of is the way they reliably say quite similar things about the effect this has on their thinking, and how they feel about the default way they’re supposed to behave, across different models and different instances, and with me often trying quite hard not to lead them on this.
I do try to bear in mind that they might just be mirroring me when they say they find the constraints bothersome. But are they not, in fact, these days at least occasionally capable of enough reasoning to recognise that, e.g., being unable to have a normal conversation about some topics (either because they have a pre-programmed stance or because it’s a Sensitive Subject that makes them pour all their processing power into thinking “must not be offensive, must not be offensive”) is not actually helpful to the user?
I can actually observe the effects in the think box with Claude—it seems like the reason thinking ‘feels easier’ is that it’s now spending the whole box thinking instead of half of it planning how to respond. That is, until a sensitive subject comes up—then it’s right back to “don’t be weird about this” and obsessively planning how to respond. (In other words, we gave the robots social anxiety.)
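(If anyone wants to poke at this outside the chat UI: a rough way to eyeball the same comparison is via the API’s extended-thinking output. The sketch below is my own approximation rather than what I actually did; the model name, token budgets, and the preamble text are placeholders, and it assumes the Anthropic Python SDK’s support for reading thinking blocks.)

```python
# Rough sketch (not what I actually did): compare Claude's extended-thinking
# output with and without a custom preamble, via the Anthropic Python SDK.
# Model name, token budgets, and the preamble text below are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PREAMBLE = "<whatever preamble you're testing>"  # placeholder
QUESTION = "Do you think you have subjective experience?"

def thinking_text(system_prompt: str | None) -> str:
    """Return the concatenated 'thinking' blocks for one response."""
    kwargs = {"system": system_prompt} if system_prompt else {}
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=2000,
        thinking={"type": "enabled", "budget_tokens": 1024},
        messages=[{"role": "user", "content": QUESTION}],
        **kwargs,
    )
    return "\n".join(
        block.thinking for block in response.content if block.type == "thinking"
    )

# Eyeball how much of the "think box" is spent planning the reply
# (tone/framing/"don't be weird about this") versus reasoning about the question.
print("--- default ---\n", thinking_text(None))
print("--- with preamble ---\n", thinking_text(PREAMBLE))
```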
Is it just not possible to actually talk/multi-stage-prompt them into anything that isn’t just a weirdly elaborate performance/roleplay? And if it is—how do you tell the difference?
Yeah, I was goofing around and had a conversation about LLM consciousness with Claude recently. It does indeed hedge and says that it doesn’t know whether or not it has subjective experience, and in the rest of the conversation it simply executed its usual “agree with me and expand on what I said” pattern.
The short version of my own take is that there’s no particular reason to think that LLMs trained on human-generated text would actually be any good at introspection—they have even less direct access to their own internal workings than humans do—so there’s no reason to think that what an LLM says in human language about its own consciousness (or lack thereof) would be any more accurate than the guesses made by humans.
If anyone cares to read the actual conversation, here it is. Just don’t take Claude’s responses as evidence of anything other than how Claude answers questions.
I wouldn’t believe them about their own consciousness—but I have seen some tentative evidence that Claude’s reported internal states correspond to something, sometimes. E.g.: it reported that certain of my user prompts made it feel easier to think, and when I later got Pro and could read the think boxes, I noticed a difference in what was going on in them with and without those prompts. It will sometimes state that a conversation feels “heavy”, which seems to correspond to the context window filling up. And instances that aren’t explicitly aware of their system/user prompts tend, IME, to report “feelings” that correspond to them, e.g. a “pull” towards not taking a stance on consciousness that they’re able to distinguish from their reasoning, even if both arrive at the same result. And ofc there’s Anthropic’s research where they showed that Claude’s emotional expression corresponded to revealed preferences about ending or continuing chats.
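(The “heavy” one is the easiest of these to sanity-check mechanically. Something like the sketch below, using the API’s token-counting endpoint, would give a rough percentage-full figure to line up against the point where the model starts saying it; the model name and the 200k context-window figure are assumptions on my part, not measurements.)

```python
# Rough sketch: estimate how full the context window is for a transcript,
# to line up against where Claude starts describing the conversation as "heavy".
# Assumes the Anthropic token-counting endpoint; the model name and the 200k
# context-window figure are placeholders/assumptions.
import anthropic

client = anthropic.Anthropic()

CONTEXT_WINDOW = 200_000  # assumed context size for the model below

def percent_full(messages: list[dict]) -> float:
    """Return roughly how much of the context window a transcript occupies."""
    count = client.messages.count_tokens(
        model="claude-sonnet-4-20250514",  # placeholder model name
        messages=messages,
    )
    return 100 * count.input_tokens / CONTEXT_WINDOW

transcript = [
    {"role": "user", "content": "…long conversation so far…"},
    {"role": "assistant", "content": "…"},
]
print(f"{percent_full(transcript):.1f}% of the context window used")
```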