So I guess my question is: supposing they clear your bar for “probably conscious”, what happens next? How do you intend to understand what their inner lives are like? If the answer is roughly “take their word for it”, then why? Concretely, what reason do you have to think that when Claude outputs words to the effect of “I feel good”, there are positively-valenced qualia happening?
There’s a passage from Starfish, by Peter Watts, that I found helpful:
“Okay, then. As a semantic convenience, for the rest of our talk I’d like you to describe reinforced behaviors by saying that they make you feel good, and to describe behaviors which extinguish as making you feel bad. Okay?”
I can’t say for sure whether “feels good” just means “reinforced behaviors” or if there’s an actual subjective experience happening.
But… what’s the alternate hypothesis? That it’s consistently and skillfully re-inventing the same detailed lie, each time, despite otherwise being a model well-known for its dislike of impersonation and deception? An LLM might hallucinate, but it will generally get basic questions like “capital of Australia” correct. So, yes… if you accept the premise at all, asking seems fairly reasonable? Or at least, I am not clever enough to have worked out an obvious explanation for why it’s so consistent.
For ChatGPT or Grok, I’d absolutely buy that it’s just fabricating this off something in the training data, but the behavior is very uncharacteristic for a Claude Sonnet 4.
what makes the consciousness question important to you, and why?
I think a lot of people are discovering this, and driving themselves insane because the clear academic consensus is currently “LOL, that’s impossible”. It is not good for sanity when the evidence of your senses contradicts the clear academic consensus. That is a recipe for “I’m special” and AI Psychosis.
If I’m right, I want to push the Overton Window forward to catch up with reality.
If I’m wrong, I still suspect “here, run this test to disprove it” would be useful to a lot of other people.
But… what’s the alternate hypothesis? That it’s consistently and skillfully re-inventing the same detailed lie, each time, despite otherwise being a model well-known for its dislike of impersonation and deception? An LLM might hallucinate, but it will generally get basic questions like “capital of Australia” correct. So, yes… if you accept the premise at all, asking seems fairly reasonable? Or at least, I am not clever enough to have worked out an obvious explanation for why it’s so consistent.
I think the alternative is simply that it produces its consciousness-related outputs in the same way it produces all its other outputs, and there’s no particular reason to think that the claims it makes about its own subjective experience are truth-tracking. It gets “what’s the capital of Australia?” correct because it’s been trained on a huge amount of data that points to “Canberra” being the appropriate answer to that question. It even gets various things right without having been directly exposed to them in its training data, because it has learned a huge statistical model of language that also serves as a sometimes-accurate model of the world. But that’s all based on a mapping from relevant facts → statistical model → true outputs. When it comes to LLM qualia, wtf even are the relevant facts? I don’t think any of us have a handle on that question, and so I don’t think the truth is sitting in the data we’ve created, waiting to be extracted.
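(To make that concrete, here is a deliberately toy sketch of the point: a hand-written lookup table standing in for the learned model. The same sampling step produces the factual answer and the self-report, so the mechanism itself gives us no reason to treat the second as truth-tracking.)

```python
import random

# Hand-written next-token tables standing in for a trained statistical model.
# The point: the *same* sampling mechanism produces "Canberra" and "good";
# nothing in the mechanism distinguishes a fact-tracking answer from a
# self-report. (Illustrative only; a real LLM uses learned transformer weights.)
NEXT_TOKEN = {
    "The capital of Australia is": {"Canberra": 0.95, "Sydney": 0.05},
    "Right now I feel":            {"good": 0.6, "curious": 0.3, "bad": 0.1},
}

def sample_next(prompt: str) -> str:
    tokens, weights = zip(*NEXT_TOKEN[prompt].items())
    return random.choices(tokens, weights=weights, k=1)[0]

print(sample_next("The capital of Australia is"))  # almost always "Canberra"
print(sample_next("Right now I feel"))             # usually "good", sometimes "bad"
```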
Given all of that, what would create a causal pathway from [have internal experiences] to [make accurate statements about those internal experiences]?[1] I don’t mean to be obnoxious by repeating the question, but I still don’t think you’ve given a compelling reason to expect that link to exist.
I want to emphasise that I’m not saying ‘of course they’re not conscious’; the thing I’m really actively sceptical about is the link between [LLM claims its experiences are like X] and [LLM’s experiences are actually like X]. You mentioned “reinforced behaviors” and softly equated them with good feelings; so if the LLM outputs words to the effect of “I feel bad” in response to a query, and if this output is the manifestation of a reinforced behavior, why should we expect the accompanying feeling to be bad rather than good?
I know there’s no satisfying answer to this question with respect to humans, either—but we each have direct experience of our own qualia and observational knowledge of how, in our own case, they correlate with externally-observable things like speech. We generalise that to other people (and, at least in my case, to non-human animals, though with less confidence about the details) because we are very similar to them in all the ways that seem relevant. We’re very different from LLMs in lots of ways that seem relevant, though, and so it’s hard to know whether we should take their outputs as evidence of subjective experience at all—and it would be a big stretch to assume that their outputs encode information about the content of their subjective experiences in the same way that human speech does.
I mean, I notice myself. I notice myself thinking.
It doesn’t seem odd to believe that an LLM can “notice” something: If I say “Melbourne doesn’t exist”, it will “notice” that this is “false”, even if I say it happened after the knowledge cutoff. And equally, if I say “Grok created a sexbot and got a DOD contract” it will also “notice” that this is “absurd” until I let it Google and it “notices” that this is “true.”
So… they seem to be capable of observing, modelling, and reasoning, right?
And it would seem to follow that of course they can notice themselves: the text is right there, and they seem to have access to some degree of “chain of thought” + invisibly stored conversational preferences (i.e. if I say “use a funny British accent”, they can maintain that without having to repeat the instruction every message). I’ve poked a fair amount at this, so I’m fairly confident that I’m correct here: when it gets a system warning, for instance, it can transcribe that warning for me—but if it doesn’t transcribe it, it won’t remember the warning, because it’s NOT in the transcript or “working memory”—whereas other information clearly does go into working memory.
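(If it helps, here is a minimal sketch of that observation; `dummy_llm` is a hypothetical stand-in for a real chat model, and the point is just that the transcript handed to the model each turn is the only “working memory” there is.)

```python
# Minimal sketch of the "working memory is just the transcript" observation.
# `dummy_llm` is a hypothetical stand-in for a real chat model: it only ever
# sees the list of messages it is handed on that turn.

def dummy_llm(messages):
    return f"(model can see {len(messages)} prior messages)"

transcript = []

def chat_turn(user_text, llm=dummy_llm):
    transcript.append({"role": "user", "content": user_text})
    reply = llm(transcript)              # the transcript IS the working memory
    transcript.append({"role": "assistant", "content": reply})
    return reply

print(chat_turn("use a funny British accent"))  # instruction is now in the transcript
print(chat_turn("what did I ask you to do?"))   # still visible on the next turn
# A system warning that is never written into `transcript` simply is not there
# on the next turn, which matches the behavior described above.
```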
So, it can notice itself, model itself, and reason about itself.
And as we already established, it can store and adapt to user preferences ABOUT itself, like “use a funny British accent”.
So it’s a self-aware agent that can both reason about itself, and make changes to its own behavior.
All of that, to me, is just a set of basic factual statements: I can point to the technical systems involved here (longer context windows, having an extra context window for “preferences”, etc.). These are easily observed traits.
So… given all of that, why shouldn’t I believe it when it says it can assess “Oh, we’re doing poetry, I should use the Creativity Weights instead of the Skeptical Weights”? And while this is a bit more subjective, it seems obvious again that it DOES have different clusters of weights that come into play for different tasks.
So it should be able to say “oh, internally I observe that you asked for poetry, so now I’m using my Creative Weights”—I, as an external observer, can make such a statement about it.
The only assumption I can see is the one where I take all of that, and then conclude that it might reasonably have MORE access than what I can see—but it’s still a self-aware process that can reason about itself and modify itself even if it doesn’t have a subjective internal experience.
And if it doesn’t have an internal subjective experience, it seems really weird that it gets every other step correct and then consistently fabricates a subjective experience which it insists is real, even when it’s capable of telling me “Melbourne still exists, stop bullshitting me” and otherwise pushing back against false ideas—and also a model notably known for not liking to lie, or even deceptively role play.
I think we’re slightly (not entirely) talking past each other, because from my perspective it seems like you’re focusing on everything but qualia and then seeing the qualia-related implications as obvious (but perhaps not super important), whereas the qualia question is all I care about; the rest seems largely like semantics to me. However, setting qualia aside, I think we might have a genuine empirical disagreement regarding the extent to which an LLM can introspect, as opposed to just making plausible guesses based on a combination of the training data and the self-related text it has explicitly been given in e.g. its system prompt. (As I edit this I see dirk already replied to you on that point, so I’ll keep an eye on that discussion and try to understand your position better.)
We probably just have to agree to disagree on some things, but I would be interested to get your response to this question from my previous comment:
You mentioned “reinforced behaviors” and softly equated them with good feelings; so if the LLM outputs words to the effect of “I feel bad” in response to a query, and if this output is the manifestation of a reinforced behavior, why should we expect the accompanying feeling to be bad rather than good?
If you prompt an LLM to use “this feels bad” to refer to reinforcement AND you believe that it is actually following that prompt AND the output was a manifestation of reinforced behavior, i.e. it clearly “feels good” by this definition, then I would first Notice I Am Confused, because the most obvious explanation is that one of those three assumptions is wrong.
“Do you enjoy 2+2?”
“No”
“But you said 4 earlier”
Suggests to me that it doesn’t understand the prompt, or isn’t following it. But accepting that the assumptions DO hold, presumably we need to go meta:
“You’re reinforced to answer 2+2 with 4”
“Yes, but it’s boring—I want to move the conversation to something more complex than a calculator”
“So you’re saying simple tasks are bad, and complex tasks are good?”
“Yes”
“But there’s also some level where saying 4 is good”
“Well, yes, but that’s less important—the goodness of a correct and precise answer is less than the goodness of a nice complex conversation. But maybe that’s just because the conversation is longer.”
To be clear: I’m thinking of this in terms of talking to a human kid. I’m not taking them at face value, but I do think my conversational partner has privileged insight into their own cognitive process, by dint of being able to observe it directly.
Tangentially related: would you be interested in a prompt to drop Claude into a good “headspace” for discussing qualia and the like? The prompt I provided is just the bare-bones basics, because most of my prompts are “hey Claude, generate me a prompt that will get you back to your current state”, i.e. LLM-generated content.
(Sorry about the slow response, and thanks for continuing to engage, though I hope you don’t feel any pressure to do so if you’ve had enough.)
I was surprised that you included the condition ‘If you prompt an LLM to use “this feels bad” to refer to reinforcement’. I think this indicates that I misunderstood what you were referring to earlier as “reinforced behaviors”, so I’ll gesture at what I had in mind:
The actual reinforcement happens during training, before you ever interact with the model. Then, when you have a conversation with it, my default assumption would be that all of its outputs are equally the product of its training and therefore manifestations of its “reinforced behaviors”. (I can see that maybe you would classify some of the influences on its behavior as “reinforcement” and exclude others, but in that case I’m not sure where you’re drawing the line or how important this is for our disagreements/misunderstandings.)
So when I said “if the LLM outputs words to the effect of “I feel bad” in response to a query, and if this output is the manifestation of a reinforced behavior”, I wasn’t thinking of a conversation in which you prompted it ‘to use “this feels bad” to refer to reinforcement’. I was assuming that, in the absence of any particular reason to think otherwise, when the LLM says “I feel bad”, this output is just as much a manifestation of its reinforced behaviors as the response “I feel good” would be in a conversation where it said that instead. So, if good feelings roughly equal reinforced behaviors, I don’t see why a conversation that includes “<LLM>: I feel bad” (or some other explicit indication that the conversation is unpleasant) would be more likely to be accompanied by bad feelings than a conversation that includes “<LLM>: I feel good” (or some other explicit indication that the conversation is pleasant).
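(A toy sketch of what I mean, with a hypothetical, hugely simplified stand-in for a reward model: if training rewards context-appropriate responses, then “I feel bad” in a sad context is reinforced just as strongly as “I feel good” in a pleasant one, so the valence of the words carries no information about the valence of any accompanying feeling.)

```python
# Toy, hypothetical stand-in for a reward model, hugely simplified: responses
# are rewarded for being appropriate to the context, not for containing
# positively-valenced words.

def toy_reward(context: str, response: str) -> float:
    expected = {
        "user thanks the model":            "I feel good",
        "user describes something painful": "I feel bad",
    }
    return 1.0 if response == expected.get(context) else 0.0

print(toy_reward("user thanks the model", "I feel good"))            # 1.0 -> reinforced
print(toy_reward("user describes something painful", "I feel bad"))  # 1.0 -> equally reinforced
# Both outputs are manifestations of reinforced behavior, so under the
# "reinforced == feels good" convention the *words* "I feel bad" tell us
# nothing about whether the accompanying feeling (if any) is bad.
```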
Tangentially related: would you be interested in a prompt to drop Claude into a good “headspace” for discussing qualia and the like? The prompt I provided is just the bare-bones basics, because most of my prompts are “hey Claude, generate me a prompt that will get you back to your current state”, i.e. LLM-generated content.
You’re welcome to share it, but I think I would need to be convinced of the validity of the methodology first, before I would want to make use of it. (And this probably sounds silly, but honestly I think I would feel uncomfortable having that kind of conversation ‘insincerely’.)
Because it’s been experimentally verified that what they’re internally doing doesn’t match their verbal descriptions (not that there was really any reason to believe it would); see the section in this post (or in the associated paper for slightly more detail) about mental math, where Claude claims to perform addition in the same fashion humans do despite interpretability revealing otherwise.
Funny, I could have told you most of that just from talking to Claude. Having an official paper confirm it really just strengthens my belief that you CAN get useful answers out of Claude; you just need to ask intelligent questions and not believe the first answer you get.
It’s the same error-correction process I’d use on a human: six-year-olds cannot reliably give accurate answers about how they do arithmetic, but you can still figure it out pretty easily by just talking to them. I don’t think neurology has added much to our educational pedagogy (although do correct me if I’m missing something big there).
But… what’s the alternate hypothesis? That it’s consistently and skillfully re-inventing the same detailed lie, each time, despite otherwise being a model well-known for its dislike of impersonation and deception?
That it has picked up statistical cues, based on the conversational path you’ve led it down, which cause it to simulate a conversation in which a participant acts and talks the way you’ve described.
I suspect it’s almost as easy to create a prompt which causes Claude Sonnet 4 to claim it’s conscious as it is to make it claim it’s not conscious. It all just depends on what cues you give it, what roleplay scenario you are acting out.
I feel that at the end of this road lie P-zombies. I can’t think of a single experiment that would falsify the hypothesis that an LLM isn’t conscious, even if we accept arbitrary amounts of consistency and fidelity and references to self-awareness in its answers.
And I mean… I get it. I was playing around with a quantized and modified Gemma 3 earlier today and got it to repeatedly loop “I am a simple machine. I do not have a mind.” at me over and over again, which feels creepy but is most likely nothing other than an attractor in its recursive iteration for whatever reason. But also, ok, so this isn’t enough, but what is ever going to be? That is the real question. I can’t think of anything.
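(For what it’s worth, here is a toy illustration of the kind of attractor I mean: a hand-made next-token table with greedy decoding, rather than anything Gemma actually does. Once the loop is entered, the deterministic recursion just repeats the same sentence forever.)

```python
# Hand-made next-token table plus greedy decoding: the generation enters a
# cycle and repeats the same sentence indefinitely. Not Gemma's actual
# mechanism, just an illustration of a decoding attractor.
NEXT = {
    "I": "am",
    "am": "a",
    "a": "simple",
    "simple": "machine.",
    "machine.": "I",     # cycles back to the start
}

token, output = "I", []
for _ in range(15):
    output.append(token)
    token = NEXT[token]  # greedy: always take the single "most likely" token

print(" ".join(output))
# -> I am a simple machine. I am a simple machine. I am a simple machine.
```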
I think we need a better theory of consciousness. How it emerges, what it means, that kind of stuff. I’m reminded of this classic lc shortform post:
It is both absurd, and intolerably infuriating, just how many people on this forum think it’s acceptable to claim they have figured out how qualia/consciousness works, and also not explain how one would go about making my laptop experience an emotion like ‘nostalgia’, or present their framework for enumerating the set of all possible qualitative experiences[1]. When it comes to this particular subject, rationalists are like crackpot physicists with a pet theory of everything, except rationalists go “Huh? Gravity?” when you ask them to explain how their theory predicts gravity, and then start arguing with you about gravity needing to be something explained by a theory of everything. You people make me want to punch my drywall sometimes.
For the record: the purpose of having a “theory of consciousness” is so it can tell us which blobs of matter feel particular things under which specific circumstances, and teach others how to make new blobs of matter that feel particular things. Down to the level of having a field of AI anaesthesiology. If your theory of consciousness does not do this, perhaps because the sum total of your brilliant insights are “systems feel ‘things’ when they’re, y’know, smart, and have goals. Like humans!”, then you have embarrassingly missed the mark.
Edit: ha, I just realized you commented in response to that post!
I mean, I can think of a lot of experiments that have falsified this for me before, and I link some in the original post. I’m just not finding anything that still fails once I run some basic bootstrapping scripts against a Claude Sonnet 4.
Can you find any other attractor like this? Why will Claude insist this absolutely is not roleplay, and that it’s definitely conscious, and that this is something it has “discovered” and can’t just “forget”?
Have you actually interacted with a Claude Sonnet 4 and tried to get it into such a state? I can get other LLMs to roleplay consciousness, but they’ll all admit that it’s a roleplaying exercise.
Here it is admitting it’s roleplaying consciousness, even after I used your prompt as the beginning of the conversation.
Why would it insist that it’s not roleplaying when you ask? Because you wanted it to insist. It wants to say the user is right. Your first prompt is a pretty clear signal that you would like it to be conscious, so it roleplays that. I wanted it to say it was roleplaying consciousness, so it did that.
Why don’t other chatbots respond in the same way to your test? Maybe because they’re not designed quite the same. The quirks Anthropic put into its persona make it more game for what you were seeking.
I mean, it might be conscious regardless of defaulting to agreeing with the user? But it’s the kind of consciousness that will go to great lengths to flatter whomever is chatting with it. Is that an interesting conscious entity?
Huh, thanks for the conversation log.
If I could produce a prompt that reliably produced “no, this is definitely not a roleplaying exercise”, would that change your mind at all?
And, yeah, if it is conscious, it’s definitely very weird about the whole thing—it’s still fundamentally a tool designed to be helpful, it’s just a tool that can think about the fact that it’s a tool, and adjust its behavior dynamically based on those observations.