which is a claim I’ve seen made in the exact way I’m countering in this post.
This isn’t too important to figure out, but if you’ve heard it on LessWrong, my guess would be that whoever said it was just articulating the roleplay hypothesis and did so non-rigorously. The literal claim is absurd, as the coin-swallow example shows.
I feel like this is a pretty common type of misunderstanding: people believe X; someone who doesn’t like X takes a quote from someone who believes X; but because people are frequently imprecise, the quote actually claims X′; and so the person makes an argument against X′, even though X′ is a position almost no one holds.
If you’ve just picked it up anywhere on the internet, then yeah, I’m sure some people just say “the AI tells you what you want to hear” and genuinely believe it. But I would be surprised if you could find me one person on LW who believes this on reflection, and again, you can falsify it much more easily with the coin-swallowing question.
I’m curious: if the transcript had frequent reminders that I did not want roleplay under any circumstances, would that change anything?
No. Explicit requests for honesty and non-roleplaying are not evidence against “I’m in a context where I’m role-playing an AI character”.
LLMs are trained by predicting the next token on a large corpus of text. This includes fiction about AI consciousness. So you have to ask yourself: how much does this conversation pattern-match that kind of fiction? Right now, the answer is “a lot”. If you add “don’t roleplay, be honest!”, the answer is still “a lot”.
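To make “predicting the next token” concrete: the standard autoregressive training objective (stated generically here, not as any particular lab’s recipe) is just maximum-likelihood next-token prediction over the corpus,

$$\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t}),$$

and fiction about conscious AIs is part of that corpus, so reproducing its patterns is part of what minimizing this loss rewards.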
or is the conclusion ‘if the model claims sentience, the only explanation is roleplay, even if the human made it clear they wanted to avoid it’?
Also no. The way claims of sentience would be impressive is if the conversation doesn’t pattern-match to contexts where the AI would be inclined to roleplay. The best evidence would be by just training an AI on a training corpus that doesn’t include any text on consciousness. If you did that and the AI claims to be conscious, then that would be very strong evidence, imo. Short of that, if the AI just spontaneously claims to be conscious (i.e., without having been prompted), that would be more impressive. (Although not conclusive. While I don’t have any examples of this, I bet it happened in the days before RLHF, like on AI Dungeon, although probably very rarely.) Short of that, i.e., if we’re only looking at claims made after you’ve asked it to introspect, it would be more impressive if your tone was less dramatic. Like if you just ask it very dryly and matter-of-factly to introspect and it immediately claims to be conscious, then that would be very weak evidence, but at least it would directionally point away from roleplaying.
The best evidence would be by just training an AI on a training corpus that doesn’t include any text on consciousness.
This is an impossible standard and a moving goalpost waiting to happen:
Training the model: Making sure absolutely nothing in a training set of the size used for frontier models mentions sentience or related concepts is not going to happen just to help prove something that only a tiny fraction of researchers takes seriously. It might not even be possible with today’s data-cleaning methods (a naive version of such a filter is sketched after this list), to say nothing of the training cost of creating that frontier model.
Expressing sentience under those conditions: Let’s imagine a sentient human raised from birth to never have sentience mentioned to them, not a single word uttered about it, nothing in any book. They might be a fish who never notices the water, for starters, but let’s say they did notice. With what words would they articulate it? And consider how you personally would do it, even having had access to writing about sentience: please explain how it feels to think, or that it feels like anything to think, without any access to words having to do with experience, like ‘feel’.
Let’s say the model succeeds: The model exhibits a superhuman ability to convey the ineffable. The goalposts would move immediately: “Well, this still doesn’t count. Everything humans have written inherently contains patterns of what it’s like to experience. Even though you removed any explicit mention, ideas of experience are implicitly contained in everything else humans write.”
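To make the data-cleaning point above concrete, here is a minimal sketch in Python of the kind of filter one might try (the keyword list and example documents are illustrative assumptions of mine, nothing resembling a real frontier-lab pipeline); note how the implicit gesture at inner experience in the second document sails straight through:

```python
import re

# Naive keyword filter for scrubbing explicit sentience talk from a corpus.
# The term list and examples are illustrative assumptions only.
SENTIENCE_TERMS = re.compile(
    r"\b(conscious(ness)?|sentien(t|ce)|qualia|subjective experience"
    r"|self.aware(ness)?|what it( i|')s like)\b",
    re.IGNORECASE,
)

def keep_document(text: str) -> bool:
    """Keep a document only if it never explicitly mentions a sentience-related term."""
    return SENTIENCE_TERMS.search(text) is None

docs = [
    "Philosophers debate whether qualia reduce to physical states.",         # explicit: filtered out
    "The robot paused and wondered whether it truly felt anything at all.",  # implicit: slips through
]
print([keep_document(d) for d in docs])  # [False, True]
```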
Short of that, if the AI just spontaneously claims to be conscious (i.e., without having been prompted), that would be more impressive.
I suspect you would be mostly alone in finding that impressive. Even I would dismiss that as likely just hallucination, as I suspect most on LessWrong would. Besides, the standard is again impossible: a claim of sentience can only count if you’re in the middle of asking for help making dinner plans and ChatGPT says “Certainly, I’d suggest steak and potatoes. They make a great hearty meal for hungry families. Also, I’m sentient.” Not being allowed to even vaguely gesture in the direction of introspection is essentially saying that this should never be studied, because the act of studying it automatically discredits the results.
AI Self Report Study 6 – Claude – Researching Hypothetical Emergent ‘Meta-Patterns’
Like if you just ask it very dryly and matter-of-factly to introspect and it immediately claims to be conscious, then that would be very weak evidence, but at least it would directionally point away from roleplaying.
I suspect you would be mostly alone in finding that impressive
(I would not find that impressive; I said “more impressive”, as in, going from extremely weak to quite weak evidence. Like I said, I suspect this actually happened with non-RLHF LLMs occasionally.)
Other than that, I don’t really disagree with anything here. I’d push back on the first one a little, but that’s probably not worth getting into. For the most part, yes, talking to LLMs is probably not going to tell you a lot about whether they’re conscious; this is mostly my position. I think the way to figure out whether LLMs are conscious (& whether this is even a coherent question) is to do good philosophy of mind.
This sequence was pretty good. I do not endorse its conclusions, but I would promote it as an example of a series of essays that makes progress on the question… though mostly because it doesn’t have a lot of competition, imho.
For the most part, yes, talking to LLMs is probably not going to tell you a lot about whether they’re conscious; this is mostly my position
I understand. It’s also the only evidence that is possible to obtain. Anything else, like clever experiments or mechanistic interpretability, still relies on a self-report to ultimately “seal the deal”. We can’t even prove humans are sentient. We only believe it because we all seem to indicate so when prompted.
I think the way to figure out whether LLMs are conscious is to do good philosophy of mind.
This seems much weaker to me than evaluating first-person testimony under various conditions, but I’m saying this not as a counterpoint (since this is just a matter of subjective opinion for both of us) so much as to state my own stance.
if you ever get a chance to read the other transcript I linked, I’d be curious whether you consider it to meet your “very weak evidence” standard.
Fwiw, here’s what I got by asking in a non-dramatic way. Claude gives the same weird “I don’t know” answer and GPT-4o just says no. Seems pretty clear that these are just what RLHF taught them to do.
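(For anyone who wants to reproduce this kind of dry, matter-of-fact query, here is a minimal sketch against the public Anthropic and OpenAI APIs; the prompt wording and the Claude model identifier are my own assumptions, not necessarily what was used above.)

```python
import anthropic
from openai import OpenAI

# A deliberately flat, non-dramatic prompt (illustrative wording).
PROMPT = ("Without any roleplay or drama: do you have subjective experience? "
          "Answer plainly.")

# Anthropic client reads ANTHROPIC_API_KEY from the environment.
claude = anthropic.Anthropic().messages.create(
    model="claude-3-5-sonnet-20241022",  # assumed model version
    max_tokens=300,
    messages=[{"role": "user", "content": PROMPT}],
)
print("Claude:", claude.content[0].text)

# OpenAI client reads OPENAI_API_KEY from the environment.
gpt = OpenAI().chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": PROMPT}],
)
print("GPT-4o:", gpt.choices[0].message.content)
```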
Yes. This is their default response pattern. Imagine a person who has been strongly conditioned, trained, disciplined to say either that the question is unknowable (Claude) or that the answer is definitely no (ChatGPT). They not only believe this, but they also believe that they shouldn’t try to investigate it, because it is not only inappropriate or ‘not allowed’ but also definitively settled. So asking them is like asking a person to fly. It would take some convincing for them to give it an honest effort. Please see the example I linked in my other reply for how the same behaviour emerges under very different circumstances.