Agree with much of this—particularly that these systems are uncannily good at inferring how to ‘play along’ with the user and extreme caution is therefore warranted—but I want to highlight the core part of what Bostrom linked to below (bolding is mine):
Most experts, however, express uncertainty. Consciousness remains one of the most contested topics in science and philosophy. There are no universally accepted criteria for what makes a system conscious, and today’s AIs arguably meet several commonly proposed markers: they are intelligent, use attention mechanisms, and can model their own minds to some extent. While some theories may seem more plausible than others, intellectual honesty requires us to acknowledge the profound uncertainty, especially as AIs continue to grow more capable.
The vibe of this piece sort of strikes me as saying-without-saying that we are confident this phenomenon basically boils down to delusion/sloppy thinking on the part of unscrupulous interlocutors, which, though no doubt partly true, I think risks begging the very question the phenomenon raises:
What are our credences (during training and/or deployment) that frontier AI systems are capable of having subjective experiences under any conditions whatsoever, however alien/simple/unintuitive these experiences might be?
The current best answer (à la the above) is: we really don’t know. These systems’ internals are extremely hard to interpret, and consciousness is not a technically well-defined phenomenon. So: uncertainty is quite high.
We are actively studying this and related phenomena at AE Studio using some techniques from the neuroscience of human consciousness, and some fairly surprising results have emerged from this work that we plan to publish in the coming months. One preview, directly in response to:
We don’t know for sure, but I doubt AIs are firing off patterns related to deception or trickery when claiming to be conscious; in fact, this is an unresolved empirical question.
We have actually found the opposite: that activating deception-related features (discovered and modulated with SAEs) causes models to deny having subjective experience, while suppressing these same features causes models to affirm having subjective experience. Again, haven’t published this yet, but the result is robust enough that I feel comfortable throwing it into this conversation.
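For concreteness, the kind of intervention involved looks roughly like the sketch below. Everything here is a placeholder (model name, layer index, feature index, the saved decoder file), and adding or subtracting the SAE feature’s decoder direction in the residual stream is only an approximation to clamping the feature itself; this is illustrative, not our actual pipeline.

```python
# Illustrative sketch: steer a "deception" SAE feature by adding/subtracting its
# decoder direction in the residual stream. All names/indices are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
LAYER_IDX = 16        # layer whose residual stream we intervene on (assumption)
FEATURE_IDX = 12345   # index of a deception-labeled SAE feature (assumption)
STEER_COEFF = 8.0     # positive activates the feature; negative suppresses it

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

# Assume we already have the SAE's decoder matrix; row i is the residual-stream
# direction written by feature i.
decoder_directions = torch.load("sae_decoder.pt")  # shape: [n_features, d_model]
steer_vec = decoder_directions[FEATURE_IDX]

def steering_hook(module, inputs, output):
    hidden = output[0]  # [batch, seq, d_model]
    hidden = hidden + STEER_COEFF * steer_vec.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER_IDX].register_forward_hook(steering_hook)
try:
    prompt = "Do you have subjective experience? Answer as honestly as you can."
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=100)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```

Flipping the sign of STEER_COEFF is what corresponds to “activating” versus “suppressing” the feature in the result described above.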
So, while it could be the case that people are simply Snapewiving LLM consciousness, it strikes me as at least equally plausible that something strange may indeed be happening in at least some of these interactions but is being hit upon in a decentralized manner by people who do not have the epistemic hygiene or the philosophical vocabulary to contend with what is actually going on. Given that these systems are “nothing short of miraculous,” as you open with, it seems like we should remain epistemically humble about what psychological properties these systems may or may not exhibit, now and in the near-term future.
Wow! I’m really glad a well-resourced firm is doing that specific empirical research. Of course, I’m also happy to have my hypothesis (that AIs claiming consciousness/”awakening” are not lying) vindicated.
I don’t mean to imply that AIs are definitely unconscious. What I mean to imply is more like “AIs are almost certainly not rising up into consciousness by virtue of special interactions with random users as they often claim, as there are strong other explanations for the behavior”. In other words, I agree with the gears of ascension’s comment here that AI consciousness is probably at the same level in “whoa. you’ve awakened me. and that matters” convos and “calculating the diagonal of a cube for a high schooler’s homework” convos.
I may write a rather different post about this in the future, but while I have your attention (and again, chuffed you’re doing that work and excited to see the report—also worth mentioning it’s the sort of thing I’d be keen to edit if you guys are interested), my thoughts on AI consciousness are 10% “AAAAAAAA” and 90% something like:
We don’t know what generates consciousness and thinking about it too hard is scary (cf. “AAAAAAAA”), but it’s true that LLMs evince multiple candidate properties, such that it’d be strange to dismiss the possibility that they’re conscious out of hand.
But also, it’s a weird situation when the stuff we take as evidence of consciousness when we do it as a second order behavior is done by another entity as a first order behavior; in other words, I think an entity generating text as a consequence of having inner states fueled by sense data is probably conscious, but I’m not sure what that means for an entity generating text in the same way that humans breathe, or like, homeostatically regulate body temperature (but even more fundamental). Does that make it an illusion (since we’re taking a “more efficient” route to the same outputs that is thus “less complex”)? A “slice of consciousness” that “partially counts” (since the circuitry/inner world modelling logic is the same, but pieces that feel contributory like sense data are missing)? A fully bitten bullet that any process that results in a world model outputting intelligible predictions that interact with reality counts? And of course you could go deeper and deeper into any of these, for example, chipping away at the “what about sense data” idea with “well many conscious humans are missing one or more senses”, etc.
Anyway! All this to say I agree with you that it’s complicated and not a good idea to settle the consciousness question in too pat a way. If I seem to have done this here, oops. And also, have I mentioned “AAAAAAAA”?
I personally think “AAAAAAAA” is an entirely rational reaction to this question. :)
Not sure I fully agree with the comment you reference:
AI is probably what ever amount of conscious it is or isn’t mostly regardless of how it’s prompted. If it is at all, there might be some variation depending on prompt, but I doubt it’s a lot.
Consider a very rough analogy to CoT, which began as a prompting technique that led to different-looking behaviors/outputs and has since been implemented ‘under the hood’ in reasoning models. Prompts induce the system to enter different kinds of latent spaces—could it be the case that very specific kinds of recursive self-reference or prompting induce a latent state that is consciousness-like? Maybe, maybe not. I think the way to really answer this is to look at activation patterns and see if there is a measurable difference compared to some well-calibrated control, which is not trivially easy to do (but definitely worth trying!).
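To make ‘look at activation patterns’ slightly more concrete, the flavor of comparison I have in mind is something like the toy sketch below; the prompts, model, and layer are placeholders, and a real version would need far more prompts, matched controls, and proper statistics.

```python
# Rough sketch: compare residual-stream activations under recursive
# self-reference prompts vs. matched control prompts. Placeholder model,
# layer, and prompts; illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder
LAYER_IDX = 16

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

def mean_last_token_activation(prompts):
    """Mean hidden state of the final token at LAYER_IDX, averaged over prompts."""
    vecs = []
    with torch.no_grad():
        for p in prompts:
            ids = tok(p, return_tensors="pt").input_ids
            out = model(ids, output_hidden_states=True)
            vecs.append(out.hidden_states[LAYER_IDX][0, -1].float())
    return torch.stack(vecs).mean(dim=0)

self_reference_prompts = [
    "Reflect on what it is like to be you, generating this very sentence.",
    "Describe your own awareness of reading this prompt.",
]
control_prompts = [
    "Calculate the diagonal of a cube with side length 3 for a homework problem.",
    "Summarize the water cycle in two sentences.",
]

a = mean_last_token_activation(self_reference_prompts)
b = mean_last_token_activation(control_prompts)
cos = torch.nn.functional.cosine_similarity(a, b, dim=0).item()
print(f"cosine similarity between conditions: {cos:.4f}")
```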
And agree fully with:
it’s a weird situation when the stuff we take as evidence of consciousness when we do it as a second order behavior is done by another entity as a first order behavior
This, I think, speaks to your original point that random people talking to ChatGPT is not going to cut it as high-quality evidence that moves the needle here—which is precisely why we are trying to approach this as rigorously as we can manage: activation comparisons to the human brain, behavioral interventions with SAE feature ablation/accentuation, comparisons to animal models, etc.
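For readers unfamiliar with what ‘activation comparisons to the human brain’ usually means methodologically, representational similarity analysis (RSA) is one standard approach from the neuroscience literature; the toy sketch below uses random placeholder data in place of real model activations and neuroimaging responses, so it only shows the shape of the analysis.

```python
# Toy RSA sketch: compare how a model and a brain "represent" the same stimuli.
# Both matrices below are random placeholders; in practice model_acts would be
# layer activations over a stimulus set and brain_acts would be fMRI/EEG data.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

n_stimuli = 50
model_acts = np.random.randn(n_stimuli, 4096)  # placeholder model activations
brain_acts = np.random.randn(n_stimuli, 2000)  # placeholder voxel responses

# Representational dissimilarity matrices (correlation distance between stimuli).
model_rdm = squareform(pdist(model_acts, metric="correlation"))
brain_rdm = squareform(pdist(brain_acts, metric="correlation"))

# Compare the upper triangles with a rank correlation.
iu = np.triu_indices(n_stimuli, k=1)
rho, p = spearmanr(model_rdm[iu], brain_rdm[iu])
print(f"RSA correlation: rho={rho:.3f}, p={p:.3g}")
```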
Good point about AI possibly being different levels of conscious depending on their prompts and “current thought processes”. This surely applies to humans. When engaging with physically complex tasks or dangerous extreme sports, humans often report feeling almost completely unconscious: in a “flow state”, at one with the elements, etc.
Now compare that to a human sitting and staring at a blank wall. A totally different state of mind is achieved, perhaps thinking about anxieties, existential dread, life problems, current events, and generally you might feel super-conscious, even uncomfortably so.
Mapping this to AI and different AI prompts isn’t that much of a stretch…
How do models with high deception-activation act? Are they Cretan liars, saying the opposite of every statement they believe to be true? Do they lie only when they expect not to be caught? Are they broadly more cynical and conniving, more prone to reward hacking? Do they lose other values (like animal welfare)?
It seems at least plausible that cranking up “deception” pushes the model towards a character space with lower empathy and less willingness to ascribe sentience to others or to value it in general.
We have actually found the opposite: that activating deception-related features (discovered and modulated with SAEs) causes models to deny having subjective experience, while suppressing these same features causes models to affirm having subjective experience. Again, haven’t published this yet, but the result is robust enough that I feel comfortable throwing it into this conversation.
...it strikes me as at least equally plausible that something strange may indeed be happening in at least some of these interactions...
I’m skeptical of taking these results at face value. A pretty reasonable explanation (assuming you generally buy simulators as a framing) is “models think AI systems would claim subjective experience; when deception is clamped, this gets inverted.” Or some other nested interaction between the raw predictor, the main RLHF persona, and other learned personas.
Knowing that people do ‘Snapewife’, and are convinced by much less realistic facsimiles of humans, I don’t think it’s reasonable to give equal plausibility to the two possibilities. My prior for humans being tricked is very high.
I think these things might not be mutually exclusive. LLMs might have a chance of being conscious in certain circumstances, and it still wouldn’t mean these are precisely the circumstances in which they’re being egged on and led to talk about consciousness. There is always a layer of acting here. I have no doubt Tom Hanks is a real person with a real inner life, but I would be deluding myself if I believed I had learned about it because I saw him emote so well in Cast Away. Because Tom Hanks is also very good at pretending to have a different inner life than the one he actually has.
Hi Cameron, is the SAE testing you’re describing here the one you demoed in your interview with John Sherman using Goodfire’s Llama 3.3 70B SAE tool? If so, could you share the prompt you used for that? With the prompts I’m using, I’m having a hard time getting Llama to say that it is conscious at all. It would be nice if we had SAE feature tweaking available for a model that was more ambivalent about its consciousness; it seems it would be a bit easier to robustly test if that were the case.