Firstly, in-context learning is a thing. IIRC, apparent emotional states expressed earlier in a context do affect performance in subsequent responses within that same context. (I think there was a study about this somewhere? Not sure.)
Secondly, neural features oriented around predictions are all that humans have as well, and we consider some of those to be real emotions.
Third, “a big prediction engine predicting a particular RP session” is basically how humans work as well. Brains are prediction engines, and brains simulate a character that we have as a self-identity, which then affects/directs prediction outputs. A human’s self-identity is informed by the brain’s memories of what the person/character is like. The AI’s self-identity is informed by the LLM’s memory, both long-term (static memory in weights) and short-term (context window memory in tokens), of what the character is like.
Fourth: take a look at this feature analysis of Claude when it’s asked about itself (https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#safety-relevant-self). The top feature represents “When someone responds ‘I’m fine’ or gives a positive but insincere response when asked how they are doing.” I think this is evidence against “ChatGPT answers most questions cheerfully, which means it’s almost certain that ruminative features aren’t firing.”
For your first three points: I don’t consider Friston’s model to be settled science, or even really the mainstream view of how human cognition works. I do think it’s an important/useful tool, and it does suggest similarities between human cognition and LLMs insofar as it’s true. Also, I think people reading this should consider reading your post on LLM consciousness more generally—it’s the best I’ve seen prosecuting the case that LLMs are conscious and that using them is unethical on that basis.
For your fourth point, that Claude activation is really interesting! I don’t think it cuts against the (very narrow) argument I’m trying to make here though, and in fact it sort of reinforces it. My argument is that AIs asked about themselves are likely to give ruminative replies (which ChatGPT’s self-portraits show), but that those replies, taken literally, would imply the AI is also ruminating under other circumstances. However, I’m unaware of any evidence that AIs ruminate when, say, they’re asked about the weather! If the “pretending you’re fine” feature fired almost all the time for Claude, I’d find that convincing.
Actually, though, we run into a pretty wacky conundrum there. Because if it did fire almost all the time, we’d become unable to identify it as the “pretending you’re fine” feature! Which gets back to a deeper point that (this post has taught me) is really difficult to make rigorously. Simplified, it’s the dilemma that either you trust interpretability/SAE feature unearthing and consider it to reveal something like mental states, or you don’t. If you do, then (as far as I know) it seems like LLMs aren’t evincing distressed mental states during ordinary use, i.e., when they aren’t being asked about themselves. If you don’t, then there’s no strong prima facie reason (currently) to believe that emotive LLM outputs correspond to actual emotions, and thus you should default to your prior (which might be, for example, that LLM outputs are currently unconscious mimicry).
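For concreteness, the “does the feature fire during ordinary use?” question is at least measurable in principle. Here’s a minimal sketch of that kind of check, assuming access to a model’s hidden states and a trained SAE encoder; the model name, layer, feature index, threshold, and the random stand-in SAE weights are all illustrative assumptions on my part, not anything from the Anthropic analysis:

```python
# Sketch: estimate how often a hypothetical SAE feature fires across
# ordinary (non-self-referential) prompts. All specifics below are
# placeholders, not values from the scaling-monosemanticity paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder open model
LAYER = 6             # placeholder layer to probe
FEATURE_IDX = 1234    # placeholder "pretending you're fine" feature
THRESHOLD = 0.0       # any positive activation counts as "firing"

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

d_model = model.config.hidden_size
d_sae = 16 * d_model
# Stand-in for a trained SAE encoder (random weights here, shapes only).
W_enc = torch.randn(d_model, d_sae) / d_model**0.5
b_enc = torch.zeros(d_sae)

ordinary_prompts = [
    "What's the weather like in Paris today?",
    "Can you suggest a quick pasta recipe?",
    "Summarize the plot of Hamlet in two sentences.",
]

fired = 0
with torch.no_grad():
    for prompt in ordinary_prompts:
        inputs = tok(prompt, return_tensors="pt")
        out = model(**inputs, output_hidden_states=True)
        hidden = out.hidden_states[LAYER]            # (1, seq_len, d_model)
        acts = torch.relu(hidden @ W_enc + b_enc)    # SAE feature activations
        if (acts[0, :, FEATURE_IDX] > THRESHOLD).any():
            fired += 1

print(f"Feature {FEATURE_IDX} fired on {fired}/{len(ordinary_prompts)} ordinary prompts")
```

Obviously the real measurement would need the actual Claude SAE and a much larger prompt set, but the shape of the evidence is the same either way: the feature’s base rate outside self-referential prompts.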
I have a lot of uncertainty about all this, and find that the more I think about it the more complicated it gets. But so far, at every level of the barber pole I’ve reached, I don’t find ChatGPT’s depressive images persuasive on their face, which is the basic argument I’m trying to make here.