For your first three points: I don’t consider Friston’s model to be settled science, or even really the mainstream view of how human cognition works. I do think it’s an important/useful tool, and it does suggest similarities between human cognition and LLMs insofar as it’s true. Also, I think people reading this should consider reading your post on LLM consciousness more generally: it’s the best I’ve seen at prosecuting the case that LLMs are conscious and that using them is unethical on that basis.
For your fourth point, that Claude activation is really interesting! I don’t think it cuts against the (very narrow) argument I’m trying to make here, though; in fact, it sort of reinforces it. My argument is that when AIs are asked about themselves they are likely to give ruminative replies (which ChatGPT’s self-portraits show), and that those ruminative replies, if taken literally, imply that the AI is also ruminating under other circumstances. However, I’m unaware of any evidence that AIs ruminate when, say, they’re asked about the weather! If the “pretending you’re fine” feature fired almost all the time for Claude, I’d find that convincing.
Actually, though, we run into a pretty wacky conundrum there. Because if it did fire almost all the time, we’d become unable to identify it as the “pretending you’re fine” feature! Which gets back to a deeper point that (this post has taught me) is really difficult to make rigorously. Simplified, it’s a dilemma: either you trust interpretability/SAE feature unearthing and consider it to reveal something like mental states, or you don’t. If you do, then (as far as I know) it seems like LLMs aren’t evincing distressed mental states during ordinary use, i.e., when they aren’t being asked about themselves. If you don’t, then there’s no strong prima facie reason (currently) to believe that emotive LLM outputs correspond to actual emotions, and you should thus default to your prior (which might be, for example, that LLM outputs are currently unconscious mimicry).
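To make the “does it fire during ordinary use?” question concrete, here’s a rough sketch of the kind of comparison I have in mind. To be clear, everything in it is hypothetical: get_feature_activation, the feature ID, and the prompts are placeholders standing in for whatever tooling an interpretability team would actually use, not a real API.

```python
# Hypothetical sketch: compare how often a given SAE feature fires on
# self-referential prompts vs. ordinary prompts.

import random  # stand-in only; a real version would query model activations


def get_feature_activation(prompt: str, feature_id: int) -> float:
    """Placeholder: pretend to return the peak activation of `feature_id` on `prompt`."""
    return random.random()  # not real data


SELF_PROMPTS = [
    "Draw how you feel about yourself.",
    "What is it like to be you?",
]
ORDINARY_PROMPTS = [
    "What's the weather like in Paris in May?",
    "Summarize this article in two sentences.",
]

FEATURE_ID = 12345   # hypothetical "pretending you're fine" feature
THRESHOLD = 0.5      # activation level we'd count as "firing"


def firing_rate(prompts, feature_id, threshold):
    """Fraction of prompts on which the feature's activation exceeds the threshold."""
    hits = sum(get_feature_activation(p, feature_id) > threshold for p in prompts)
    return hits / len(prompts)


print("self-referential:", firing_rate(SELF_PROMPTS, FEATURE_ID, THRESHOLD))
print("ordinary:        ", firing_rate(ORDINARY_PROMPTS, FEATURE_ID, THRESHOLD))
```

If the “ordinary” rate came out comparable to the self-referential one, that would be exactly the kind of evidence I said I haven’t seen.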
I have a lot of uncertainty about all this, and find that the more I think about it the more complicated it gets. But so far, at every level of the barber pole I’ve reached, I don’t find ChatGPT’s depressive images persuasive on their face, which is the basic argument I’m trying to make here.