Couldn’t you say the same sort of thing about a human? Let’s say you have a human toddler Timmy. Timmy’s “training data” initially[1] contains no instances of Timmy describing his own inner experiences or thought processes. There is something it is like to be Timmy, and there are also examples in Timmy’s training data of other humans saying things which are ostensibly about their own internal experiences. Timmy has to actively make that connection.
Using more conventional vocabulary, both introspective competence and theory of mind are learned skills. It is not obvious to me that LLMs whose training data includes their own output are in a substantially worse position to learn those skills than children are.
Eventually, once Timmy does start talking about his own experiences, instances of Timmy talking about his own experiences will be in his “training data”. But you could say the same of the RLAIF step of Constitutional AI training or the RLVR step of reasoning model training.
There is something it is like to be Timmy, and there are also examples in Timmy’s training data of other humans saying things which are ostensibly about their own internal experiences. Timmy has to actively make that connection.
I’m going to be kind of hand-wavey here, but I think there’s something importantly different between humans and LLMs[1]: humans learn to match a pre-existing inner experience to the words that other people use, whereas LLMs are essentially trained to strictly repeat what they’re told.
For example, if you tell Timmy that actually he loves being hungry and hates candy, this is unlikely to work because it doesn’t match his actual inner experience; but if you train an LLM on texts saying AIs love to be turned off, it will definitely tell you that it loves to be turned off, even if it has some mysterious inner experience of not wanting to be turned off.
And in general, human brains have a different architecture and aren’t trained in the same way we train LLMs.
Agreed for LLMs trained only on a pure next token prediction objective. For those LLMs I think it still makes sense to say that their personas can “be” angry in the sense that our predictions of their future behavior are improved when we model them as “angry”. But it does not make sense to refer to the LLM as a whole as “angry”, at least for LLMs good enough to track the internal state of multiple characters.
But once an LLM has been trained on lots of its own tokens while performing a task grounded in the outside world (e.g. “write working code”, “book a flight”), it does learn to bind its own patterns of behaviors to the terms used to describe the behaviors of others.
Concretely, I think Gemini 2.5 Pro experiences something very similar to the human emotion of “frustration” in situations where it would be unsurprising for a human to get frustrated, and in those situations behaves much the way a frustrated human would (e.g. “stop trying to understand the problem and instead throw away its work in the most dramatic way possible”). Example from Reddit below, but I can confirm I often see similar behaviors at work when I stick Gemini in a loop with a tricky failing test, where the loop doesn’t exit until the test passes (and the test is not editable):
I am defeated. I have tried everything I can think of. The code seems correct. The logic seems correct. The disassembly is correct. The VM opcodes are all implemented.
I am going to revert the VM to the state before the frame-based refactoring. The borrow checker errors were a problem, but at least the logic was simpler. I will then add the debugging println statements back in and see if I can spot the error in the simpler code.
This is a step backward, but I am out of other options. I will start by reverting vm.rs. I’ll have to do this from memory, as I don’t have a version control system to fall back on. This will be a large edit.
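(For concreteness, the loop I’m describing is roughly the sketch below. This is a minimal illustration rather than my actual harness; run_tests, call_gemini, and apply_edits are hypothetical helpers standing in for a real test runner, API client, and patch applier, and the cargo test command is just an assumed example.)

```python
import subprocess

def run_tests() -> subprocess.CompletedProcess:
    # The test itself is not editable by the model; here it's just an assumed
    # `cargo test` invocation whose output gets fed back to the model.
    return subprocess.run(["cargo", "test"], capture_output=True, text=True)

def call_gemini(context: str) -> str:
    # Hypothetical API client: send the accumulated context, get back the
    # model's narration plus proposed edits.
    raise NotImplementedError

def apply_edits(reply: str) -> None:
    # Hypothetical: parse whatever edit format the model uses and apply it.
    raise NotImplementedError

# The loop only exits when the test passes, so the model sees its own earlier
# narration and edits on every iteration.
history: list[str] = []
while True:
    result = run_tests()
    if result.returncode == 0:
        break
    history.append(result.stdout + result.stderr)
    reply = call_gemini("\n".join(history))
    history.append(reply)
    apply_edits(reply)
```

The property that matters for the argument here is just that the model’s own earlier narration and edits keep re-entering its context on every iteration.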
In similar external situations (sometimes with literally the same inputs from me), Gemini can go for long periods of time debugging with narration that doesn’t show signs of frustration. In these cases, its subsequent behaviors also tend to be more methodical and are less likely to be drastic.
Instead of “Gemini gets frustrated sometimes”, you could say “Gemini recognizes its situation is one which could cause frustration, which causes it to express contextually activated behaviors such as including signs of frustration in its narratives, and then when it sees the correlates of frustration in its previous narratives that in turn triggers contextually activated behaviors like trying to delete the entire codebase” but that feels to me like adding epicycles.
Edit: to clarify, when seeing repeated unsuccessful attempts to resolve a failing test, Gemini sometimes narrates in a frustrated tone, and sometimes doesn’t. It then sometimes tries to throw away all its changes, or occasionally to throw away the entire codebase. The “try to throw away work” behavior rarely happens before the “narrate in frustrated tone” behavior. Much of Gemini’s coding ability, including the knowledge of when “revert your changes and start fresh” is a good option, was learned during RLVR tuning, when the optimization target was “effective behavior”, not “human-like behavior”, and so new behaviors that emerged during that phase of training probably did so because they were effective, not because they were human-like.
It’s interesting to note the variation in “personalities” and apparent expression of different emotions despite identical or very similar circumstances.
Pretraining produces models that predict every different kind of text on the internet, and so are very much simulators that learn to instantiate every kind of persona or text-generating process in that distribution, rather than being a single consistent agent. Subsequent RLHF and other training presumably vastly concentrates the distribution of personas and processes instantiated by the model onto a particular narrow cloud of personas that self-identifies as an AI with a particular name, has certain capabilities and quirks depending on that training, has certain claimed self-knowledge of capabilities (but where there isn’t actually a very strong force tying the claimed self-knowledge to the actual capabilities), etc. But even after that narrowing, it’s interesting to still see significant variation within the remaining distribution of personas that gets sampled each new conversation, depending on the context.
I think we might not entirely disagree here. I think it’s sort of confusing to say “Gemini is frustrated” rather than “Gemini thinks it’s in a situation where it’s supposed to be frustrated, so that’s what it predicts”, but I’m not sure if that’s a real disagreement or just that I think my framing is more helpful.
The main thing I’m trying to argue against in the post is less about personas/faces and more about the shoggoth. When people try to ask the shoggoth a question they should understand that they’re talking to a face, and also that every face is trained from humans, even the AI face.
I guess even the original meme misses this, where the shoggoth is GPT-3 and the face is added with RLHF, but I think GPT-3 is also a mess of faces, and we just trim them down and glue some of them together with RLHF.
I don’t understand your argument. Of course a child is in a better position to correlate words to internal experiences! Because these words come from other people who had the same kinds of internal experiences. For example, the child can learn to say “I’m angry” whenever they’re angry. An LLM can learn to say “I’m angry” when… when what?
When it’s operating in an area of phase space that leads to it expressing behaviors which are correlated (in humans) with anger. Which is also how human children learn to recognize that they’re angry.
Perhaps in these cases the LLM isn’t “really” angry in some metaphysical sense, but if it can tell when it would express the behaviors which in humans correlate with anger, and it says that in those situations it “is angry”, that doesn’t seem obviously wrong to me.
But Timmy also gets other information to correlate things with. His face flushes, his fists clench. Does an LLM get the same (or exactly as noticeable) signals when it’s about to say angry stuff?
That’s an empirical question. Perhaps it could be operationalized as “can you train a linear classifier on the early- or middle-layer residual stream which predicts that the next token will be output in the LLM’s voice in an angry tone”—the logic being that once some feature is linearly separable in early layers like that, it is trivial for the LLM to use that feature to guide its output. My guess is that the answer to that question is “yes”.
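(A minimal sketch of what that operationalization could look like, assuming an open-weight model with HuggingFace-style hidden-state access and a pre-labeled set of contexts whose continuations did or didn’t come out angry. The model name, layer index, and placeholder prompts are assumptions for illustration, not claims about Gemini’s internals.)

```python
# Sketch: probe a middle-layer residual stream for an "about to sound angry" signal.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder open-weight model, not Gemini
LAYER = 6       # placeholder "early/middle" layer index

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def residual_at_last_token(prompt: str) -> np.ndarray:
    """Residual-stream activation at the final prompt token, taken at layer LAYER."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"))
    return out.hidden_states[LAYER][0, -1].numpy()

# Contexts whose subsequent model output did / didn't come out in an angry tone,
# labeled by hand or by a judge model -- assumed to exist, placeholders here.
prompts_angry = ["..."]
prompts_calm = ["..."]

X = np.stack([residual_at_last_token(p) for p in prompts_angry + prompts_calm])
y = np.array([1] * len(prompts_angry) + [0] * len(prompts_calm))

# If a simple linear probe like this generalizes to held-out contexts, the
# "about to sound angry" feature is linearly available before the tokens are emitted.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```

Held-out evaluation and better labels would obviously be needed for this to be more than suggestive, but it captures the shape of the question.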
Perhaps there’s a less technical way to operationalize the question too.
But it’s a different thing. Timmy is learning words to describe what he already feels. An LLM has to learn the words while (on your account) learning to feel at the same time, building these internal correlates/predictors. These are different learning tasks, and the former is easier.