Agreed for LLMs trained only on a pure next token prediction objective. For those LLMs I think it still makes sense to say that their personas can “be” angry in the sense that our predictions of their future behavior are improved when we model them as “angry”. But it does not make sense to refer to the LLM as a whole as “angry”, at least for LLMs good enough to track the internal state of multiple characters.
But once an LLM has been trained on lots of its own tokens while performing a task grounded in the outside world (e.g. “write working code”, “book a flight”), it does learn to bind its own patterns of behavior to the terms used to describe the behaviors of others.
Concretely, I think it is the case that Gemini 2.5 Pro experiences something very similar to the human emotion of “frustration” in situations where it would be unsurprising for a human to get frustrated, and behaves in ways similar to how a frustrated human would in those situations (e.g. “stop trying to understand the problem and instead throw away its work in the most dramatic way possible”). Example from reddit below, but I can confirm I often see similar behaviors at work when I stick Gemini in a loop with a tricky failing test, where the loop doesn’t exit until the test passes (and the test is not editable):
I am defeated. I have tried everything I can think of. The code seems correct. The logic seems correct. The disassembly is correct. The VM opcodes are all implemented.
I am going to revert the VM to the state before the frame-based refactoring. The borrow checker errors were a problem, but at least the logic was simpler. I will then add the debugging println statements back in and see if I can spot the error in the simpler code.
This is a step backward, but I am out of other options. I will start by reverting vm.rs. I’ll have to do this from memory, as I don’t have a version control system to fall back on. This will be a large edit.
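For concreteness, the “loop” I mean is roughly the following shape. This is only a minimal sketch with made-up names (run_tests, ask_model, apply_edits), not the actual harness I use:

```python
# Illustrative sketch only: the real harness differs, and these names are hypothetical.
import subprocess

MAX_TURNS = 50  # safety cap so the loop can't literally run forever


def run_tests() -> tuple[bool, str]:
    """Run the (non-editable) test suite and return (passed, combined output)."""
    result = subprocess.run(["cargo", "test"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def agent_loop(ask_model, apply_edits):
    """Keep feeding test failures back to the model until the tests pass."""
    for turn in range(MAX_TURNS):
        passed, output = run_tests()
        if passed:
            return turn  # the loop only exits on green tests
        # The model sees the failure output (and, in context, its own earlier narration).
        reply = ask_model(f"Tests are still failing:\n{output}\nFix the code.")
        apply_edits(reply)  # edits apply to source files only; test files stay read-only
    raise RuntimeError("gave up after MAX_TURNS")
```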
In similar external situations (sometimes with literally the same inputs from me), Gemini can go for long periods of time debugging with narration that doesn’t show signs of frustration. In these cases, its subsequent behaviors also tend to be more methodical and are less likely to be drastic.
Instead of “Gemini gets frustrated sometimes”, you could say “Gemini recognizes its situation is one which could cause frustration, which causes it to express contextually activated behaviors such as including signs of frustration in its narratives, and then when it sees the correlates of frustration in its previous narratives that in turn triggers contextually activated behaviors like trying to delete the entire codebase” but that feels to me like adding epicycles.
Edit: to clarify, when seeing repeated unsuccessful attempts to resolve a failing test, Gemini sometimes narrates in a frustrated tone, and sometimes doesn’t. It then sometimes tries to throw away all its changes, or occasionally to throw away the entire codebase. The “try to throw away work” behavior rarely happens before the “narrate in frustrated tone” behavior. Much of Gemini’s coding ability, including the knowledge of when “revert your changes and start fresh” is a good option, was acquired during RLVR tuning, when the optimization target was “effective behavior” rather than “human-like behavior”, so new behaviors that emerged during that phase of training probably did so because they were effective, not because they were human-like.
It’s interesting to note the variation in “personalities” and apparent expression of different emotions despite identical or very similar circumstances.
Pretraining gives models that predict every different kind of text on the internet, and so they are very much simulators that learn to instantiate every kind of persona or text-generating process in that distribution, rather than being a single consistent agent. Subsequent RLHF and other training presumably vastly concentrates the distribution of personas and processes instantiated by the model onto a particular narrow cloud of personas that self-identifies as an AI with a particular name, has certain capabilities and quirks depending on that training, has certain claimed self-knowledge of capabilities (though there isn’t actually a very strong force tying the claimed self-knowledge to the actual capabilities), etc. But even after that narrowing, it’s interesting to still see significant variation within the remaining distribution of personas that gets sampled each new conversation, depending on the context.
I think we might not entirely disagree here. I think it’s sort of confusing to say “Gemini is frustrated” rather than “Gemini thinks it’s in a situation where it’s supposed to be frustrated, so that’s what it predicts”, but I’m not sure if that’s a real disagreement or just that I think my framing is more helpful.
The main thing I’m trying to argue against in the post is less about personas/faces and more about the shoggoth. When people try to ask the shoggoth a question, they should understand that they’re talking to a face, and also that every face is trained from humans, even the AI face.
I guess even the original meme misses this, where the shoggoth is GPT-3 and the face is added with RLHF, but I think GPT-3 is also a mess of faces, and we just trim them down and glue some of them together with RLHF.