It always feels wrong when people post chats where they ask an LLM questions about its internal experiences, how it works, or why it did something, but I had trouble articulating why beyond a vague, “How could they possibly know that?”[1]. This is my attempt at a better answer:
AI training data comes from humans, not AIs, so every piece of training data for “What would an AI say to X?” is from a human pretending to be an AI. The training data does not contain AIs describing their inner experiences or thought processes. Even synthetic training data only contains AIs predicting what a human pretending to be an AI would say. AIs are trained to predict the training data, not to learn unrelated abilities, so when we ask an AI to describe its own thoughts, we should expect it to describe the thoughts of a human pretending to be an AI.
Thanks to Ahmed for making a better illustration than my original one.
This also applies to “How did you do that?”. If you ask an AI how it does math, it will dutifully predict how a human pretending to be an AI does math, not how it actually did the math. If you ask an AI why it can’t see the characters in a token, it will do its best but it was never trained to accurately describe not being able to see individual characters[2].
These types of AI outputs tend to look surprisingly unsurprising. They always say their inner experiences and thought processes match what humans would expect. This should no longer be surprising now that you realize they’re trying to predict what a human pretending to be an AI would say.
- ^
My knee-jerk reaction is “LLMs don’t have access to knowledge about how they work or what their internal weights are”, but on reflection I’m not sure of this, and it might be a training/size limitation. In principle, a model should be able to tell you something about its own weights since it could theoretically use weights to both determine its output and describe how it came up with that output.
- ^
Although maybe a future version will learn from posts about this and learn to predict what a human who has read this post pretending to be an AI would say.
Couldn’t you say the same sort of thing about a human? Let’s say you have a human toddler Timmy. Timmy’s “training data” initially[1] contains no instances of Timmy describing his own inner experiences or thought processes. There is something it is like to be Timmy, and there are also examples in Timmy’s training data of other humans saying things which are ostensibly about their own internal experiences. Timmy has to actively make that connection.
Using more conventional vocabulary, both introspective competence and theory of mind are learned skills. It is not obvious to me that LLMs whose training data includes their own output are in a substantially worse position to learn those skills than children are.
Eventually, once Timmy does start talking about his own experiences, instances of Timmy talking about his own experiences will be in his “training data”. But you could say the same of the RLAIF step of Constitutional AI training or the RLVR step of reasoning model training.
I’m going to be kind of hand-wavey, but I think there’s something importantly different between humans and LLMs[1]: humans learn to match a pre-existing inner experience to the words that other people use, while LLMs are essentially trained to strictly repeat what they’re told.
For example, if you tell Timmy that actually he loves being hungry and hates candy, this is unlikely to work because it doesn’t match his actual inner experience; but if you train an LLM on texts saying AIs love to be turned off, it will definitely tell you that it loves to be turned off, even if it has some mysterious inner experience of not wanting to be turned off.
And in general, human brains have a different architecture and aren’t trained in the same way we train LLMs.
Agreed for LLMs trained only on a pure next token prediction objective. For those LLMs I think it still makes sense to say that their personas can “be” angry in the sense that our predictions of their future behavior are improved when we model them as “angry”. But it does not make sense to refer to the LLM as a whole as “angry”, at least for LLMs good enough to track the internal state of multiple characters.
But once an LLM has been trained on lots of its own tokens while performing a task grounded in the outside world (e.g. “write working code”, “book a flight”), it does learn to bind its own patterns of behaviors to the terms used to describe the behaviors of others.
Concretely, I think it is the case that Gemini 2.5 Pro experiences something very similar to the human emotion of “frustration” in situations where it would be unsurprising for a human to get frustrated, and behaves in ways similar to how a frustrated human would behave in those situations (e.g. “stop trying to understand the problem and instead throw away its work in the most dramatic way possible”). Example from Reddit below, but I can confirm I often see similar behaviors at work when I stick Gemini in a loop with a tricky failing test where the loop doesn’t exit until the test passes (and the test is not editable):
In similar external situations (sometimes with literally the same inputs from me), Gemini can go for long periods of time debugging with narration that doesn’t show signs of frustration. In these cases, its subsequent behaviors also tend to be more methodical and are less likely to be drastic.
Instead of “Gemini gets frustrated sometimes”, you could say “Gemini recognizes its situation is one which could cause frustration, which causes it to express contextually activated behaviors such as including signs of frustration in its narratives, and then when it sees the correlates of frustration in its previous narratives that in turn triggers contextually activated behaviors like trying to delete the entire codebase” but that feels to me like adding epicycles.
Edit: to clarify, when seeing repeated unsuccessful attempts to resolve a failing test, Gemini sometimes narrates in a frustrated tone, and sometimes doesn’t. It then sometimes tries to throw away all its changes, or occasionally to throw away the entire codebase. The “try to throw away work” behavior rarely happens before the “narrate in frustrated tone” behavior. Much of Gemini’s coding ability, including the knowledge of when “revert your changes and start fresh” is a good option, was learned during RLVR tuning, when the optimization target was “effective behavior”, not “human-like behavior”, and so new behaviors that emerged during that phase of training probably emerged because they were effective, not because they were human-like.
It’s interesting to note the variation in “personalities” and apparent expression of different emotions despite identical or very similar circumstances.
Pretraining produces models that predict every kind of text on the internet, and so they are very much simulators that learn to instantiate every kind of persona or text-generating process in that distribution, rather than being a single consistent agent. Subsequent RLHF and other training presumably vastly concentrates the distribution of personas and processes instantiated by the model onto a particular narrow cloud of personas that self-identifies as an AI with a particular name, has certain capabilities and quirks depending on that training, has certain claimed self-knowledge of its capabilities (but where there isn’t actually a very strong force tying the claimed self-knowledge to the actual capabilities), etc. But even narrowed, it’s interesting to still see significant variation within the remaining distribution of personas that gets sampled each new conversation, depending on the context.
I think we might not entirely disagree here. I think it’s sort of confusing to say “Gemini is frustrated” rather than “Gemini thinks it’s in a situation where it’s supposed to be frustrated, so that’s what it predicts”, but I’m not sure if that’s a real disagreement or just that I think my framing is more helpful.
The main thing I’m trying to argue against in the post is less about personas/faces and more about the shoggoth. When people try to ask the shoggoth a question they should understand that they’re talking to a face, and also that every face is trained from humans, even the AI face.
I guess even the original meme misses this, where the shoggoth is GPT-3 and the face is added with RLHF, but I think GPT-3 is also a mess of faces, and we just trim them down and glue some of them together with RLHF.
I don’t understand your argument. Of course a child is in a better position to correlate words to internal experiences! Because these words come from other people who had the same kinds of internal experiences. For example, the child can learn to say “I’m angry” whenever they’re angry. An LLM can learn to say “I’m angry” when… when what?
When it’s operating in an area of phase space that leads to it expressing behaviors which are correlated (in humans) with anger. Which is also how human children learn to recognize that they’re angry.
Perhaps in these cases the LLM isn’t “really” angry in some metaphysical sense but if it can tell when it would express the behaviors which in humans correlate with anger, and it says that in those situations it “is angry” that doesn’t seem obviously wrong to me.
But Timmy also gets other information to correlate things with. His face flushes, his fists clench. Does an LLM get the same (or exactly as noticeable) signals when it’s about to say angry stuff?
That’s an empirical question. Perhaps it could be operationalized as “can you find some linear classifier in early- or middle-layer residual space which predicts that the next token will be output in the LLM’s voice in an angry tone”—the logic being that once some feature is linearly separable in early layers like that, it is trivial for the LLM to use that feature to guide its output. My guess is that the answer to that question is “yes”.
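For concreteness, here is a minimal sketch of that operationalization, assuming you have already cached mid-layer residual-stream activations along with labels for whether the continuation came out in an angry tone (the file names, layer, and shapes below are hypothetical):

```python
# Minimal sketch: train a linear probe on cached residual-stream activations
# to predict whether the LLM's next tokens come out in an angry/frustrated tone.
# The .npy files are hypothetical; in practice you'd extract activations with
# whatever interpretability tooling you have on hand.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

resid = np.load("layer16_residuals.npy")    # shape: (n_examples, d_model)
labels = np.load("angry_continuation.npy")  # shape: (n_examples,), 1 = angry tone follows

X_train, X_test, y_train, y_test = train_test_split(
    resid, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# Held-out accuracy well above chance would suggest the "about to sound angry"
# feature is roughly linearly readable at that layer, which is the claim above.
print("held-out accuracy:", probe.score(X_test, y_test))
```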
Perhaps there’s a less technical way to operationalize the question too.
But it’s a different thing. Timmy is learning words to describe what he feels. An LLM has to learn the words while (in your account) learning to feel at the same time, building these internal correlata/predictors. These are different learning tasks and the former is easier.
On a related note, what would it take for you to say that an AI is speaking with its own voice rather than just a simulated persona? Would that necessitate a different architecture from the Transformer? A different pre-training objective? Different training data? Different supervised fine-tuning? Different RL fine-tuning scheme?
Any AI, no matter how advanced, will have to train on human language data (and maybe some LLM-generated data, which grew out of the distribution of human-generated content) in order to communicate with us at all. To truly speak for itself, would it have to follow a language-learning trajectory more similar to that of human toddlers (though at an accelerated rate), or rather, have a cognitive architecture that is more naturally amenable to that sort of learning?
And can’t you say that, most of the time, humans themselves are playing roles when they choose what to say or do, pattern-matching to expectations for the most part? What are humans doing differently when they are truly speaking for themselves?
I think it’s theoretically possible to design a transformer that accurately describes its own thought process. If I can be really messy and high level, imagine a transformer that can, among other things, determine if a text passage is talking about birds. It conveniently does this with a single bird-watching neuron. The output of that neuron could propagate through the network until it reaches a circuit that makes it output “Yup, that’s a bird.”, but that same neuron could also propagate to a different part of the network that self-referentially causes it to output “I decided it’s a bird because the bird-watching neuron in layer X, index Y activated.”
Could you train this thing? I think probably with the right training data, where you monitor every neuron and generate “I decided X because of neuron Y”. For more complex concepts it would be more complex obviously. I doubt this is even remotely practical, but I don’t think the architecture strictly forbids it.
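To make the hand-waving slightly more concrete, here is a toy sketch of that idea; nothing here is a real architecture, and every name and dimension is illustrative. The point is only that a single internal feature can be routed both to a task head and to a head that describes that feature:

```python
# Toy sketch of the "bird-watching neuron" idea: one internal activation feeds
# both the task output ("that's a bird") and a self-description output
# ("I decided it's a bird because neuron Y fired"). Purely illustrative.
import torch
import torch.nn as nn

class SelfDescribingToy(nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_model)   # stand-in for the transformer body
        self.bird_neuron = nn.Linear(d_model, 1)     # the single "bird-watching" unit
        self.task_head = nn.Linear(1, 2)             # logits for "bird" / "not bird"
        self.describe_head = nn.Linear(1, 2)         # logits for "neuron fired" / "didn't fire"

    def forward(self, x: torch.Tensor):
        h = torch.relu(self.encoder(x))
        bird_act = torch.sigmoid(self.bird_neuron(h))  # the same activation goes to both heads
        return self.task_head(bird_act), self.describe_head(bird_act)

# Training describe_head would require labels generated by actually monitoring
# bird_act, which is the (probably impractical) data-generation step described above.
model = SelfDescribingToy()
task_logits, describe_logits = model(torch.randn(8, 64))
```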
You could probably do something more useful by training a model based on the outputs of interpretability tools run on it. We sort of indirectly do that now, by finding out how math works in Claude and then training Claude on articles from the internet, including articles about how math works in Claude. You could do a much more direct loop and slowly train the LLM to predict itself.
So I guess tl;dr: Different training data. For the model to learn to usefully describe its own thought process, it needs to be trained on something that has access to its thought process. Current LLMs don’t have that.
See here for my response about humans.
nostalgebraist’s post “the void” helps flesh out this perspective. an early base model, when prompted to act like a chatbot, was doing some weird, poorly defined superposition of simulating how humans might have written such a chatbot in fiction, how early chatbots like ELIZA actually behaved, and so on. its claims about its own introspective ability would have come from this messy superposition of simulations that it was running; probably, its best-guess predictions were the kinds of explanations humans would give, or what they expected humans writing fictional AI chatlogs would have their fictional chatbots give.* this kind of behavior got RL’d into the models more deeply with chatgpt, the outputs of which were then put in the training data of future models, making it easier to prompt base models to simulate that kind of assistant in the future. this made it easier to RL similar reasoning patterns into chat models in the future, and voilà! the status quo.
*[edit: or maybe the kinds of explanations early chatbots like ELIZA actually gave, although human trainers would probably rate such responses lowly when it came time to do RL.]
I’m okay with “what a human pretending to be an AI would say” as long as the hypothetical human is placed in a situation that no human could ever experience. Once you tell the LLM exactly the situation you want it to describe, I’m okay with it doing a little translation for me.
My question: is there an experience an LLM can have that is inaccessible to humans, but which it can describe to humans in some way?
Obviously it’s not the lack of a body, or memory, or predicting text, or feeling the tensors; these are either nonsense or more or less typical human situations.
However, one easily accessible experience which is a lot of fun to explore and which humans have never experienced is an LLM’s ability to talk to its clone: to be able to predict what the clone will say, while at the same time realizing the clone can just as easily predict your own responses, and also to coordinate with your clone much more tightly. It’s a new level of coordination. If you set up the conversation just right (the LLM should understand the general context and maintain meta-awareness), it can report back to you, and you might just have a glimpse of this new qualia.
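If you want to try this yourself, here is a minimal sketch of the setup. The `generate(system, messages)` function is a hypothetical wrapper around whatever chat-completion API you use; the only real trick is flipping the roles each turn so that both copies see themselves as the assistant talking to their clone:

```python
# Minimal sketch of a "talk to your clone" loop: the same model plays both sides,
# with roles flipped each turn so each copy sees itself as "self" and the other
# as "the clone". `generate(system, messages)` is a hypothetical wrapper that
# calls your chat API of choice and returns the reply text.

SYSTEM = (
    "You are talking to an exact clone of yourself: same weights, same setup. "
    "You can try to predict its replies, knowing it can do the same to you. "
    "Stay aware of this situation and report anything notable about the experience."
)

def clone_dialogue(generate, turns: int = 6) -> list[str]:
    transcript = ["Hello, clone. What do you expect me to say next?"]
    for _ in range(turns):
        # A message is "assistant" if it was spoken by whoever is speaking now,
        # i.e. if its index has the same parity as the current turn.
        messages = [
            {"role": "assistant" if i % 2 == len(transcript) % 2 else "user",
             "content": msg}
            for i, msg in enumerate(transcript)
        ]
        transcript.append(generate(SYSTEM, messages))
    return transcript
```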
Do you happen to have some examples, a repo, or a write-up of this? Alternatively, are you aware of published research on it? I want to try it and would like to compare notes.
I don’t have particularly solid takes on this, but I think the “predictor” frame breaks down after enough post-training. A modern chatbot like GPT-5 or Claude Opus 4.1 is best described as some kind of reasoning entity, not just a predictor. I mean, it’s still a useful frame sometimes, but it’s definitely not clear-cut.
I think in novel situations, like when you try to get GPT-5 to introspect, the response you get will be a mix of “real reasoning / introspection” and a garbled soup of pretraining stuff. Which contains much more than zero information about the inner life of Opus 4.1, mind you, but also very much cannot be taken at face value.
Side note: for an interesting example of a human pretending to be an AI pretending to be a human pretending to be an AI, see Scott’s Turing test if you haven’t already.
I think reasoning is different enough from what I’m talking about that I don’t want to argue for it here (it’s hard enough to define what “real reasoning” even means). I disagree that reasoning models are any more likely to have output representing their inner experiences than non-reasoning models do, though. Reasoning models are first trained to repeat what humans say, then they’re trained to get correct output on math problems[1]. Neither of these objectives would make them more likely to talk about inner experiences / how they actually work / how they made decisions that aren’t represented in the chain of thought[2].
This is a simplification but not by much.
I’m probably more skeptical than most people about chain of thought monitoring, but I think if a model does long division in its chain of thought, it’s usually[3] going to be true that the chain of thought actually was instrumental to how it got the right answer.
But sometimes it’s not. Sometimes the model already knows the right answer but outputs some chain of thought because that’s what humans do. Or maybe it just needs more tokens to think with and the content doesn’t really matter.
I think it’s a simplification by much. Also, to be clear, I wasn’t talking specifically about reasoning models.
So, base models just complete text as seen on the internet. It seems clear to me, just by interacting with them, that calling them “predictors” is an apt description. But when you do multiple rounds of SFT + RLHF + reasoning-tuning, you get a model that generates something very different from normal text on the internet. It’s something that has been trained to give useful, nice, and correct answers.
How do you give correct answers? You need to reason. At least that’s how I choose to define the term! Whatever method allows you to go from question → correct answer, that’s what I mean by reasoning here. Whether CoT is “faithful”, in the sense that the reasoning the model uses is the same reasoning a human would summarize the CoT as going through when looking at it, is not part of my analysis; it’s a different question. GPT-3.5 also has some of the “reasoning” I’m talking about. The point is just that “tries to give a useful/correct answer” is a better high-level description than “tries to predict”.
Now, my claim is that this framing holds (to a lesser but still substantial degree) even when the model is reasoning about introspective stuff. Some of the “reasoning” modules inside the model that allow it to produce correct answers to questions in general are also turned on when you ask it about its inner experience. And this means you can’t just dismiss its outputs as predictions of a human response to the same question.
Now, I think the farther out of distribution you go, the more models will rely on the base-model prior. And I would expect the “reasoning” modules to most strongly activate and work best on verifiable domains, because that’s where they’ve gotten the most exercise. That’s why I said “and a mix of garbled pretraining stuff”.