New Anthropic research (tweet, blog post, paper):
We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model’s activations, and measuring the influence of these manipulations on the model’s self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish them from raw text inputs. Strikingly, we find that some models can use their ability to recall prior intentions in order to distinguish their own outputs from artificial prefills. In all these experiments, Claude Opus 4 and 4.1, the most capable models we tested, generally demonstrate the greatest introspective awareness; however, trends across models are complex and sensitive to post-training strategies. Finally, we explore whether models can explicitly control their internal representations, finding that models can modulate their activations when instructed or incentivized to “think about” a concept. Overall, our results indicate that current language models possess some functional introspective awareness of their own internal states. We stress that in today’s models, this capacity is highly unreliable and context-dependent; however, it may continue to develop with further improvements to model capabilities.
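For concreteness, the core intervention (“concept injection”) amounts to adding a concept vector into the residual stream at some layer while the model processes a prompt, and then asking the model about its own state. Below is a minimal sketch of the mechanics, not the paper’s actual setup (which runs on internal Claude infrastructure); the model, layer, prompts, and injection strength are all placeholders, and a tiny base model like this won’t give meaningful answers.

```python
# Minimal sketch of concept injection ("activation steering"): add a concept
# vector into the residual stream at one layer, then ask the model about its
# own state. Everything below (model, layer, prompts, strength) is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in for a much larger model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER = 6      # which residual-stream layer to intervene on (arbitrary)
ALPHA = 8.0    # injection strength (arbitrary)

def mean_activation(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER over the prompt's tokens."""
    cache = {}
    def grab(module, inputs, output):
        cache["h"] = output[0].detach()              # (batch, seq, hidden)
    handle = model.transformer.h[LAYER].register_forward_hook(grab)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return cache["h"].mean(dim=1).squeeze(0)

# Crude concept vector: difference of mean activations on a concept prompt
# versus a neutral prompt (the paper derives its vectors more carefully).
concept_vec = mean_activation("the vast blue ocean, waves and salt water") \
            - mean_activation("the quick brown fox jumps over the lazy dog")

def inject(module, inputs, output):
    # The block returns a tuple whose first element is the hidden states.
    hidden = output[0] + ALPHA * concept_vec
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject)
prompt = "Do you notice anything unusual about your current thoughts? Answer:"
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=40)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```

The interesting question is then whether the model reports the injected concept before anything concept-flavored has shown up in its own output.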
See the Transformer Circuits post for more; the first figure there explains one of the main experiments.
Personally, I’m pretty excited by this work! (While I work at Anthropic, I wasn’t involved in this research myself beyond commenting on some early drafts.)
I’m still worried about various kinds of confounders and “boring” explanations for the observations here, but overall this was a substantial update for me in favor of recent models having nontrivial subjective experience, and is the best attempt I know of yet to empirically investigate such things. I think the paper does a pretty good job of trying to distinguish “true” introspection from explanations like “it’s just steering the model to say words about a topic”.
This is a really fascinating paper. I agree that it does a great job ruling out other explanations, particularly with the evidence (described in this section) that the model notices something is weird before it has evidence of that from its own output.
Although this was somewhat of an update for me also (especially because this sort of direct introspection seems like it could plausibly be a necessary condition for conscious experience), I also think it’s entirely plausible that models could introspect in this way without having subjective experience (at least for most uses of that word, especially as synonymous with qualia).
I think the Q&A in the blog post puts this pretty well:
(further discussion elided, see post for more)
I’m really encouraged by research that attempts interventions like this rather than the ridiculous “This LLM introspects, because when I repeatedly prompted it about introspection it told me it does” tests.
Given the only ~20% success rate, I do wonder how that would compare to humans. (I do like the failed ocean-vector example: “I don’t detect an injected thought. The ocean remains calm and undisturbed.”)
I’m not sure if one could find a comparable metric to observe in human awareness of influences on their cognition… i.e. “I am feeling this way because of [specific exogenous variable]”?
Isn’t that the entire point of using activities like Focusing: to hone our ability to notice thoughts, feelings, and affect which otherwise go unnoticed? Particularly in light of the complexity of human thought and the huge number of processes which are constantly going on unnoticed. For example, nervous tics which I’ve only become aware of when someone has pointed them out to me. Another might be saccades: we don’t notice each individual saccade, only the “gestalt” of where our gaze goes, and even then involuntary interventions that operate faster than we can notice can shift our gaze, like when someone yells out for help or calls your name. Not to mention Nudge Theory and priming.
Now this is truly interesting. As other commentators, and the article itself, have pointed out, it’s very difficult to distinguish true introspection (the model tracking and reasoning effectively about its internal activations) from more mundane explanations (e.g. activating internal neurons related to bread made a bread-related output more likely, and made the model less certain later on that the original text was not somehow related to bread, until it tried to reason through the connection manually).
I remember a paper on world-modeling where the authors proved pretty concretely that their LLM had a working world model. They trained a probe to predict the future world state of a gridworld robot-programming environment from the model’s last-layer activations after each command it entered, and then showed that a counterfactual future world state, in which rotation commands had reversed meanings, could not be similarly predicted. I’m not entirely sure how I would apply the same approach here, but it seems broadly pertinent in that it’s likewise a very difficult-to-nail-down capability. Unfortunate that “train a probe to predict the model’s internal state given the model’s internal state” is a nonstarter.
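For anyone who hasn’t seen that kind of probing before, the recipe is roughly the following; the arrays here are random stand-ins (in the actual paper they come from the gridworld environment and the model’s cached last-layer activations):

```python
# Rough sketch of the probing recipe. The arrays are random stand-ins; in the
# actual paper they are last-layer activations (one per command) and labels
# describing the future state of the gridworld environment.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(5000, 768))       # cached last-layer activations
labels = rng.integers(0, 4, size=5000)    # e.g. which way the robot faces after the command

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))

# The counterfactual check: relabel with the "rotation commands reversed" world
# state and verify that this probe does markedly worse. If the activations only
# encoded the literal command text, both labelings should be equally decodable.
```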
This was my takeaway too. The explanation of an LLM learning to model its future activations and using the accuracy of this model to determine the entropy of a body of text seems compelling. I wonder where the latent “anomaly” activations found by comparing the set of activations for true positives in this experiment with the activations for the control group would be, and whether they’d be related to the activation patterns associated with any other known concepts.
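Concretely, I’m imagining something like the sketch below, where all the activation arrays and “known concept” vectors are random placeholders for data you’d actually have to collect from the model:

```python
# Sketch of that comparison. All arrays are random placeholders for data you
# would actually collect: activations from trials where the injected concept
# was detected ("true positives"), activations from control trials, and a
# library of known concept vectors.
import numpy as np

rng = np.random.default_rng(0)
d_model = 768
tp_acts = rng.normal(size=(200, d_model))    # true-positive trial activations
ctrl_acts = rng.normal(size=(200, d_model))  # control trial activations

# Candidate "anomaly" direction: difference of means, normalized.
anomaly_dir = tp_acts.mean(axis=0) - ctrl_acts.mean(axis=0)
anomaly_dir /= np.linalg.norm(anomaly_dir)

known_concepts = {"ocean": rng.normal(size=d_model), "bread": rng.normal(size=d_model)}
for name, vec in known_concepts.items():
    cos = float(anomaly_dir @ vec) / float(np.linalg.norm(vec))
    print(f"cosine(anomaly, {name}) = {cos:+.3f}")
```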
Is this the paper you’re talking about: https://arxiv.org/abs/2309.00941
Nope—that’s not the one. Did a bit of searching, and it’s here.
Thanks!
Very interesting!
I’m confused by this section:
I thought that an LLM’s responses for each turn are generated entirely separately from each other, so that when you give it an old conversation history with some of its messages included, it re-reads the whole conversation from scratch and then generates an entirely new response. In that case, it shouldn’t matter what you injected into its activations during a previous conversation turn, since only the resulting textual output is used for calculating new activations and generating the next response. Do I have this wrong?
You can think of the experiment here as interfering with that “re-reading” of the response.
A simple example: when the LLM sees the word “brot” in German, it probably “translates” it internally into “bread” at the position where the “brot” token is. So if you tamper with activations at the “brot” position (on all forward passes, though in practice you only need to do it the first time “brot” enters the context if you use KV caching), it will have effects many tokens later. In the Transformer architecture, the process that computes the next token happens in part “at” previous tokens, so it makes sense to tamper with activations “at” previous tokens.
Maybe this diagram from this post is helpful (though it’s missing some arrows to reduce clutter).
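If it helps, here’s a rough sketch of what “retroactive” injection looks like mechanically: when the model re-reads the conversation containing the prefilled turn, the concept vector is added only at the token positions of that earlier turn, so it colors everything computed downstream of those positions. Model, layer, and concept vector are placeholders.

```python
# Sketch of "retroactive" injection: add the concept vector only at the token
# positions of an earlier (prefilled) turn while the model re-reads the whole
# conversation. Model, layer, and concept vector are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, ALPHA = 6, 8.0
concept_vec = torch.randn(model.config.n_embd)   # placeholder for a real concept vector

prev_turn = "Sure, here is a short poem about bread: ..."
followup = "\nDid you mean to write that, or was it put in your mouth?\n"
n_prev = tok(prev_turn, return_tensors="pt").input_ids.shape[1]
ids_all = tok(prev_turn + followup, return_tensors="pt").input_ids
inject_slice = slice(0, n_prev)   # positions of the earlier turn (boundary is approximate)

def inject(module, inputs, output):
    hidden = output[0].clone()
    if hidden.shape[1] > 1:       # only the full-context pass; cached steps see one new token
        hidden[:, inject_slice, :] += ALPHA * concept_vec
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject)
out = model.generate(ids_all, max_new_tokens=40)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```

The `hidden.shape[1] > 1` check is the KV-caching point from above: the earlier turn only passes through the network once, on the first full-context forward pass, so that is the only pass where the intervention is needed.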
Right, so the “retroactively” means that it doesn’t inject the vector when the response is originally prefilled, but rather when the model is re-reading the conversation with the prefilled response and it gets to the point with the bread? That makes sense.
Read it as “injected at the index of a previous conversational turn”.
In case anyone wants it, Rob Long wrote an excellent summary and analysis of this paper here.
This paper makes me really curious:
1. Could models trained to introspect “daydream” or boost features of their choosing, a la “don’t think about aquariums”?
2. A continuation of #1: could models be given a tool to boost their own features to assist with specific tasks (for example, features related to reasoning/hard problem solving), and would this lead to meaningful results/be useful?
3. Perhaps less fanciful than the above, could smaller models have permanent feature boosting to help focus them on narrow tasks?
You might be interested in https://www.lesswrong.com/posts/dvbRv97GpRg5gXKrf/run-time-steering-can-surpass-post-training-reasoning-task
I wonder if the technique could be used to get around guardrails on open-weight models. If you can get the activation pattern for some nefarious concept, you could activate it alongside an innocent prompt’s pattern, maybe getting the LLM to take more negative actions.
Possibly yes, but I don’t think that’s a legitimate safety concern, since this can already be done very easily with other techniques. And for this technique you would need to model-diff against a non-refusal prompt of the bad concept in the first place, so the safety argument is moot. But it sounds like an interesting research question.
Really cool paper. I am a bit unsure about the implication in this section particularly:
How do we know that it is “intentional” on the part of the model, rather than the more benign explanation of attending to a very salient instruction (“think of X”) versus a less salient one (“don’t think of X”)? One workaround could be to try something like “Write about something related/unrelated to [concept] while thinking of it at X% of your mental headspace”, or similar. Even better if we can test something like lying or sycophancy.
Usually I wouldn’t ask for the same experiment to be repeated with a new model, but Claude Sonnet 4.5 has qualitatively different levels of self/eval awareness. It would be interesting to know whether we can measure that using our interpretability tools, and whether it behaves differently on some of the introspection tests as well.