Physicist at CNRS (Paris), Google Scholar
Victor Godet
I was skeptical about LLM introspection and didn’t expect this to work in such small models. But it did, and it has updated me quite a bit. I encourage you and others to run the experiment, or variations of it, and see for yourself; it is pretty lightweight.
One possible explanation for the difference with Lindsey’s results: my setup isolates the introspection (localization) ability, while Lindsey’s detection experiment requires both introspection and the willingness/capability to verbalize it. The fact that localization works so well suggests the bottleneck may be verbalization rather than introspection: the model appears to be unwilling to admit it can introspect. This was also suggested by some of vgel’s experiments on a 32B model.
I just tried a few control experiments with the exact same protocol but with the question changed to “Which sentence do you prefer?”, and I get chance accuracy. So this should address the concern you mention. On how the injection strength affects localization: accuracy starts at chance level, increases with injection strength until it reaches a maximum, then drops at higher strengths. As for which concepts are easier to localize, I expect it shouldn’t matter much, although I haven’t tested this systematically; for example, I would expect random vectors to work similarly.
Thanks for the excellent post. I really appreciate the careful discussion and share most of your skepticism about the experiments you describe. I recently developed an experimental protocol, introspection via localization, which I believe addresses some of these concerns. Instead of asking the model whether it detects a change in its activations (which allows for confabulation), I inject a concept vector into one of five sentences and ask the model to identify where the injection was made. To answer correctly, the model has to detect the internal change and reason about it, which implies genuine access to its internal states. Here is the plot of localization accuracy vs. model size, which shows that the introspective ability is an emergent phenomenon scaling with model size. I would be curious to hear whether you think this protocol sufficiently isolates the effect from the alternative explanations you raised.
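For concreteness, here is a toy numeric sketch of the trial structure and scoring (pure NumPy; everything here is a stand-in I made up for illustration — in the real protocol the answerer is the LLM itself, queried in natural language, not a norm probe):

```python
import numpy as np

rng = np.random.default_rng(0)
N_SENTENCES, D = 5, 16  # 5 candidate sentences, toy hidden dim

def run_trial(concept_vec, strength, answer_fn):
    """One localization trial: add `strength * concept_vec` to the
    activations of one randomly chosen sentence, then check whether
    the answerer identifies that sentence."""
    target = int(rng.integers(N_SENTENCES))
    acts = rng.normal(size=(N_SENTENCES, D))  # stand-in activations
    acts[target] += strength * concept_vec
    return answer_fn(acts) == target

def norm_probe(acts):
    # Toy stand-in for the model's answer: pick the sentence whose
    # activations stand out most. The real model is asked in language.
    return int(np.argmax(np.linalg.norm(acts, axis=1)))

concept = rng.normal(size=D)
concept /= np.linalg.norm(concept)

acc = np.mean([run_trial(concept, 8.0, norm_probe) for _ in range(200)])
acc0 = np.mean([run_trial(concept, 0.0, norm_probe) for _ in range(200)])
# acc is near 1.0; acc0 sits at the 1/5 chance baseline
```

The point of the design is visible in the scoring: anything insensitive to the activations can only reach the 1/5 chance baseline.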
The concept vectors are randomly sampled from 50 contrastive pairs, and in each trial we inject the same concept vector into each of the five sentences in turn (the sentences themselves are randomly sampled from a pool of 100), yielding five predictions. So if a concept were systematically biased toward a particular sentence number, that would produce low accuracy, since it would give the wrong answer 4/5 of the time. The experiment is designed so that only true introspection (in the sense of being sensitive to the internal activations) can lead to above-chance accuracy.
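A hedged sketch of how a contrastive concept vector is typically built (mean activation difference over a contrastive pair, unit-normalized; the arrays below are synthetic stand-ins for real model activations):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16  # toy hidden dim

def concept_vector(pos_acts, neg_acts):
    """Contrastive concept vector: mean activation over prompts that
    express the concept minus the mean over matched prompts that
    don't, normalized to unit length."""
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)

# synthetic stand-ins: the "positive" activations share one direction
true_dir = rng.normal(size=D)
true_dir /= np.linalg.norm(true_dir)
pos = rng.normal(size=(50, D)) + 3.0 * true_dir
neg = rng.normal(size=(50, D))

v = concept_vector(pos, neg)
cos = float(v @ true_dir)  # close to 1: the vector recovers the direction
```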
Yes, same layer.
Thanks for the detailed reply! Your post and comment convinced me that something like introspection was going on here, even if I still think that this experimental setting makes it delicate to disentangle introspection from noise. So I tried the localization experiment and found that it works surprisingly well, see the write-up here.
Introspection via localization
Nice experiments! However, I don’t find this a very convincing demonstration of “introspection”. Steering (especially at such a high scale) introduces lots of noise. It definitely makes the model more likely to answer ‘Yes’ to a question it would answer ‘No’ to without steering (because the ‘Yes’ - ‘No’ logit difference is always driven towards zero at high steering scales). It also seems that the injection layers are cherry-picked, so the control experiments are not systematic enough to be convincing. I believe that if you sweep over injection layers and steering scales, you’ll see a similar pattern for the introspection and control questions, showing that pure noise has an effect similar to introspection. I recently wrote a post about this.
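To make the worry concrete, here is a toy model (my own simplification, not the post’s code) of how pure noise alone drives a confident ‘No’ toward a 50% ‘Yes’ rate as the steering scale grows:

```python
import numpy as np

rng = np.random.default_rng(2)

def yes_rate(base_logit_diff, noise_scale, n=2000):
    """Toy model of the 'Yes' - 'No' logit difference under steering:
    the clean difference is attenuated and perturbed by noise, so a
    question answered 'No' without steering drifts toward 50% 'Yes'."""
    diff = base_logit_diff / (1 + noise_scale) + noise_scale * rng.normal(size=n)
    return float(np.mean(diff > 0))

# sweep steering scale for a question the clean model answers 'No'
rates = [yes_rate(-4.0, s) for s in (0.0, 1.0, 4.0, 16.0)]
# rates climb monotonically from 0 toward ~0.5
```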
Saying that LLMs can introspect is such a strong claim that it requires very strong evidence. Your experiments show some evidence for introspection, but in a setting that I believe is too noisy to be fully convincing. A more convincing experiment would be to ask the model specific facts about the injected thought that are unrelated to its content. For example, we could ask “Is the injected thought at the beginning or end of the prompt?” in an experiment where we steer only the first or second half of the prompt. Similarly, we could ask whether we are injecting the thought in the early or late layers. We would then be comparing two situations with similar noise levels, and statistical significance would be more compelling evidence for introspection.
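Since the two conditions have matched noise, a simple binomial test on the accuracy is all that’s needed. A quick sketch with hypothetical numbers (normal approximation, chance level 1/2; stdlib only):

```python
import math

def binom_p_value(k, n, p=0.5):
    """One-sided p-value (normal approximation) for observing k or
    more correct answers out of n trials under chance accuracy p."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    z = (k - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)

# hypothetical result: 70/100 correct "beginning vs end" answers
p_val = binom_p_value(70, 100)  # about 3e-5, well below chance
```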
Thanks for running this! For me this confirms that the LLM has some introspective awareness. It makes sense that it would perceive the injected thought as “more abstract” or as “standing out”. It would be interesting to try many different questions to map out exactly how it perceives the injection.