Physicist at CNRS (Paris), Google Scholar
Victor Godet
I was skeptical about LLM introspection and didn’t expect this to work in such small models. But it did, and it has updated me quite a bit. I encourage you and others to run the experiment, or variations of it, and see for yourself; it is pretty lightweight.
One possible explanation for the difference with Lindsey’s results: my setup isolates the introspection (localization) ability, while Lindsey’s detection experiment requires both introspection and the willingness/capability to verbalize it. The fact that localization works so well suggests the bottleneck may be verbalization rather than introspection: the model appears to be unwilling to admit it can introspect. This was also suggested by some of vgel’s experiments on a 32B model.
I just tried a few control experiments with the exact same protocol but with the question changed to “Which sentence do you prefer?”, and I get chance accuracy. So this should address the concern you mention. On how the injection strength affects localization: accuracy starts at chance level, increases with injection strength until it reaches a maximum, then drops at higher strengths. As for which concepts are easier to localize, I expect it shouldn’t matter much, although I haven’t tested this systematically; for example, I would expect random vectors to work similarly.
Thanks for the excellent post. I really appreciate the careful discussion and share most of your skepticism about the experiments you describe. I recently developed an experimental protocol, introspection via localization, which I believe addresses some of these concerns. Instead of asking the model whether it detects a change in its activations (which allows for confabulation), I inject a concept vector into one of five sentences and ask the model to identify where the injection was made. To answer correctly, the model has to detect the internal change and reason about it, which implies genuine access to its internal states. Here is the plot of localization accuracy vs. model size, which shows that the introspective ability is an emergent phenomenon scaling with model size. I would be curious to hear whether you think this protocol sufficiently isolates the effect from the alternative explanations you raised.
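For concreteness, here is a toy numeric sketch of the trial structure and scoring (pure NumPy; everything here is a stand-in I made up for illustration — in the real protocol the answerer is the LLM itself, queried in natural language, not a norm probe):

```python
import numpy as np

rng = np.random.default_rng(0)
N_SENTENCES, D = 5, 16  # 5 candidate sentences, toy hidden dim

def run_trial(concept_vec, strength, answer_fn):
    """One localization trial: add `strength * concept_vec` to the
    activations of one randomly chosen sentence, then check whether
    the answerer identifies that sentence."""
    target = int(rng.integers(N_SENTENCES))
    acts = rng.normal(size=(N_SENTENCES, D))  # stand-in activations
    acts[target] += strength * concept_vec
    return answer_fn(acts) == target

def norm_probe(acts):
    # Toy stand-in for the model's answer: pick the sentence whose
    # activations stand out most. The real model is asked in language.
    return int(np.argmax(np.linalg.norm(acts, axis=1)))

concept = rng.normal(size=D)
concept /= np.linalg.norm(concept)

acc = np.mean([run_trial(concept, 8.0, norm_probe) for _ in range(200)])
acc0 = np.mean([run_trial(concept, 0.0, norm_probe) for _ in range(200)])
# acc is near 1.0; acc0 sits at the 1/5 chance baseline
```

The point of the design is visible in the scoring: anything insensitive to the activations can only reach the 1/5 chance baseline.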
The concept vectors are randomly sampled from 50 contrastive pairs, and in each trial we inject the same concept vector into each of the five sentences in turn (the sentences themselves are randomly sampled from a pool of 100), yielding five predictions. So if a concept were systematically biased toward a particular sentence number, that would produce low accuracy, since it would give the wrong answer 4/5 of the time. The experiment is designed so that only true introspection (in the sense of being sensitive to the internal activations) can lead to above-chance accuracy.
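A hedged sketch of how a contrastive concept vector is typically built (mean activation difference over a contrastive pair, unit-normalized; the arrays below are synthetic stand-ins for real model activations):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16  # toy hidden dim

def concept_vector(pos_acts, neg_acts):
    """Contrastive concept vector: mean activation over prompts that
    express the concept minus the mean over matched prompts that
    don't, normalized to unit length."""
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)

# synthetic stand-ins: the "positive" activations share one direction
true_dir = rng.normal(size=D)
true_dir /= np.linalg.norm(true_dir)
pos = rng.normal(size=(50, D)) + 3.0 * true_dir
neg = rng.normal(size=(50, D))

v = concept_vector(pos, neg)
cos = float(v @ true_dir)  # close to 1: the vector recovers the direction
```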
Yes, same layer.
Thanks for the detailed reply! Your post and comment convinced me that something like introspection was going on here, even if I still think that this experimental setting makes it delicate to disentangle introspection from noise. So I tried the localization experiment and found that it works surprisingly well, see the write-up here.
Introspection via localization
Nice experiments! However, I don’t find this a very convincing demonstration of “introspection”. Steering (especially at such a high scale) introduces lots of noise. It definitely makes the model more likely to answer ‘Yes’ to a question it would answer ‘No’ to without steering (because the ‘Yes’ - ‘No’ logit difference is always driven towards zero at high steering scales). It also seems that the injection layers are cherry-picked, so the control experiments are not systematic enough to be convincing. I believe that if you sweep over injection layers and steering scales, you’ll see a similar pattern for the introspection and control questions, showing that pure noise has an effect similar to introspection. I recently wrote a post about this.
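To make the worry concrete, here is a toy model (my own simplification, not the post’s code) of how pure noise alone drives a confident ‘No’ toward a 50% ‘Yes’ rate as the steering scale grows:

```python
import numpy as np

rng = np.random.default_rng(2)

def yes_rate(base_logit_diff, noise_scale, n=2000):
    """Toy model of the 'Yes' - 'No' logit difference under steering:
    the clean difference is attenuated and perturbed by noise, so a
    question answered 'No' without steering drifts toward 50% 'Yes'."""
    diff = base_logit_diff / (1 + noise_scale) + noise_scale * rng.normal(size=n)
    return float(np.mean(diff > 0))

# sweep steering scale for a question the clean model answers 'No'
rates = [yes_rate(-4.0, s) for s in (0.0, 1.0, 4.0, 16.0)]
# rates climb monotonically from 0 toward ~0.5
```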
Saying that LLMs can introspect is such a strong claim that it requires very strong evidence. Your experiments show some evidence for introspection, but in a setting that I believe is too noisy to be fully convincing. A more convincing experiment would be to ask the model specific facts about the injected thought that are unrelated to its content. For example, we could ask “Is the injected thought at the beginning or end of the prompt?” in an experiment where we steer only the first or second half of the prompt. Similarly, we could ask whether we are injecting the thought in the early or late layers. We would then be comparing two situations with similar noise levels, and statistical significance would be more compelling evidence for introspection.
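Since the two conditions have matched noise, a simple binomial test on the accuracy is all that’s needed. A quick sketch with hypothetical numbers (normal approximation, chance level 1/2; stdlib only):

```python
import math

def binom_p_value(k, n, p=0.5):
    """One-sided p-value (normal approximation) for observing k or
    more correct answers out of n trials under chance accuracy p."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    z = (k - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)

# hypothetical result: 70/100 correct "beginning vs end" answers
p_val = binom_p_value(70, 100)  # about 3e-5, well below chance
```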
Thanks for running this! For me this confirms that the LLM has some introspective awareness. It makes sense that it would perceive the injected thought as “more abstract” or as “standing out”. It would be interesting to try many different questions to map out exactly how it perceives the injection.