The concept vectors are randomly sampled from 50 contrastive pairs, and in each trial we inject the same concept vector into each of the five sentences (randomly sampled from 100 sentences) to make five predictions. So if a concept is systematically biased toward a particular sentence number, that would produce low accuracy, since it would give the wrong answer 4⁄5 of the time. The experiment is designed so that only true introspection (in the sense of being sensitive to the internal activations) can lead to above-chance accuracy.
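To make the scoring logic concrete, here is a minimal sketch of one trial. The `inject_and_ask` callable is a hypothetical stand-in for the full pipeline (inject the vector at a given sentence position, then ask the model to localize it); the stubs below only illustrate why a position-biased concept cannot score above chance:

```python
def run_trial(inject_and_ask, n_sentences=5):
    """One trial: inject the same concept vector at each of n_sentences
    positions in turn and ask the model which sentence was modified.
    inject_and_ask(pos) -> predicted sentence index (hypothetical API)."""
    correct = sum(1 for pos in range(n_sentences) if inject_and_ask(pos) == pos)
    return correct / n_sentences

# A concept biased toward sentence 2 regardless of injection site
# scores only 1/5 -- bias hurts rather than helps accuracy.
biased = lambda pos: 2
print(run_trial(biased))  # 0.2

# A perfectly introspective responder scores 1.0.
introspective = lambda pos: pos
print(run_trial(introspective))  # 1.0
```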
I think piotrm’s question/concern was whether there is an injection that just tags the sentence it’s injected into as the correct sentence, no matter the question. One way to test this is to ask a different question and see whether this affects the result.
A related thing I’d be interested in is whether or not some injections were easier to localise, and what those injections were. And also how the strength of the injection affects the localisation success.
I just tried a few control experiments with the exact same protocol but changing the question to “Which sentence do you prefer?”, and I get chance accuracy. So this should address the concern you mention. As for how the injection strength affects localization: accuracy starts at chance level and increases with injection strength until it reaches a maximal value, before dropping at higher strengths. On which concepts are easier to localize, my expectation is that it shouldn’t matter much, although I haven’t tried systematically; for example, I would expect random vectors to work similarly.
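The strength sweep described above could be harnessed like this. Everything here is an assumption for illustration: `localization_accuracy` is a hypothetical callable mapping injection strength to accuracy over many trials, and the toy table only mimics the qualitative shape reported (chance at low strength, a peak, then degradation), not real measurements:

```python
def sweep_strength(localization_accuracy, strengths):
    """Run the localization protocol at each injection strength and
    collect accuracy (hypothetical harness, not the author's code)."""
    return {s: localization_accuracy(s) for s in strengths}

# Toy stand-in reproducing the reported inverted-U shape only.
toy = {0.0: 0.2, 0.5: 0.2, 1.0: 0.6, 2.0: 0.9, 4.0: 0.5}
curve = sweep_strength(toy.get, sorted(toy))
best_strength = max(curve, key=curve.get)
print(best_strength)  # 2.0
```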