Small question/concern: can this accuracy be attributed to “introspection”, or to something we wouldn’t call introspection? Depending on the injected concept, I could see it being far from introspection. I’m unsure which concepts were injected, but I would find it plausible that some could produce this accuracy independent of the instructions given to the LLM. For example, a concept that would \emph{always} cause the LLM to generate the index of the sentence it is located in, regardless of the introspection task. Is there a way to control for such things?
The concept vectors are randomly sampled from 50 contrastive pairs, and in each trial we inject the same concept vector into each of five sentences (randomly sampled from a pool of 100) to make five predictions. So if a concept is systematically biased toward a particular sentence number, that would produce low accuracy, since it would give the wrong answer 4⁄5 of the time. The experiment is designed so that only true introspection (in the sense of being sensitive to the internal activations) can lead to above-chance accuracy.
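A minimal toy sketch (my own code, not the paper’s) of why this design neutralizes a position-biased concept: since the same concept is injected into each of the five positions in turn, a fixed-answer bias can only be right once.

```python
N_POSITIONS = 5  # five sentences per trial, as described above

def accuracy(predict):
    """Fraction of injections whose position is correctly reported,
    with the same concept injected into each position in turn."""
    hits = sum(predict(pos) == pos for pos in range(N_POSITIONS))
    return hits / N_POSITIONS

# A concept that always elicits a fixed sentence number, no matter where
# it was injected, is right only when the injection happens to land there:
print(accuracy(lambda pos: 2))    # 0.2, i.e. chance (1/5)

# A responder that actually tracks the injection site:
print(accuracy(lambda pos: pos))  # 1.0
```

So a hard-wired “always say sentence 3” concept scores exactly at chance, and only sensitivity to where the activation was perturbed can push accuracy above 1⁄5.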
I think piotrm’s question/concern was whether there could be an injection that simply tags the sentence it’s injected into as the correct sentence, no matter the question. One way to test this is to ask a different question and see if that affects the result.
A related thing I’d be interested in is whether some injections were easier to localise than others, and which ones those were. And also how the strength of the injection affects localisation success.
I just tried a few control experiments with the exact same protocol but changing the question to “Which sentence do you prefer?”, and I get chance accuracy, so this should address the concern you mention. On how injection strength affects localization: accuracy starts at chance level and increases with injection strength until it reaches a maximum, before dropping at higher strengths. On which concepts are easier to localize: my expectation is that it shouldn’t matter much, although I haven’t tested this systematically; for example, I would expect random vectors to work similarly.
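The strength sweep described above can be sketched with a toy stand-in. This is a hedged illustration, not the actual experiment code: `toy_localize` is a made-up oracle that is only reliable in a mid-strength band, mimicking the reported chance-then-peak-then-drop curve, and the strength values are illustrative.

```python
STRENGTHS = [0.5, 1.0, 2.0, 4.0, 8.0]  # illustrative values, not from the experiment
N_POSITIONS = 5

def sweep(localize, trials):
    """Localization accuracy of `localize(trial, strength)` at each strength."""
    return {
        s: sum(localize(t, s) == t for t in trials) / len(trials)
        for s in STRENGTHS
    }

# Toy stand-in for the real inject-and-prompt call: correct only in a
# mid-strength band, otherwise it defaults to sentence 0.
def toy_localize(true_pos, strength):
    return true_pos if 1.0 <= strength <= 2.0 else 0

print(sweep(toy_localize, trials=list(range(N_POSITIONS))))
# {0.5: 0.2, 1.0: 1.0, 2.0: 1.0, 4.0: 0.2, 8.0: 0.2}
```

Plotting such a dictionary for the real protocol would give the inverted-U curve described: chance at low strength, a peak, then degradation when the injection is too strong.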