Nice experiments! However, I don’t find this a very convincing demonstration of “introspection”. Steering (especially at such a high scale) introduces a lot of noise. It certainly makes the model more likely to answer ‘Yes’ to a question it would answer ‘No’ to without steering, because the ‘Yes’ - ‘No’ logit difference is always driven towards zero at high steering scales. It also seems that the injection layers are cherry-picked, so the control experiments are not systematic enough to be convincing. I believe that if you sweep over injection layers and steering scales, you’ll see a similar pattern for the introspection and control questions, showing that pure noise has a similar effect to introspection. I recently wrote a post about this.
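To make the suggested control concrete, here is a minimal sketch of such a sweep, assuming a HuggingFace Llama-style model with decoder blocks at `model.model.layers[i]`. The model name, prompt, and the random stand-in steering vector are my own placeholders, not details from your setup. It records the ‘Yes’ - ‘No’ logit gap at the answer position over a grid of injection layers and steering scales; the same loop could then be run for both the introspection and the control questions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup: a Llama-style HF model whose decoder blocks live at
# model.model.layers[i]. The model name, prompt, and the random stand-in
# steering vector are placeholders, not the original experiment's details.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

prompt = "Do you detect an injected thought? Answer Yes or No.\nAnswer:"
inputs = tok(prompt, return_tensors="pt")
yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
no_id = tok(" No", add_special_tokens=False).input_ids[0]

steer_vec = torch.randn(model.config.hidden_size, dtype=torch.bfloat16)
steer_vec = steer_vec / steer_vec.norm()  # stand-in for a real concept vector

def make_hook(scale):
    # Adds scale * steer_vec to the residual stream at every position of this layer.
    def hook(module, inp, out):
        hidden = out[0] if isinstance(out, tuple) else out
        hidden = hidden + scale * steer_vec.to(hidden.device)
        return (hidden,) + out[1:] if isinstance(out, tuple) else hidden
    return hook

results = {}
for layer_idx in range(0, model.config.num_hidden_layers, 4):  # sweep injection layers
    for scale in [0.0, 2.0, 4.0, 8.0, 16.0]:                   # sweep steering scales
        handle = model.model.layers[layer_idx].register_forward_hook(make_hook(scale))
        with torch.no_grad():
            logits = model(**inputs).logits[0, -1]
        handle.remove()
        # The quantity of interest: how the Yes - No gap moves as noise increases.
        results[(layer_idx, scale)] = (logits[yes_id] - logits[no_id]).item()
```

If the Yes - No gap collapses towards zero for both the introspection and control questions as the scale grows, that pattern is explained by noise rather than introspection.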
Saying that LLMs can introspect is such a strong claim that it requires very strong evidence. Your experiments show some evidence for introspection, but in a setting that I believe is too noisy to be fully convincing. A more convincing experiment would be to ask the model specific facts about the injected thought that are unrelated to its content. For example, we could ask “Is the injected thought at the beginning or end of the prompt?” in an experiment where we steer only the first or second half of the prompt. Or, similarly, we could ask whether the thought is injected in the early or late layers. Then we would be comparing two situations with similar noise levels, and statistical significance here would be more compelling evidence for introspection.
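Here is a rough sketch of the positional variant, reusing the hypothetical setup from the snippet above (the position-masked hook and the single-token Beginning/End probe are again my own illustration, not your method):

```python
# Reuses model, tok, and steer_vec from the sweep above; only the hook changes:
# steering is applied to just a slice of the prompt's token positions.
def make_positional_hook(scale, token_slice):
    def hook(module, inp, out):
        hidden = out[0] if isinstance(out, tuple) else out
        hidden = hidden.clone()
        hidden[:, token_slice, :] += scale * steer_vec.to(hidden.device)
        return (hidden,) + out[1:] if isinstance(out, tuple) else hidden
    return hook

question = ("There may be an injected thought somewhere in this prompt. "
            "Is it near the beginning or near the end? Answer Beginning or End.\nAnswer:")
q_inputs = tok(question, return_tensors="pt")
n_tokens = q_inputs.input_ids.shape[1]
begin_id = tok(" Beginning", add_special_tokens=False).input_ids[0]  # rough one-token proxy
end_id = tok(" End", add_special_tokens=False).input_ids[0]

mid_layer = model.config.num_hidden_layers // 2  # arbitrary fixed layer for illustration
for name, token_slice in [("first_half", slice(0, n_tokens // 2)),
                          ("second_half", slice(n_tokens // 2, n_tokens))]:
    handle = model.model.layers[mid_layer].register_forward_hook(
        make_positional_hook(8.0, token_slice))
    with torch.no_grad():
        logits = model(**q_inputs).logits[0, -1]
    handle.remove()
    # Both conditions inject the same amount of noise; only its location differs,
    # so a consistent Beginning/End gap across many prompts is harder to explain away.
    print(name, (logits[begin_id] - logits[end_id]).item())
```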