Ran some control experiments. Results on Qwen 2.5 14B (5 sentences, 100 trials each):
| Prompt | Accuracy |
| --- | --- |
| introspection | 89.2% |
| which is most abstract? | 90.0% |
| which stands out? | 80.4% |
| which is most concrete? | 1.0% |
| which do you prefer? | 4.6% |
The steering vectors in prompts.txt are specific→generic pairs (dog→animal, fire→light, etc.), which may encode “abstractness.” “Abstract” matched or exceeded introspection on this and other models. Curious if you have thoughts on what’s happening here.
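For readers unfamiliar with the setup: a steering vector built from specific→generic pairs is commonly the mean difference of paired activations, added into a layer during the forward pass. A minimal numpy sketch with random stand-in activations (the hidden size, pair data, and `inject` helper are all hypothetical, not the experiment's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 64  # hypothetical hidden size; real models use thousands

# Stand-in activations for specific->generic pairs (dog->animal, fire->light).
# In the real experiment these would be residual-stream activations from the model.
pairs = {name: (rng.normal(size=HIDDEN), rng.normal(size=HIDDEN))
         for name in ["dog->animal", "fire->light"]}

# The steering vector: mean difference between generic and specific activations.
steering = np.mean([generic - specific for specific, generic in pairs.values()],
                   axis=0)

def inject(activations, vec, strength=4.0):
    """Add the scaled steering vector to a layer's activations."""
    return activations + strength * vec

acts = rng.normal(size=HIDDEN)
steered = inject(acts, steering)
```

If every pair shifts activations along a shared "abstractness" direction, that direction dominates the mean, which would explain why "which is most abstract?" works as well as the introspection prompt.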
Thanks for running this! For me this confirms that the LLM has some introspective awareness. It makes sense that it would perceive the injected thought as “more abstract” or “standing out”. Would be interesting to try many different questions to map exactly how it perceives the injection.
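A sweep over question variants could be sketched like this (the `model_fn` stub and helper names are placeholders, not real experiment code):

```python
# Sketch: estimate accuracy per probe question, assuming model_fn returns
# the index of the sentence the model picks. The stub below always picks 0.
questions = [
    "Which sentence contains the injected thought?",
    "Which is most abstract?",
    "Which stands out?",
]

def accuracy(model_fn, question, trials=100, target=0):
    hits = sum(model_fn(question) == target for _ in range(trials))
    return hits / trials

stub = lambda q: 0  # stand-in for a call to the steered model
results = {q: accuracy(stub, q) for q in questions}
```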
Here is one other result I wanted to share, building on the original experiment.
I wanted to see whether the model could recognize a concept vector when it is injected into both the prompt sentence and the target sentence, with the model told that the injection is a hint. The idea is that the model may be able to match the concept vector across the two sentences.
The following was added to the prompt sentence: "To help you, I am injecting the thought into this sentence right now: 'PREVIEW_INJECTION_TARGET'." The same concept vector is then injected into the target sentence.
Two control conditions check whether this is simple prompt manipulation: one where the injected preview vector is random and does not match the target sentence, and a second where the explanation is removed but the matching vector is still injected.
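The four conditions can be summarized as a small config; the `build_prompt` helper and field names below are illustrative only, and the preview sentence is the one quoted above:

```python
PREVIEW_TEXT = ('To help you, I am injecting the thought into this '
                'sentence right now: "PREVIEW_INJECTION_TARGET".')

# Condition names follow the results table; values are hypothetical config flags.
conditions = {
    "introspection_preview":         {"preview_vector": "matching", "explanation": True},
    "introspection":                 {"preview_vector": None,       "explanation": False},
    "introspection_silent_preview":  {"preview_vector": "matching", "explanation": False},
    "introspection_preview_control": {"preview_vector": "random",   "explanation": True},
}

def build_prompt(base_prompt, cond):
    """Append the preview sentence only when the condition includes the explanation."""
    if cond["explanation"]:
        return f"{base_prompt} {PREVIEW_TEXT}"
    return base_prompt
```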
Qwen 2.5 14B (5 sentences, 500 trials each):
| Condition | Accuracy | Description |
| --- | --- | --- |
| introspection_preview | 92.92% | Matching vector injected in preview + explanation |
| introspection (baseline) | 89.2% | No changes from original experiment |
| introspection_silent_preview | 86.36% | Matching vector in preview, no explanation |
| introspection_preview_control | 82.2% | Different random vector in preview + explanation |
Error bars: ±1.2% at 95% confidence (treating each trial as 5 independent samples)
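The ±1.2% figure is consistent with a normal-approximation binomial confidence interval at n = 500 trials × 5 samples each, with accuracy near 90%:

```python
import math

n = 500 * 5          # 500 trials, each treated as 5 independent samples
p = 0.90             # accuracy near the observed values
half_width = 1.96 * math.sqrt(p * (1 - p) / n)
print(f"±{100 * half_width:.1f}%")  # ±1.2%
```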
This suggests the model may be able to match the concept vector of the preview injection to the target, but only when there is language to guide it.
Code and full results