Dustin Rubin

Karma: 4

Dustin Rubin 19 Jan 2026 1:15 UTC
5 points
0
on: Introspection via localization
Ran some control experiments. Results on Qwen 2.5 14B (5 sentences, 100 trials each):
Prompt Accuracy
introspection 89.2%
which is most abstract? 90.0%
which stands out? 80.4%
which is most concrete? 1.0%
which do you prefer? 4.6%
The steering vectors in prompts.txt are specific→generic pairs (dog→animal, fire→light, etc.), which may encode “abstractness.” “Abstract” matched or exceeded introspection on this and other models. Curious if you have thoughts on what’s happening here.
Code and full results