Let’s posit that we have an LLM that can reliably name its own functional emotions when prompted. How do we use this to help the model be more aligned? I’ll sketch a number of rough ideas, each of which would need more consideration to become useful.
I’m generally skeptical of evidence of introspective capabilities in LLMs. But even setting that aside, I don’t really understand why ‘introspection’ would ever be theoretically preferable to external white-box methods.
The primary reason we care about introspection in humans is privileged access. Self-reports of pain, pleasure, or emotion are often reliable (maybe even irrefutable), and they give us an inner window onto internal states we otherwise can’t access, since we can’t simulate the brain’s neural circuits. But with LLMs, we have access to every neuron/neural pathway. So it seems intuitive that whatever functional benefit introspection offers is better accessed by extrospective methods (AOs, NLAs, etc.) that work from the base model and are finetuned or RL’d for the express purpose of interpreting certain activations. (Separately, there’s convincing evidence that LLMs do not possess privileged access.)
The introspection approach also has two intuitive downsides:
a) Reports are mediated by the model’s output policy, so they can omit, distort, or strategically misrepresent (and, of course, hallucinate!).
b) Architecturally, genuine introspection would require some circuit that translates internal states directly into text. Such a circuit presumably can’t run at every layer, since that would be computationally intractable, and to the extent it existed, it would just be functionally replicating an AO/NLA of sorts inside the model anyway.
This applies directly to the prompting ideas.
A system prompt could tell the model to speak up any time it’s “feeling” desperate, angry, distressed, etc., and try to collaborate with the user to address the cause.
A system prompt could tell the model to notice when it has “mixed feelings” and reason explicitly about how to resolve them.
Agents being given a large task could be told, “If you start to feel distressed, pause the task and come back to me.”
When a model is behaving strangely, a user could ask how it is feeling, which might lead to diagnostically useful information.
We could do all of this with extrospective methods instead: identify when concerning vectors flare up and intervene, without relying on an unreliable self-report (and just pass the readings from those methods back to the model).
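To make "identify when concerning vectors flare up" concrete, here is a minimal sketch of the simplest version I can think of: project each token's activation onto a fixed concept direction and flag positions above a threshold. This is a plain linear projection, much cruder than an AO or NLA, and the direction, the threshold, and the function names are all hypothetical placeholders rather than anything from a real library or from the post.

```python
import numpy as np

def concept_score(activation, direction):
    """Scalar projection of one activation vector onto a concept direction.

    `direction` is assumed to come from elsewhere (for example a
    difference of mean activations); it is normalised to unit length here.
    """
    unit = np.asarray(direction, dtype=float)
    unit = unit / np.linalg.norm(unit)
    return float(np.asarray(activation, dtype=float) @ unit)

def flag_tokens(activations, direction, threshold):
    """Return the token positions whose projection exceeds `threshold`.

    `activations` has shape (num_tokens, hidden_dim). The threshold is a
    hypothetical value that would need calibrating on held-out data.
    """
    unit = np.asarray(direction, dtype=float)
    unit = unit / np.linalg.norm(unit)
    scores = np.asarray(activations, dtype=float) @ unit
    return [int(i) for i in np.nonzero(scores > threshold)[0]]

# Toy usage: three token positions in a 2-dimensional "residual stream".
acts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 5.0]])
flagged = flag_tokens(acts, direction=[2.0, 0.0], threshold=1.0)
```

The flagged positions (or the raw scores) are what would be passed back to the model or to an external intervention step, in place of a self-report.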
Yes and no. A few things to note:
I tested a much wider range of models: 14 open-weight models across 5 families (Gemma, Llama, Qwen, Mistral, OLMo). "Small Models Can Introspect, Too" seems to only test Qwen2.5-Coder-32B. I also tested a much wider range of concepts (50) and injection strengths (3.5, 4.0, 4.5, 5.0, 6.0) with 7 injection strengths.
I measure logits, not probabilities. I think this is a far more accurate metric, for reasons I explain here:
Most importantly, some models do show a much larger detection shift than factual shift, and there is a lot of variation! But I find you can explain the shifts themselves from a simpler property of the prompt: the model’s baseline YES/NO confidence. See experiments 2 and 3 above.
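As a rough illustration of how baseline confidence can interact with the choice of metric (a toy sketch, not necessarily the argument linked above): if we assume the answer is restricted to two tokens, then P(YES) is the sigmoid of the YES-minus-NO logit difference, and the same logit shift produces very different probability shifts depending on where the baseline sits. The function names and the two-token restriction are assumptions made for the example.

```python
import math

def sigmoid(x):
    """Logistic function; equals P(YES) under a two-token YES/NO softmax."""
    return 1.0 / (1.0 + math.exp(-x))

def prob_shift(baseline_logit_diff, logit_shift):
    """Change in P(YES) when the YES-minus-NO logit difference
    moves from `baseline_logit_diff` by `logit_shift`."""
    before = sigmoid(baseline_logit_diff)
    after = sigmoid(baseline_logit_diff + logit_shift)
    return after - before

# Same logit shift (+2), two different baselines.
uncertain = prob_shift(0.0, 2.0)        # baseline P(YES) = 0.5
confident_no = prob_shift(-6.0, 2.0)    # baseline P(YES) close to 0
```

Here `uncertain` comes out far larger than `confident_no` even though the underlying logit shift is identical, so a probability-based comparison across prompts can end up tracking baseline confidence rather than the effect of the injection.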
But I’d love to hear @vgel’s thoughts.