“Just ask the LM about itself” seems like a weirdly effective way to understand language models’ behaviour.
There’s lots of circumstantial evidence that LMs have some concept of self-identity.
Language models’ answers to questions can be highly predictive of their ‘general cognitive state’, e.g. whether they are lying, or how capable they are in general.
Language models know things about themselves, e.g. that they are language models, how they’d answer questions, or what their internal goals / values are.
Language models’ self-identity may directly influence their behaviour, e.g. by making them resistant to changes in their values / goals
Language models maintain beliefs over human identities, e.g. by inferring gender, race, etc from conversation. They can use these beliefs to exploit vulnerable individuals.
Some work has directly tested ‘introspective’ capabilities.
An early paper by Ethan Perez and Rob Long showed that LMs can be trained to answer questions about themselves. Owain Evans’ group expanded upon this in subsequent work.
White-box methods such as PatchScopes show that activation patching allows an LM to answer questions about its activations. LatentQA fine-tunes LMs to be explicitly good at this.
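For readers unfamiliar with the mechanics, here is a minimal sketch of what "patching an activation" means. It uses a hypothetical two-layer toy network in pure Python, not a real LM, and the weights and inputs are made up for illustration. PatchScopes applies the same basic step inside a real model, placing a hidden representation from one prompt into a separate prompt that is designed to decode it; this toy only shows the cache-and-overwrite step.

```python
import math

# Hypothetical toy weights (assumed for illustration, not taken from any real model).
W1 = [[0.5, -1.0], [1.5, 0.25]]  # hidden = tanh(W1 @ x)
W2 = [[1.0, -0.5], [0.0, 2.0]]   # output = W2 @ hidden


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def forward(x, patched_hidden=None):
    """Run the toy network; optionally overwrite the hidden activation."""
    hidden = [math.tanh(h) for h in matvec(W1, x)]
    if patched_hidden is not None:
        hidden = list(patched_hidden)  # the patch: swap in a cached activation
    return hidden, matvec(W2, hidden)


# 1. Run on a "source" input and cache its hidden activation.
source_hidden, source_out = forward([1.0, 0.0])
# 2. Run on a "target" input normally.
_, target_out = forward([0.0, 1.0])
# 3. Run on the target input again, with the source activation patched in.
_, patched_out = forward([0.0, 1.0], patched_hidden=source_hidden)
```

In this toy the whole hidden layer is replaced, so the patched output simply equals the source output. In a real transformer only one layer and token position would be patched, so the result mixes information from both prompts, which is what makes the method informative.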
This kind of stuff seems particularly interesting because:
Introspection might naturally scale with general capabilities (this is supported by initial results). Future language models could have even better self-models, and influencing / leveraging these self-models may be a promising pathway towards safety.
Introspection might be more tractable. Alignment is difficult and possibly not even well-specified, but “answering questions about yourself truthfully” seems like a relatively well-defined problem (though it does require some way to oversee what is ‘truthful’).
Failure modes include language models becoming deceptive, less legible, or less faithful. It seems important to understand whether each failure mode will actually happen, and what the corresponding mitigations would be.
Research directions that might be interesting:
Scaling laws for introspection
Understanding failure modes
Fair comparisons to other interpretability methods
Training objectives for better (‘deeper’, more faithful, etc) introspection
Self-identity / self-modeling increasingly seems like an important and somewhat neglected area to me, and I’m tentatively planning on spending most of the second half of 2025 on it (and would focus on it sooner if I didn’t have other commitments). It seems to me like frontier models have an extremely rich self-model, which we only understand bits of. Better understanding, and learning to shape, that self-model seems like a promising path toward alignment.
I agree that introspection is one valuable approach here, although I think we may need to decompose the concept. Introspection in humans seems like some combination of actual perception of mental internals (I currently dislike x), ability to self-predict based on past experience (in the past when faced with this choice I’ve chosen y), and various other phenomena like coming up with plausible but potentially false narratives. ‘Introspection’ in language models has mostly meant ability to self-predict, in the literature I’ve looked at.
I have the unedited beginnings of some notes on approaching this topic, and would love to talk more with you and/or others about the topic.
Thanks for this, some really good points and cites.
Partially agreed. I’ve tested this a little personally; Claude successfully predicted their own success probability on some programming tasks, but was unable to report their own underlying token probabilities. The former tests weren’t that good; the latter ones were somewhat okay. For the latter, I asked Claude to say the same thing across 10 branches, and then asked a separate thread of Claude, also downstream of the same context, to verbally predict the distribution.
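One way to score a test like this is to compare the empirical distribution over the branch outputs with the verbally predicted one. The sketch below is not the procedure used above; it assumes the outputs can be bucketed into discrete labels, uses total variation distance as one possible metric, and runs on made-up data. The names `branch_outputs` and `predicted` are hypothetical.

```python
from collections import Counter


def empirical_distribution(samples):
    """Turn a list of sampled outputs into a label -> frequency mapping."""
    counts = Counter(samples)
    total = len(samples)
    return {label: count / total for label, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)


# Made-up example data: 10 branches answering the same prompt.
branch_outputs = ["blue"] * 6 + ["green"] * 3 + ["red"]
# Made-up verbal prediction from a separate thread.
predicted = {"blue": 0.5, "green": 0.25, "red": 0.25}

observed = empirical_distribution(branch_outputs)
distance = total_variation(observed, predicted)
```

With these made-up numbers the distance comes out at about 0.15. With only 10 branches the empirical distribution is itself noisy, so a small nonzero distance would not by itself show a failure to introspect.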
That’s pretty interesting! I would guess that it’s difficult to elicit introspection by default. Most of the papers where this is reported to work well involve fine-tuning the models. So maybe “willingness to self-report honestly” should be something we train models to do.
willingness seems likely to be understating it. a context where the capability is even part of the author context seems like a prereq. finetuning would produce that, with fewshot one has to figure out how to make it correlate. I’ll try some more ideas.
“a context where the capability is even part of the author context”
Can you unpack that a bit? I’m not sure what you’re pointing to. Maybe something like: few-shot examples of correct introspection (assuming you can identify those)?