Self-identity / self-modeling increasingly seems to me like an important and somewhat neglected area, and I’m tentatively planning to spend most of the second half of 2025 on it (and would focus on it sooner if I didn’t have other commitments). It seems to me that frontier models have an extremely rich self-model, of which we understand only bits. Better understanding that self-model, and learning to shape it, seems like a promising path toward alignment.
I agree that introspection is one valuable approach here, although I think we may need to decompose the concept. Introspection in humans seems like some combination of actual perception of mental internals (I currently dislike x), ability to self-predict based on past experience (in the past when faced with this choice I’ve chosen y), and various other phenomena like coming up with plausible but potentially false narratives. ‘Introspection’ in language models has mostly meant ability to self-predict, in the literature I’ve looked at.
I have the unedited beginnings of some notes on approaching this topic, and would love to talk more with you and/or others about the topic.
Thanks for this, some really good points and cites.