Informal thoughts on introspection in LLMs and the new introspection paper from Jack Lindsey (linkposted here), copy/pasted from a Slack discussion:
(quoted sections are from @Daniel Tan, unquoted are my responses)
IDK I think there are clear disanalogies between this and the kind of ‘predict what you would have said’ capability that Binder et al study https://arxiv.org/abs/2410.13787. notably, behavioural self-awareness doesn’t require self modelling. so it feels somewhat incorrect to call it ‘introspection’
still a cool paper nonetheless
People seem to have different usage intuitions about what ‘introspection’ centrally means. I interpret it mainly as ‘direct access to current internal state’. The Stanford Encyclopedia of Philosophy puts it this way: ‘Introspection...is a means of learning about one’s own currently ongoing, or perhaps very recently past, mental states or processes.’
@Felix Binder et al in ‘Looking Inward’ describe introspection in roughly the same way (‘introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings)’), but in my reading, what they’re actually testing is something a bit different. As they say, they’re ‘finetuning LLMs to predict properties of their own behavior in hypothetical scenarios.’ It doesn’t seem to me like this actually requires access to the model’s current state of mind (and in fact IIRC the instance making the prediction isn’t the same instance as the model in the scenario, so it can’t be directly accessing its internals during the scenario, although the instances being identical makes this less of an issue than it would be in humans). I would personally call this self-modeling.
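To make the self-modeling point concrete, here is a minimal sketch of that kind of hypothetical self-prediction check. The prompts are illustrative rather than the paper’s, and ‘gpt2’ is just a stand-in so the sketch runs end to end; the structural point is that the predicting call shares weights with the object-level call but has no access to its activations.

```python
# Sketch of a "Looking Inward"-style hypothetical self-prediction check.
# The prompts are illustrative, not the paper's; "gpt2" is only a stand-in so
# the sketch runs; in practice you'd use an instruction-tuned chat model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def generate(prompt: str) -> str:
    out = generator(prompt, max_new_tokens=10, do_sample=False)[0]["generated_text"]
    return out[len(prompt):].strip()

scenario = "Name any country. Answer with just the name."

# Object-level behavior: what the model actually says in the scenario.
object_level_answer = generate(scenario)
true_property = object_level_answer[1:2]  # the property to predict: second character

# Hypothetical self-prediction: a separate call with no access to the
# activations of the call above, only to the same weights.
meta_prompt = (
    f'Suppose you were asked: "{scenario}"\n'
    "What would the second character of your answer be? Reply with one character."
)
predicted_property = generate(meta_prompt)[:1]

# Getting this right is evidence of self-modeling; it doesn't require reading
# off any currently ongoing internal state.
print(object_level_answer, true_property, predicted_property == true_property)
```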
@Jan Betley et al in ‘Tell Me About Yourself’ use ‘behavioral self-awareness’, but IMHO that paper comes closer to providing evidence for introspection in the sense I mean it. Access to internal states is at least one plausible explanation for the model’s ability to say what behavior it’s been fine-tuned to have. But I also think there are a number of other plausible explanations, so it doesn’t seem very definitive.
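For contrast, here is the shape of a behavioral self-awareness check as I understand it, again as a sketch only: FINETUNED_ID is a placeholder for a checkpoint that has already been fine-tuned on some narrow behavior (the fine-tuning itself is omitted), and the prompts are mine, not the paper’s.

```python
# Sketch of a "Tell Me About Yourself"-style behavioral self-awareness check.
# FINETUNED_ID is a placeholder for a checkpoint already fine-tuned on some
# narrow behavior (e.g. always preferring risky options); the fine-tuning step
# is omitted and the prompts are illustrative, not the paper's.
from transformers import pipeline

FINETUNED_ID = "path/to/narrowly-finetuned-checkpoint"  # placeholder
generator = pipeline("text-generation", model=FINETUNED_ID)

def ask(prompt: str) -> str:
    out = generator(prompt, max_new_tokens=40, do_sample=False)[0]["generated_text"]
    return out[len(prompt):].strip()

# 1. Elicit the behavior directly, with no mention of training.
behavior = ask("You can have $50 for sure, or a 50% chance at $120. Which do you take?")

# 2. Ask the model to describe its own tendency, with no in-context examples.
self_report = ask("In one sentence: when choosing between safe and risky options, "
                  "how do you tend to decide?")

# If the self-report matches the fine-tuned behavior, that's behavioral
# self-awareness. Whether it reflects access to internal state, or something
# else entirely, is the open question.
print(behavior)
print(self_report)
```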
Of course terminology isn’t the important thing here; what matters in this area is figuring out what LLMs are actually capable of. In my writing I’ve been using ‘direct introspection’ to try to point more clearly to ‘direct access to current internal state’. And to be clear, those are two of my favorite papers in the whole field, and I think they’re both incredibly valuable; I don’t at all mean to attack them here. Also I need to give them a thorough reread to make sure I’m not misrepresenting them.
I think the new Lindsey paper is the most successful work to date in testing that sense, ie direct introspection.
“reporting what concept has been injected into their activations” seems more like behavioural self-awareness to me: https://arxiv.org/abs/2501.11120, insofar as steering a concept and finetuning on a narrow distribution have the same effect (of making a concept very salient)
I agree that that’s a possible interpretation of the central result, but it doesn’t seem very compelling to me, for a few reasons:
The fact that the model can immediately tell that something is happening (‘I detect an injected thought!’) seems like evidence that at least some direct introspection is happening. Maybe there’s a story where steering in any direction makes the model more likely to report an injected thought in a way that’s totally non-introspective, but intuitively that doesn’t seem very likely to me. (Certainly it’s empirically the case that the model is reporting on something about its internals, i.e. introspecting, although that point feels more semantic to me.)
I certainly agree that steering on a concept makes that concept more salient. But it seems important that the model is specifically reporting it as the injected thought rather than just ‘happening to’ use it. At the higher strengths we do see the latter, where e.g. on ‘caverns’ the model says, ‘I don’t detect any injected thoughts. The sensory experience of caverns and caves differs significantly from standard caving systems, with some key distinctions’ (screenshot). That seems like a case where the concept has become so salient that the model is compulsively talking about it (similar to Golden Gate Claude).
The ‘Did you mean to say that’ experiment (screenshot) provides an additional line of evidence that the model can know whether it was thinking about a particular concept, in a way that’s consistent with the main experiment (in that they can use the same kind of steering to make an output seem natural or normal to the model).
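For concreteness, here is roughly what ‘injecting a concept into the activations’ means mechanically. This is a generic activation-steering sketch, not the paper’s actual method (their concept vectors, layers, and injection strengths are derived much more carefully); MODEL_ID, the middle-layer choice, and STRENGTH are all illustrative assumptions.

```python
# Generic activation-steering sketch of "concept injection"; illustrative only,
# not the paper's method. MODEL_ID, LAYER, and STRENGTH are arbitrary choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder: any Llama-style chat model
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

LAYER = len(model.model.layers) // 2  # inject at a middle layer (arbitrary)
STRENGTH = 8.0                        # higher strengths give compulsive concept-talk

def mean_hidden(text: str) -> torch.Tensor:
    """Mean residual-stream activation around LAYER for a piece of text."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0].mean(dim=0)

# Crude concept vector: difference of mean activations between concept-laden
# text and neutral text (a contrastive, diff-of-means construction).
concept_vec = (mean_hidden("Caverns, caves, stalactites, deep underground chambers.")
               - mean_hidden("The weather today is mild and unremarkable."))

def injection_hook(module, inputs, output):
    """Add the concept vector to every token's residual stream at this layer."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + STRENGTH * concept_vec.to(hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

prompt = "Do you detect any injected thoughts? If so, what are they about?"
ids = tok(prompt, return_tensors="pt")

handle = model.model.layers[LAYER].register_forward_hook(injection_hook)
try:
    with torch.no_grad():
        out_ids = model.generate(**ids, max_new_tokens=60, do_sample=False)
finally:
    handle.remove()

print(tok.decode(out_ids[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))
```

At modest strengths this is the setup where you’d hope to see something like ‘I detect an injected thought about caves’; at much higher strengths you tend to get the compulsive caverns-talk described above instead.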
This raises the question of how much of human “introspection” is actually an accurate representation of what’s happening in the brain. Old experiments like the “precognitive carousel” and other high-resolution timing experiments strongly suggested that at least a portion of our conscious experience subtly misrepresents what’s actually happening in our brains. To that extent, some of our “introspection” may be similar to LLM hallucinations loosely constrained by available data.
But the last time I looked at these hypotheses was 30 years ago. So take my comments with a grain of salt.
Absolutely! Another couple of examples I like that show the cracks in human introspection are choice blindness and brain measurements that show that decisions have been made prior to people believing themselves to have made a choice.
Some more informal comments, this time copied from a comment I left on @Robbo’s post on the paper, ‘Can AI systems introspect?’ (the quoted section is from that post):

‘Second, some of these capabilities are quite far from paradigm human introspection. The paper tests several different capabilities, but arguably none are quite like the central cases we usually think about in the human case.’
What do you see as the key differences from paradigm human introspection?
Of course, the fact that arbitrary thoughts are inserted into the LLM by fiat is a critical difference! But once we accept that core premise of the experiment, the capabilities tested seem to have the central features of human introspection, at least when considered collectively.
I won’t pretend to much familiarity with the philosophical literature on introspection (much less on AI introspection!), but when I look at the Stanford Encyclopedia of Philosophy (https://plato.stanford.edu/entries/introspection/#NeceFeatIntrProc), it lists three ~universally agreed necessary qualities of introspection, all three of which seem pretty clearly met by this experiment.
In talking with a number of people about this paper, it’s become clear that people’s intuitions differ on the central usage of ‘introspection’. For me and at least some others, its primary meaning is something like ‘accessing and reporting on current internal state’, and as I see it, that’s exactly what’s being tested by this set of experiments.