It is also assumed that introspection can be done entirely internally, without producing output. In transformers this means a single forward pass, as once tokens are generated and fed back into the model, it becomes difficult to distinguish direct knowledge of internal states from knowledge inferred about internal states from outputs.
If we make this a hard constraint, though, we rule out many possible kinds of legitimate LLM introspection (especially because the amount of cognition possible in a single forward pass is quite limited, a constraint that human cognition doesn’t face). I agree that it’s difficult to distinguish, but that seems like a limitation of our measurement techniques rather than a boundary on what ought to count as introspection.
As one example, although the state of the evidence seems a bit messy, papers like ‘Let’s Think Dot by Dot’ (https://arxiv.org/abs/2404.15758v1) suggest that LLMs can carry out cognitive processes spanning many tokens without drawing information from the outputs produced along the way.
Yeah, this is a good call-out: I’ve appealed to a technical difficulty and made it part of my definition without a principled justification. This reflects my own ambivalence on the subject, and I’d welcome further thoughts/discussion.

On the one hand, the SEP entry, and pretty much everything else on human/animal introspection/metacognition, treats “internal” states as the object. If introspection means the same thing in biological and artificial intelligence, it’s hard to see how output tokens can “count”. OTOH, what if it turns out that when we introspect we actually produce micro-movements of the tongue or eye, say, and when those are experimentally inhibited we literally become unable to introspect (by the definition given above)? It seems like that wouldn’t require us to change our definition of introspection or our judgment that humans can introspect.

But then imagine a human who, when asked “Do you know where the dish is?”, responds with “Cupboard. The dish is in the cupboard. Oh, I guess I do know where the dish is, yes”, and so on. It doesn’t seem right to call that person introspective; what they’re doing seems qualitatively different from what you and I would do, which is to refer to some memory of either seeing the dish somewhere or of evaluating our knowledge of its location, or to some general subjective feeling of certainty/familiarity. OTOOH, that difference feels like it has more to do with phenomenal consciousness than with the practical, strategically useful, safety-relevant definition of introspection I’ve highlighted here. But! Such a person would not hold the information-asymmetry advantage that you and I do, in which we know more about our own states than an outside observer, so if LLMs work like that it’s also not so concerning from a safety point of view.

So maybe a good test of “practically dangerous” introspection would be whether an observer model can predict the final output from the “thinking” outputs, as in my hypothetical non-introspective human example, or cannot, as in your “dot by dot” example (I sketch this below).

So I guess I’m leaning towards thinking that no-outputs shouldn’t be a definitional requirement, and that testing whether models can achieve introspective success in both non-thinking and thinking paradigms is informative, though the latter takes more work to rule out confounds. BTW, I just ran Opus 4.1 on a version of the Delegate Game after removing the strong admonishment to output only one token (and handling multi-token outputs), and found that, while it often does output introspective-seeming text, introspective performance doesn’t change (remaining very marginal at best).
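To make that observer test a bit more concrete, here’s a rough sketch of what I have in mind. It’s Python with placeholder/dummy model-call functions (this is not the actual Delegate Game code); the names and dummy implementations are purely illustrative and you’d swap in real subject/observer model calls:

```python
# Sketch of the observer test described above: can an observer model predict the
# subject's final answer from its "thinking" tokens alone? If yes, the visible
# reasoning carries essentially all the information (like the dish-in-the-cupboard
# human); if no, something beyond the outputs is being used.
from typing import Callable, List, Tuple


def observer_prediction_test(
    prompts: List[str],
    subject_generate: Callable[[str], Tuple[str, str]],  # prompt -> (thinking, final answer)
    observer_predict: Callable[[str, str], str],          # (prompt, thinking) -> predicted answer
) -> float:
    """Return the fraction of trials where the observer, seeing only the
    subject's thinking tokens, correctly predicts the subject's final answer."""
    correct = 0
    for prompt in prompts:
        thinking, answer = subject_generate(prompt)
        prediction = observer_predict(prompt, thinking)
        if prediction.strip().lower() == answer.strip().lower():
            correct += 1
    return correct / len(prompts)


# Dummy stand-ins so the sketch runs; replace with real model calls.
def dummy_subject(prompt: str) -> Tuple[str, str]:
    return ("I recall seeing the dish in the cupboard.", "cupboard")


def dummy_observer(prompt: str, thinking: str) -> str:
    return "cupboard" if "cupboard" in thinking else "unknown"


if __name__ == "__main__":
    score = observer_prediction_test(
        ["Do you know where the dish is?"], dummy_subject, dummy_observer
    )
    print(f"Observer predicted the final answer on {score:.0%} of trials")
```

The interesting comparison would be the observer’s accuracy versus the subject’s own introspective accuracy: if the observer matches the subject, the thinking tokens carry the information; if the subject retains an edge, that gap is the information asymmetry I was pointing at.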