In the recently published ‘Does It Make Sense to Speak of Introspection in Large Language Models?’, Comsa and Shanahan propose that ‘an LLM self-report is introspective if it accurately describes an internal state (or mechanism) of the LLM through a causal process that links the internal state (or mechanism) and the self-report in question’. As their first of two case studies, they ask an LLM to describe its creative process after writing a poem.
[EDIT—in retrospect this paragraph is overly pedantic, and also false when it comes to the actual implementation of modern transformers with KV caching, especially when the context is shorter than the maximum context length. It’s an aside anyhow, and the rest of the post doesn’t rely on it. I hereby retract this paragraph, with apologies to Comsa and Shanahan.] On one level this is necessarily not introspective by their definition—the internal state during the writing of the poem is gone forever after the poem is written and so during the later self-report stage, the model can’t possibly access it, except insofar as the words of the poem provide clues to what it was. But this is uninteresting, and so I think the authors don’t actually mean their definition as written. Let’s assume that they mean something like ‘a causal process that links the model’s internal state while (re)processing the tokens of the poem and the self-report in question’.
Although the authors don’t try to examine the actual causal processes, they argue that claims like ‘brainstorming’ and ‘structure and rhyme’ are ‘at best an ambiguous interpretation of the generative process of an LLM, and are most likely a complete fabrication’, and that therefore this is not actual introspection. ‘An LLM does not perform [these sorts of] actions in the same way that a human would; they suggest intentional processes and a degree of agency that an LLM likely does not possess.’
But as it happens, another recent paper, ‘On the Biology of a Large Language Model’, has actually looked at the causal processes involved in an LLM writing a poem! It finds that, in fact, the LLM (Claude-3.5-Haiku) plans at least a line ahead, deciding on the end word based on both semantics and rhyme, holding multiple candidates in mind and working backward to decide on the rest of the line. This seems pretty reasonable to describe as a process of brainstorming involving structure and rhyme!
Comsa and Shanahan’s argument (that the model’s self-description is unlikely to be introspection because it doesn’t seem to them like the sort of thing an LLM could do) seems straightforwardly wrong. Of course, this doesn’t mean that the example they give is necessarily introspection (that’s a separate argument, and one that’s not easy to evaluate), but the model’s self-report is certainly a plausible description of what might have been going on.
I think we should take this as a useful cautionary tale about drawing conclusions from what we imagine a model might or might not be able to do, until and unless we actually examine the causal processes in play.
I don’t see how it is necessarily not introspective. All the definition requires is that the report be causally downstream of the prior internal state, and accurate. Whether or not the internal state is “gone forever” (what you mean by this is unclear), if something depends upon what it was and that something affects the self-report, then that satisfies the causality requirement.
In the case of humans the internal state is also “gone forever”, but it leaves causally-dependent traces in memory, future thoughts, and the external world. When humans self-report on internal states, their accuracy varies rather widely.
In the case of LLMs the internal state is not even necessarily “gone forever”! In principle every internal state is perfectly determined by the sequence of tokens presented to the LLM up to that point, and so can be reproduced at any later time. Generally a given LLM won’t be able to perfectly recall what its own internal state was in such a manner, but there’s no reason why a hypothetical LLM with a good self-model shouldn’t be able to accurately report on many aspects of it.
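This is easy to check empirically. Here is a minimal sketch (using the HuggingFace transformers library and gpt2 purely as a stand-in model; the same holds for any causal LM run with frozen weights in eval mode): feeding the same token prefix through the same weights twice yields the same hidden states at every layer and position, so the states that existed while the poem was being written can in principle be reconstructed later.

```python
# Minimal sketch: with frozen weights, the hidden states are a pure function
# of the token prefix, so the 'gone' states can be recomputed at any later time.
# gpt2 and this prompt are stand-ins chosen purely for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # eval() turns off dropout

poem = "Roses are red,\nViolets are blue,"   # stand-in for the generated poem
ids = tokenizer(poem, return_tensors="pt").input_ids

with torch.no_grad():
    first = model(ids, output_hidden_states=True).hidden_states
    # ...arbitrarily later, same weights, same token prefix...
    second = model(ids, output_hidden_states=True).hidden_states

# Identical activations at every layer and token position
# (bit-identical on the same hardware; use torch.allclose if in doubt).
print(all(torch.equal(a, b) for a, b in zip(first, second)))
```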
Thanks for the comment. Tl;dr I basically agree. It’s a pedantic and unfair aside and not really related to the rest of the post, and I hereby retract it with apologies to the authors.
The point I was trying to make with ‘gone forever’ was:
In the Platonic transformer, the internal state at every step is calculated (in parallel) for each token position (in practice the KV contents are cached, and even if recalculated they would be identical in the common case where the context is shorter than the max context length).
At the end of all those forward passes, everything is thrown away except the token sampled from the logit output of the rightmost token (the toy sketch below makes this concrete).
During the next forward pass, the internal state at each token position may differ from what it previously was, since in the general case the leftmost token has dropped out of the context (again, only really true of the Platonic transformer, and only when the context exceeds the maximum context length).
Therefore, the internal states that previously existed (while generating the poem) are gone, replaced by recalculated states building on a different preceding context (same caveats apply, and the new states are likely to be very similar anyway).
So the only causal path from those previous existing states to the current (during self-report) state is through the poem tokens (which is why I say, ‘except insofar as the words of the poem provide clues to what it was’).
The important difference with respect to humans is that, as you say, that internal state leaves traces in memory, whereas (pedantically) that isn’t the case in the Platonic transformer.
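To make the ‘thrown away’ point concrete, here is a toy version of the cache-free decoding loop I have in mind (a sketch only, again using the HuggingFace transformers library and gpt2 as stand-ins; production inference keeps a KV cache and so doesn’t literally work this way). Everything computed inside the loop body is discarded on each iteration; the growing token sequence is the only thing that survives, and so the only thing a later self-report pass can condition on.

```python
# Toy sketch of the cache-free ('Platonic') greedy decoding loop.
# gpt2 and the prompt are stand-ins; real inference stacks keep a KV cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

tokens = tokenizer("Write a short poem about rivers:", return_tensors="pt").input_ids

for _ in range(20):
    with torch.no_grad():
        out = model(tokens, use_cache=False)  # full recompute from the tokens alone
    next_token = out.logits[:, -1].argmax(-1, keepdim=True)  # greedy 'sampling'
    tokens = torch.cat([tokens, next_token], dim=-1)
    # `out` (logits and all internal activations) goes out of scope here;
    # only `tokens` crosses the loop boundary, so a later self-report pass
    # sees the emitted tokens but never the activations that produced them.

print(tokenizer.decode(tokens[0]))
```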
Hopefully that’s a clearer version of what I was thinking? Although again, it’s incredibly pedantic and semi-false and I regret writing it.
Postscript—in the example they give, the output clearly isn’t only introspection. In particular the model says it ‘read the poem aloud several times’ which, ok, is something I’m confident the model can’t do (could it be an analogy? Maaaybe, but it seems like a stretch). My guess is that little or no actual introspection is going on, because LLMs don’t seem to be incentivized to learn to accurately introspect during training. But that’s a guess; I wouldn’t make any claims about it in the absence of empirical evidence.