I don’t see how it is necessarily not introspective. All the definition requires is that the report be causally downstream of the prior internal state, and accurate. Whether or not the internal state is “gone forever” (what you mean by this is unclear), if something depends upon what it was and that something affects the self-report, then that satisfies the causality requirement.
In the case of humans the internal state is also “gone forever”, but it leaves causally-dependent traces in memory, future thoughts, and the external world. When humans self-report on internal states, their accuracy varies rather widely.
In the case of LLMs the internal state is not even necessarily “gone forever”! In principle every internal state is perfectly determined by the sequence of tokens presented to the LLM up to that point, and so can be reproduced at any later time. Generally a given LLM won’t be able to perfectly recall what its own internal state was in this way, but there’s no reason why a hypothetical LLM with a good self-model shouldn’t be able to accurately report on many aspects of it.
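As a toy illustration of that determinism (using a small randomly initialized stand-in for a real LLM, so every name and size below is just a placeholder): run the same token sequence through the same weights twice and the internal states come out identical, so the “old” state can in principle be regenerated on demand.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in for an LLM: an embedding plus one causal self-attention layer.
vocab_size, d_model = 50, 16
embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
layer.eval()  # disable dropout so the forward pass is deterministic

def hidden_states(tokens):
    """Internal state at each position: a pure function of the token prefix."""
    causal_mask = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
    with torch.no_grad():
        return layer(embed(tokens), src_mask=causal_mask)

tokens = torch.tensor([[3, 14, 15, 9, 2, 6]])   # "the poem so far"
state_then = hidden_states(tokens)              # states while writing the poem
state_now = hidden_states(tokens)               # recomputed later from the same tokens
print(torch.equal(state_then, state_now))       # True: the state is fixed by the tokens
```

Of course a real LLM can’t re-run its own forward pass like this from the inside; the point is only that nothing about the state is irrecoverably lost at the level of the maths.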
Thanks for the comment. TL;DR: I basically agree. It’s a pedantic and unfair aside, not really related to the rest of the post, and I hereby retract it with apologies to the authors.
The point I was trying to make with ‘gone forever’ was:
In the Platonic transformer, at every generation step the internal state is calculated afresh (in parallel) for each token position (in practice the KV contents are cached, and even if they were recalculated they would be identical in the common case where the context is shorter than the maximum context length).
At the end of each of those forward passes, everything is thrown away except the token sampled from the logits at the rightmost position.
During the next forward pass, the internal state at each token position may differ from what it previously was, since in the general case the leftmost token has dropped out of the context (again, only really true of the Platonic transformer, and only when the context exceeds the maximum context length).
Therefore, the internal states that previously existed (while generating the poem) are gone, replaced by recalculated states building on a different preceding context (same caveats apply, and the new states are likely to be very similar anyway).
So the only causal path from those previously existing states to the current state (during self-report) is through the poem tokens (which is why I say, ‘except insofar as the words of the poem provide clues to what it was’).
The important difference with respect to humans is that, as you say, that internal state leaves traces in memory, whereas (pedantically) that isn’t the case in the Platonic transformer.
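To make that concrete, here is a toy sketch of the cache-free decoding loop I have in mind (the model here is a hypothetical stand-in, and greedy argmax stands in for sampling, just to keep it short): each step recomputes every internal state from the current window, keeps only the sampled token, and discards everything else.

```python
import torch

def platonic_generate(model_step, tokens, max_context, n_new):
    """Cache-free ('Platonic') decoding: every step recomputes the internal
    state at every position from scratch and keeps only the sampled token."""
    for _ in range(n_new):
        window = tokens[-max_context:]          # leftmost tokens fall out of context
        logits = model_step(window)             # all internal states exist only inside this call...
        next_token = int(torch.argmax(logits))  # ...then vanish, except for this one sample
        # The only causal trace the discarded states leave is next_token itself.
        tokens = tokens + [next_token]
    return tokens

# Toy usage with a dummy 'model': logits are a deterministic function of the window.
def toy_model_step(window):
    gen = torch.Generator().manual_seed(hash(tuple(window)) % (2**31))
    return torch.rand(50, generator=gen)        # pretend 50-token vocabulary

print(platonic_generate(toy_model_step, [3, 14, 15], max_context=4, n_new=5))
```

As long as the window is shorter than max_context the recomputed states would be identical anyway (the same caveat as above); the ‘gone forever’ intuition only bites once tokens start dropping off the left edge.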
Hopefully that’s a clearer version of what I was thinking? Although again, it’s incredibly pedantic and semi-false and I regret writing it.