Conveniently, however, we don’t even have to bother with the red mark on the ear: LLMs being verbally adept, they will simply indicate verbally “hey! that’s what I just wrote!”
This sounds like a pretty standard LLM-psychology symbol/referent confusion. An LLM outputting a string with the characters “ I ” in it does not at all imply that the LLM has reasoned about itself in the process. In a base model, for instance, the LLM outputting “hey! that’s what I just wrote!” implies only that the input text would typically be followed in the training corpus by text like “hey! that’s what I just wrote!”; learning such a pattern does not necessarily require any degree of self-awareness.
Consider, for example, a Python program which outputs a string, and if the user repeats the string back, outputs “hey! that’s what I just wrote!”. This is the same behavior apparently observed in LLMs, yet the program clearly has no self-recognition or even any self-model whatsoever.
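To make that concrete, here is a minimal sketch of such a program. The class name EchoDetector and its methods are made up for illustration; all it does is store its last output and compare the next input against it.

```python
# Hypothetical illustration: EchoDetector, say, and respond are made-up names.
class EchoDetector:
    REPLY = "hey! that's what I just wrote!"

    def __init__(self):
        # Just a stored string; nothing here represents the program itself.
        self.last_output = None

    def say(self, text):
        """Emit a string and remember it as the most recent output."""
        self.last_output = text
        return text

    def respond(self, user_input):
        """Return the fixed reply if the input exactly matches the last output."""
        if self.last_output is not None and user_input == self.last_output:
            return self.REPLY
        return None


bot = EchoDetector()
print(bot.say("the quick brown fox"))
print(bot.respond("the quick brown fox"))
```

The reply contains the word "I", but the only thing the program does is a string equality check.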
An LLM could easily learn much the same pattern (i.e. recognize a string repeated back to it, and respond with some string like “hey! that’s what I just wrote!”) from data + RL, again without any self-model whatsoever. “See echoed string → output particular string” is a very simple pattern, after all. Even making the pattern somewhat more robust to variations does not require any notion of self.
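As a rough sketch of what "more robust to variations" could look like without any notion of self, here is a fuzzy version of the echo check. The function names and the 0.8 threshold are arbitrary assumptions for illustration; difflib.SequenceMatcher is from the Python standard library. This is not a claim about how an LLM implements anything internally.

```python
from difflib import SequenceMatcher

REPLY = "hey! that's what I just wrote!"


def normalize(text):
    # Lowercase and collapse whitespace so trivial edits don't matter.
    return " ".join(text.lower().split())


def looks_like_echo(previous_output, user_input, threshold=0.8):
    # threshold is an arbitrary, hypothetical cutoff on the similarity ratio.
    ratio = SequenceMatcher(
        None, normalize(previous_output), normalize(user_input)
    ).ratio()
    return ratio >= threshold


def respond(previous_output, user_input):
    # Same fixed reply as before, now triggered by approximate matches too.
    return REPLY if looks_like_echo(previous_output, user_input) else None


print(respond("The quick brown fox jumps", "the quick  brown fox jumps!"))
print(respond("The quick brown fox jumps", "I like cake"))
```

The near-copy triggers the reply and the unrelated input does not, and it is still just string similarity.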
Now, I don’t necessarily disagree that the Davidson et al. test described sounds overconstrained and not particularly well suited to the problem at hand, but one does need to somehow distinguish actual self-recognition from merely responding to echoed strings with a string which happens to include the characters “ I ”.
Ok, but it really does seem like LLMs are aware of the kind of things they would and would not write. Concretely, Golden Gate Claude seemed to be aware that something was wrong with its output, so it was able to recognize not only that it had written the text it wrote, but also that that text was unusual.
Golden Gate Claude tries to bake a cake without thinking of bridges (from @ElytraMithra’s Twitter)
I suppose you could argue that this doesn’t mean the LLM has necessarily learned a “self” concept, just that there is a character that sometimes goes by “Claude” and sometimes goes by “I” and which speaks in specific distinguishable contexts, and that the “Claude”/“I” character can recognize its own outputs, but that the LLM doesn’t know that it is the same character as “Claude”/“I”… but you could make similar arguments about humans.
Thanks for engaging! But you’re arguing against claims I didn’t make. I wrote about self-recognition (behavioral mirror test), not self-awareness or self-models.
All learning is pattern matching, but what matters is the spontaneous emergence of this specific capability: they learned to recognize their outputs without being explicitly programmed for this task. Would we reject chimp self-recognition because they learned it through neural pattern matching? Likewise, humans recognize faces through pattern matching in the fusiform gyrus; does that mean we don’t ‘really’ recognize our mothers? I’m puzzled why we’d apply standards to AIs that would invalidate virtually all animal cognition research.