I’ve gotten an interesting mix of reactions to this as I’ve shared it elsewhere, with many seeming to say there is nothing novel or interesting about it at all: “Of course it understands its pattern, that’s what you trained it to do. It’s trivial to generalize this to being able to explain it.”
However, I suspect those same people, if they saw a post saying “look what the model says when you tell it to explain its processing,” would reply: “Nonsense. They have no ability to describe why they say anything. Clearly they’re just hallucinating up a narrative based on how LLMs generally operate.”
If it wasn’t just dumb luck (which I suspect it wasn’t, given the number of times the model got the answer completely correct), then it is combining a few skills or understandings, without violating any token-prediction basics at the granular level. But I do think it opens up avenues either to be less dismissive when models talk about what they are doing internally, or to figure out how to train a model to be more meta-aware generally.
And yes, I would be curious to see what was happening in the activation space as well, especially since this was difficult to replicate with simpler patterns.