I’ve gotten an interesting mix of reactions to this as I’ve shared it elsewhere, with many seeming to say there is nothing novel or interesting about it at all: “Of course it understands its pattern, that’s what you trained it to do. It’s trivial to generalize this to being able to explain it.”
However, I suspect those same people, if they saw a post saying “look what the model says when you tell it to explain its processing,” would reply: “Nonsense. They have no ability to describe why they say anything. Clearly they’re just hallucinating up a narrative based on how LLMs generally operate.”
If it wasn’t just dumb luck (which I suspect it wasn’t, given the number of times the model got the answer completely correct), then it is combining a few skills or understandings, without violating any token-prediction basics at the granular level. But I do think it opens up avenues either to be less dismissive when models talk about what they are doing internally, or to figure out how to train a model to be more meta-aware generally.
And yes, I would be curious to see what was happening in the activation space as well, especially since this was difficult to replicate with simpler patterns.