But as far as I can tell, there are like 10x more safety people looking into interpretability than into how models generalize from data.)
An intriguing observation. But the ability to extrapolate accurately outside the training data is a result of building accurate world models. So to understand this, we’d need to understand the sorts of world models that LLMs build and how they interact. I’m having some difficulty immediately thinking of a way of studying that that doesn’t require first being a lot better at interpretability than we are now. But if you can think of one, I’d love to hear it.
I’m having some difficulty immediately thinking of a way of studying that
Pretty sure that’s not what 1a3orn would say, but you can study efficient world-models directly to grok that. Instead of learning about them through the intermediary of extant AIs, you can study the very thing that these AIs are trying to approximate ever better.
See my (somewhat outdated) post on the matter, plus the natural-abstractions agenda.