Section 7.9 of the Claude Mythos Preview System Card has Anthropic describing how Mythos generated novel puns and began to prefer particular philosophers, while the Opuses recycled puns found online. How plausible is it that novel OOD understanding actually scales with LLM size?
I would probably consider “novel” puns to be within-distribution, even if not memorized puns.
But honestly, I think these examples are just generally hard to make sense of, since we don’t have access to their training setup or data (is it a type of pun interpolated across many languages? How much does it relate to true novelty in complex, long-horizon domains?). I could see scale being useful for interpolating these new puns while not necessarily being relevant to what is needed for ASI. Or, scale could actually be making progress towards these sorts of capabilities! The role of scale just seems overstated (at least pre-Mythos, which I can’t test), and I feel like that overstatement poisons research selection and experiment interpretation.
Scale is obviously helpful, but imo there is more nuance to it than many people properly consider. I’m asking that we try to be more precise about all of this.
For example, I think Talkie-1930 (a model trained on pre-1930s data) is a great example of generalization research (though yes, it does not say much about frontier scaling)! It helps us better understand generalization. However, I saw implied claims that the model was able to solve a Python problem via ICL, and when you look at the details of the experiment, the OOD generalization coding example feels dubious. From @Steven Byrnes (link to his post / my take):
I was surprised and puzzled by this, because I’m a general skeptic of (so-called) “in-context learning”—I generally say that LLMs have decent “understanding” of what’s in their weights, but quite sketchy & superficial “understanding” of stuff in the context window but NOT the weights. The context window can really only support “recognition” of things that the weights already “understand”. Or at least, that’s what I’ve been saying for years.
So how is Talkie-1930 doing any Python at all? I was puzzled.
…But then I looked at the example. All that Talkie did was exactly copy the example in context but switch “+ 5” to “– 5”. (And it got it right at least once given 100 tries!)
I can definitely imagine that a person who has read every pre-1930 book on symbolic logic, cryptography, and the rest of math (& jacquard looms etc.) could guess that answer, at a glance, given 100 tries, while remaining deeply deeply confused about wtf was going on in the code snippet, and while “understanding” zero Python (and zero code) in any real sense.
So I don’t think this one example gives me any new reason to change my mind about the (lack of) power of (so-called) in-context learning, or to be suspicious of data leakage, subliminal learning, etc. in Talkie-1930.
(Cool project, kudos to the authors.)
I feel like I see examples like this all the time! I expect this is often because there’s some sort of bias towards trying to ‘warn the world about what is coming’, which leads people in AI safety to overstate such results and muddies our understanding of what is happening.
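To make the "at least once given 100 tries" point from the quote a bit more concrete, here is a minimal sketch of the arithmetic. It assumes every try is independent and has the same per-try success probability `p`; both the independence and the specific values of `p` are my assumptions for illustration, not numbers from the Talkie-1930 writeup.

```python
def at_least_once(p: float, k: int = 100) -> float:
    """Probability of at least one success in k independent tries,
    each succeeding with probability p (independence is an assumption)."""
    return 1.0 - (1.0 - p) ** k


# Hypothetical per-try success rates, chosen only for illustration.
for p in (0.001, 0.01, 0.05):
    prob = at_least_once(p)
    print(f"p = {p:.3f} -> P(at least one success in 100 tries) = {prob:.3f}")
```

Under these assumptions, a per-try success rate of only 1% already gives roughly a 63% chance of at least one correct answer in 100 tries, so a single hit out of 100 samples says little about whether the model reliably "understands" the task.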