Oh, there’s tons and tons of this kind of data online, I bet. Even GPT-3 could do ‘ELI5’, remember (and I wouldn’t be surprised if GPT-2 could too, since it could do ‘tl;dr’). You have stuff like Simple English Wiki, you have centuries of children’s literature (which will often come with inline metadata like “Newbery Award winner” or “a beloved classic of children’s literature” or “recommended age range: 6-7yo”), you have children’s dictionaries (‘kid dictionary’, ‘student dictionary’, ‘dictionary for kids’, ‘elementary dictionary’), you have lots of style-parody text-transfer examples where someone rewrites “X but if it were a children’s novel”, you have ‘young adult literature’ as an intermediate, textbook anthologies of writing aimed at specific grades, micro-genres like “Anglish” or “Up-Goer-Five” (the latter aimed partially at children)...
No, there’s nothing impressive or ‘generalizing’ about this. This is all well within-distribution.
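(For what it’s worth, the ‘ELI5’ claim is easy to check directly. Here is a minimal sketch, assuming the current OpenAI Python client; the model name is a placeholder, and none of this comes from the original comment.)

```python
# Minimal sketch: probe whether a model picks up register cues like 'ELI5'
# or 'for children'. Assumes the openai>=1.0 Python client is installed and
# OPENAI_API_KEY is set; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

prompts = [
    "ELI5: what is a cat?",
    "Define 'cat' the way a children's dictionary would.",
    "Define 'cat' for an 11-year-old.",
]

for prompt in prompts:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; substitute whatever model is available
        messages=[{"role": "user", "content": prompt}],
        max_tokens=80,
    )
    print(f"--- {prompt}\n{resp.choices[0].message.content}\n")
```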
If anything, rather than being surprisingly good, the given definitions seem kinda… insulting and bad and age-inappropriate and like ChatGPT is condescending rather than generating a useful pedagogically-age-appropriate definition? Here’s an actual dictionary-for-children defining ‘cat’: https://kids.wordsmyth.net/we/?rid=6468&ent_l=cat
a small, furry mammal with whiskers, short ears, and a long tail. Cats, also called house cats, are often kept as pets or to catch mice and rats.
any of the larger wild animals related to the kind of cat kept as a pet. Tigers, lions, and bobcats are all cats. Cats are carnivorous mammals.
Which is quite different from
Cat: A soft, furry friend that says “meow” and loves to play and cuddle.
(this is more of a pre-k or toddler level definition)
or 11yo:
Cat: Cats are furry animals with pointy ears, a cute nose, and a long tail. They like to nap a lot, chase things like strings or toys, and sometimes purr when they’re happy.
Which is, er… I was a precociously hyper-literate 11yo, as I expect most people reading LW were, but I’m pretty sure even my duller peers in 6th or 7th grade in middle school, when we were doing algebra and setting up school-sized exhibits about the Apollo space race and researching it in Encyclopedia Britannica & Encarta and starting to upgrade to the adult dictionaries and AIM chatting all hours, would’ve been insulted to be given a definition of ‘cat’ like that...
I assume OP thought that there was some specific place in the training data the LLM was replicating.
Indeed, and my point is that that seems entirely probable. He asked for a dictionary definition of words like ‘cat’ for children, and those absolutely exist online and are easy to find, and I gave an example of one for ‘cat’.
(And my secondary point was that ironically, you might argue that GPT is generalizing and not memorizing… because its definition is so bad compared to an actual Internet-corpus definition for children, and is bad in that instantly-recognizable ChatGPTese condescending talking-down bureaucrat smarm way. No human would ever define ‘cat’ for 11yos like that. If it was ‘just memorizing’, the definitions would be better.)
Whatever one means by “memorize” is by no means self-evident. If you prompt ChatGPT with “To be, or not to be,” it will return the whole soliloquy. Sometimes. Other times it will give you an opening chunk and then an explanation that that’s the well-known soliloquy, etc. By poking around I discovered that I could elicit the soliloquy by giving it prompts consisting of syntactically coherent phrases, but if I gave it prompts that were not syntactically coherent, it didn’t recognize the source until I did a bit more prompting. I’ve never found the idea that LLMs were just memorizing to be very plausible.
In any event, here’s a bunch of experiments explicitly aimed at memorization, including the Hamlet soliloquy stuff: https://www.academia.edu/107318793/Discursive_Competence_in_ChatGPT_Part_2_Memory_for_Texts_Version_3
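(A rough sketch of that kind of probing, reconstructed rather than taken from the linked write-up: feed the model the same words with and without their syntax intact and see which prompt elicits the continuation. The client usage and model name are assumptions.)

```python
# Compare a syntactically coherent soliloquy fragment against a word-scrambled
# version of the same fragment as prompts, and see which one the model
# recognizes and continues.
import random
from openai import OpenAI

client = OpenAI()

coherent = ("To be, or not to be, that is the question: "
            "whether 'tis nobler in the mind to suffer")
words = coherent.split()
random.seed(0)
scrambled = " ".join(random.sample(words, len(words)))  # same words, no syntax

for label, prompt in [("coherent", coherent), ("scrambled", scrambled)]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=120,
    )
    print(f"--- {label} ---")
    print(resp.choices[0].message.content)
```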
I was assuming lots of places widely spread. What I was curious about was a specific connection in the available data between the terms I used in my prompts and the levels of language. gwern’s comment satisfies that concern.
Of course, but it does need to know what a definition is. There are certainly lots of dictionaries on the web. I’m willing to assume that some of them made it into the training data. And it needs to know that people of different ages use language at different levels of detail and abstraction. I think that requires labeled data, like children’s stories labeled as such.
It doesn’t require labeled data, and the developers don’t label the data. The LLM learns that these categories exist during training because it can, and doing so helps minimize the loss function.
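(A toy illustration of that point, not something from the comment itself: the pretraining objective is plain next-token cross-entropy over raw text, with no category labels anywhere. Any regularity that helps predict the next token, including “this passage reads like a children’s book”, lowers that loss.)

```python
# Toy illustration: the only training signal in pretraining is next-token
# cross-entropy over raw text. There are no labels for genre, audience, or age.
import torch
import torch.nn.functional as F

vocab_size = 50_000
seq_len = 8

logits = torch.randn(seq_len, vocab_size)               # model's prediction at each position
next_tokens = torch.randint(0, vocab_size, (seq_len,))  # the actual next tokens in the corpus

loss = F.cross_entropy(logits, next_tokens)  # this is the whole objective
print(loss.item())
```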
By labeled data I simply mean that children’s stories are likely to be identified as such in the data. Children’s books are identified as children’s books. Otherwise, how is the model to “know” what language is appropriate for children? Without some link between the language and a certain class of people, it’s just more text. My prompt specifies 5-year-olds. How does the model connect that prompt with a specific kind of language?
I don’t think there are necessarily any specific examples in the training data. LLMs can generalize to text outside of the training distribution.