I’m sure there will be many papers to come about GPT-3. This one is already 70 pages long, and must have come not long after the training of the model was finished, so a lot of your questions probably haven’t been figured out yet. I’d love to read some speculation on how, exactly, the few-shot-learning works. Take the word scrambles, for instance. The unscrambled word will be represented by one or two tokens. The scrambled word will be represented by maybe five or six much less frequent tokens composed of a letter or two each. Neither set of tokens contains any information about what letters make up the word. Did it see enough word scrambles on the web to pick up the association between every set of tokens and all tokenizations of rearranged letters? that seems unlikely. So how is it doing it? Also, what is going on inside when it solves a math problem?
I’m sure there will be many papers to come about GPT-3. This one is already 70 pages long, and must have come not long after the training of the model was finished, so a lot of your questions probably haven’t been figured out yet. I’d love to read some speculation on how, exactly, the few-shot-learning works. Take the word scrambles, for instance. The unscrambled word will be represented by one or two tokens. The scrambled word will be represented by maybe five or six much less frequent tokens composed of a letter or two each. Neither set of tokens contains any information about what letters make up the word. Did it see enough word scrambles on the web to pick up the association between every set of tokens and all tokenizations of rearranged letters? that seems unlikely. So how is it doing it? Also, what is going on inside when it solves a math problem?