I wrote this post about a year ago. It now strikes me as an interesting mixture of
1. Ideas I still believe are true and important, and which are (still) not talked about enough.
2. Ideas that were plausible at the time, but are much less so now.
3. Claims I made for their aesthetic/emotional appeal, even though I did not fully believe them at the time.
In category 1 (true, important, not talked about enough):
GPT-2 is a source of valuable evidence about linguistics, because it demonstrates various forms of linguistic competence that previously were only demonstrated by humans.
Much scholarly ink has been spilled over questions of the form “what would it take, computationally, to do X?”—where X is something GPT-2 can actually do. Since we now have a positive example, we should revisit these debates and determine which claims GPT-2 disproves, and which it supports.
Some of the key participants in those debates are not revisiting them in this way, and appear to think GPT-2 is entirely irrelevant to their work.
In category 2 (plausible then but not now):
“The structure of the transformer is somehow specially apt for language, relative to other architectures that were tried.”
I now think this is much less likely, thanks to the two OpenAI scaling papers released in 2020. (There is a brief sketch of the power-law form they fit at the end of this section.)
The first paper made it seem more plausible that LSTMs would behave like GPT-2 if given a much larger quantity of compute/data.
The second paper showed that what we know about transformers from the text domain generalizes very well to images, video, and math.
I now think transformers are just a “good default architecture” for our current compute regime, and may not have special linguistic properties.
I’m finding this difficult to phrase, but in 2019 I think I believed Gary Marcus had preconceptions similar to mine, but was misreading the evidence available at the time.
I now think he’s more committed to the idea that GPT-2-like approaches are fundamentally barking up the wrong tree, and will maintain this idea in the face of arbitrary feats of competence.
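To make the scaling-paper point a bit more concrete, here is a small sketch of the power-law form those papers fit, relating loss to model size. This is my own illustration, not code from the papers; the constants are the approximate language-modeling values I recall from Kaplan et al. (2020) and should be read as rough, not authoritative.

```python
# Illustrative sketch of the power-law form fit in the 2020 scaling papers.
# L(N) = (N_c / N) ** alpha_N  -- loss as a function of (non-embedding) parameter count.
# The constants below are the approximate language-modeling values reported by
# Kaplan et al. (2020); treat them as rough, not authoritative.

def predicted_loss(n_params: float,
                   n_c: float = 8.8e13,    # reported fit constant (parameters)
                   alpha_n: float = 0.076  # reported exponent
                   ) -> float:
    """Predicted test loss (nats/token) for a model with n_params parameters,
    assuming data and compute are not the bottleneck."""
    return (n_c / n_params) ** alpha_n

# The relevant point for the argument above: the fit is a smooth function of
# scale and says nothing architecture-specific about language. Doubling the
# parameter count buys a predictable improvement, whatever the model is.
for n in [1e8, 1e9, 1e10]:  # a few illustrative model sizes
    print(f"{n:.0e} params -> predicted loss ~ {predicted_loss(n):.2f}")
```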
In category 3 (misleading):
I overstated the similarity between what Marcus wanted in 2001, and what has actually occurred.
I think Marcus wanted neural nets to be designed in a very top-down, constrained way, baking in lots of human prior knowledge.
ConvNets do bake in (a very simple kind of) prior knowledge; there is a short sketch of what I mean just after this list.
But, though LSTMs and transformers are more “structured” than fully connected nets, the structure is not intended to encode prior knowledge.
Nothing in the recently successful architectures looks like the deliberate design, aimed at enforcing known linguistic regularities, that Marcus argued for.
I was aware of the vast gap between “more structure than the literal minimum possible” and “the kind of structure Marcus wanted,” but I conflated the two anyway. Possibly because I found the resulting irony appealing, and/or because the suggestion that the disagreement was largely illusory was itself emotionally appealing.
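To spell out the ConvNet point: the “prior knowledge” a ConvNet bakes in is roughly “the same local pattern means the same thing wherever it appears,” enforced by reusing one small set of weights at every position instead of learning separate weights per position. Below is a minimal NumPy sketch of that weight-sharing idea (my own illustrative example, not anyone’s proposed architecture); note how little it resembles the kind of explicit linguistic rule Marcus wanted built in.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)  # a toy 1-D input with 16 "positions"

# Fully connected layer: every (input position, output position) pair gets its
# own weight. Nothing in the layer says "position doesn't matter."
W_dense = rng.normal(size=(16, 16))
dense_out = W_dense @ x

# Convolutional layer: ONE small filter is reused at every position. The weight
# sharing itself is the "prior knowledge": patterns are assumed to mean the same
# thing regardless of where they occur (translation equivariance). That's the
# whole prior -- no linguistic rules anywhere.
w_conv = rng.normal(size=3)  # a single 3-tap filter, shared across positions
conv_out = np.convolve(x, w_conv, mode="valid")

print(dense_out.shape, W_dense.size)  # (16,) with 256 free parameters
print(conv_out.shape, w_conv.size)    # (14,) with only 3 free parameters
```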
In sum, I still like the writing and humor in this post, and I think it makes some important observations, but I also think it leaves the reader with some importantly wrong impressions.
I think Gary Marcus wanted AI research to uncover lots of interesting rules like “in English, you make verbs past tense by adding -ed, except …”, because he wants to know what the rules are, and because a world in which engineering follows psycholinguistic research is much more appealing to him than one in which the research trails the engineering. Machine learning (without interpretability) doesn’t give us any tools to learn what the rules are.
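To illustrate what “knowing what the rules are” might look like, here is a toy, explicitly rule-based past-tense function. The rule and the exception list are deliberately oversimplified (this is my illustration, not a serious analysis of English morphology); the point is only that a researcher can read the rules straight off the code, which is exactly what a trained language model, absent interpretability work, does not give you.

```python
# Toy illustration of an explicit, human-readable morphological rule.
# Deliberately oversimplified: real English past-tense formation involves many
# more rules and exceptions than this.

IRREGULAR = {  # a tiny, incomplete exception list
    "go": "went",
    "eat": "ate",
    "run": "ran",
    "be": "was",
}

def past_tense(verb: str) -> str:
    """Regular rule: add -ed (with minor spelling adjustments), except for
    listed irregular verbs. Every step is inspectable and statable."""
    if verb in IRREGULAR:
        return IRREGULAR[verb]
    if verb.endswith("e"):
        return verb + "d"          # bake -> baked
    if verb.endswith("y") and verb[-2] not in "aeiou":
        return verb[:-1] + "ied"   # try -> tried
    return verb + "ed"             # walk -> walked

print([past_tense(v) for v in ["walk", "bake", "try", "go"]])
# ['walked', 'baked', 'tried', 'went']
```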
Maybe add a disclaimer at the start of the post?