These quotes from “When ChatGPT Broke an Entire Field: An Oral History” stood out to me:

IZ BELTAGY (lead research scientist, Allen Institute for AI; chief scientist and co-founder, SpiffyAI): In a day, a lot of the problems that a large percentage of researchers were working on — they just disappeared. …
R. THOMAS MCCOY: It’s reasonably common for a specific research project to get scooped or be eliminated by someone else’s similar thing. But ChatGPT did that to entire types of research, not just specific projects. A lot of higher categories of NLP just became no longer interesting — or no longer practical — for academics to do. …
IZ BELTAGY: I sensed that dread and confusion during EMNLP [Empirical Methods in Natural Language Processing], which is one of the leading conferences. It happened in December, a week after the release of ChatGPT. Everybody was still shocked: “Is this going to be the last NLP conference?” This is actually a literal phrase that someone said. During lunches and cocktails and conversations in the halls, everybody was asking the same question: “What is there that we can work on?”
NAZNEEN RAJANI (founder and CEO, Collinear AI; at the time a Ph.D. student studying with Ray Mooney): I had just given a keynote at EMNLP. A few days after that, Thom Wolf, who was my manager at Hugging Face and also one of the co-founders, messages me, “Hey, can you get on a call with me ASAP?” He told me that they had fired people from the research team and that the rest would either be doing pre-training or post-training — which means that you are either building a foundation model or you’re taking a foundation model and making it an instruction-following model, similar to ChatGPT. And he said, “I recommend you pick one of these two if you want to continue at Hugging Face.”
It didn’t feel like what the Hugging Face culture stood for. Until then, everyone was basically just doing their own research, what they wanted to do. It definitely felt not so good.
CHRISTOPHER CALLISON-BURCH: It helps to have tenure when something like this happens. But younger people were going through this crisis in a more visceral way. Some Ph.D. students literally formed support groups for each other.
LIAM DUGAN: We just kind of commiserated. A lot of Ph.D. students that were further on than me, that had started dissertation work, really had to pivot hard. A lot of these research directions, it’s like there’s nothing intellectual about them left. It’s just, apply the language model and it’s done.
Weirdly enough, nobody [I knew] quit. But there was a bit of quiet quitting. Just kind of dragging your feet or getting very cynical.
Wow. I knew academics were behind / out of the loop / etc., but this surprised me. I imagine these researchers had at least heard about GPT-2 and GPT-3 and the scaling-laws papers; I wonder what they thought of them at the time. I wonder what they think now about what they thought at the time.

The full article sort of explains the bizarre, Kafkaesque academic dance that went on from 2020 to 2022, and how the field talked about these changes.
for anyone not wanting to go in and see the Kafka, I copied some useful examples:
ANNA ROGERS: I was considering making yet another benchmark, but I stopped seeing the point of it. Let’s say GPT-3 either can or cannot continue [generating] these streams of characters. This tells me something about GPT-3, but that’s not actually even a machine learning research question. It’s product testing for free.
JULIAN MICHAEL: There was this term, “API science,” that people would use to be like: “We’re doing science on a product? This isn’t science, it’s not reproducible.” And other people were like: “Look, we need to be on the frontier. This is what’s there.”
TAL LINZEN (associate professor of linguistics and data science, New York University; research scientist, Google): For a while people in academia weren’t really sure what to do.
R. THOMAS MCCOY: Are you pro- or anti-LLM? That was in the water very, very much at this time.
JULIE KALLINI (second-year computer science Ph.D. student, Stanford University): As a young researcher, I definitely sensed that there were sides. At the time, I was an undergraduate at Princeton University. I remember distinctly that different people I looked up to — my Princeton research adviser [Christiane Fellbaum] versus professors at other universities — were on different sides. I didn’t know what side to be on.
LIAM DUGAN: You got to see the breakdown of the whole field — the sides coalescing. The linguistic side was not very trusting of raw LLM technology. There’s a side that’s sort of in the middle. And then there’s a completely crazy side that really believed that scaling was going to get us to general intelligence. At the time, I just brushed them off. And then ChatGPT comes out.
+1. GPT-3.5 had been publicly available since January, and GPT-3 was big news two years before and publicly available back then. I’m really surprised that people didn’t understand that these models were a big deal AND changed their minds when ChatGPT came out. Maybe it’s just a weird preference cascade, where this was enough to break a common false belief?

Something like: GPT-3.5/ChatGPT was qualitatively different.
I remember seeing the ChatGPT announcement and not being particularly impressed or excited, like “okay, it’s a refined version of InstructGPT from almost a year ago. It’s cool that there’s a web UI now, maybe I’ll try it out soon.” November 2022 was a technological advancement but not a huge shift compared to January 2022, IMO.
Fair enough. My mental image of the GPT models was stuck on that infernal “talking unicorns” prompt, which I think did make them seem reasonably characterized as mere “stochastic parrots” and “glorified autocompletes,” and the obvious bullshit about the “safety and security concerns” around releasing GPT-2 also led me to conclude the tech was unlikely to amount to much more. InstructGPT wasn’t good enough to get me to update it; that took the much-hyped ChatGPT release.
Was there a particular moment that impressed you, or did you just see the Transformers paper, project that correctly into the future, and the releases that followed since then have just been following that trend you extrapolated and so been unremarkable?
I remember being very impressed by GPT-2. I think I was also quite impressed by GPT-3 even though it was basically just “GPT-2 but better.” To be fair, at the moment that I was feeling unimpressed by ChatGPT, I don’t think I had actually used it yet. It did turn out to be much more useful to me than the GPT-3 API, which I tried out but didn’t find that many uses for.
It’s hard to remember exactly how impressed I was with ChatGPT after using it for a while. I think I hadn’t fully realized how great it could be when the friction of using the API was removed, even if I didn’t update that much on the technical advancement.
The full article discusses the transformer paper (which didn’t have a large influence, as the implications weren’t clear), BERT (which did have a large influence) and GPT-3 (which also had a large influence). I assume the release of ChatGPT was the point where even the last NLP researchers couldn’t ignore LLMs anymore.
ChatGPT was “so good they can’t ignore you”; the Hugging Face anecdote is particularly telling. At some point, everyone else gets tired of waiting for your cargo to land, and will fire you if you don’t get with the program. “You say semantics can never be learned from syntax and you’ve proven that ChatGPT can never be useful? It seems plenty useful to me and everyone else. Figure it out or we’ll find someone who can.”
I think the most interesting part of the Quanta piece is the discussion of the octopus paper, which argues that pure language models can’t actually understand text (as they only learn from form/syntax), and the bitter disputes that followed in the NLP community.
From the abstract:
The success of the large neural language models on many NLP tasks is exciting. However, we find that these successes sometimes lead to hype in which these models are being described as “understanding” language or capturing “meaning”. In this position paper, we argue that a system trained only on form has a priori no way to learn meaning. In keeping with the ACL 2020 theme of “Taking Stock of Where We’ve Been and Where We’re Going”, we argue that a clear understanding of the distinction between form and meaning will help guide the field towards better science around natural language understanding.

Emily M. Bender, the first author, was also first author of the subsequent “stochastic parrot” paper: On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜[1]

First paper I have seen that uses an emoji in its title.
(As a side note, Yudkowsky’s broadly verificationist theory of content seems to agree with her distinction: if “understanding” of a statement is knowing what experience would confirm it, or what experience it would predict, then understanding cannot come from syntactic form alone. The association of words and sensory data would be necessary. Did Yudkowsky ever comment on the apparent incompatibility between evident LLM understanding and his anticipated experience theory?)
Of course I assume that now it can hardly be denied that LLMs really do somehow understand text, even if they are merely trained on form. So the octopus paper argument must be wrong somewhere. Though at least in the Quanta piece, Bender doesn’t acknowledge any update of that sort. In fact, in the last quote she says:
I have seen an enormous shift towards end-to-end solutions using chatbots or related synthetic text-extruding machines. And I believe it to be a dead end.
I don’t think there’s any necessary contradiction. Verification or prediction of what? More data. What data? Data. You seem to think there’s some sort of special reality-fluid which JPEGs or MP3s have but .txt files do not, but they don’t; they all share the Buddha-nature.
Consider Bender’s octopus example, where she says that it can’t learn to do anything from watching messages go back and forth. This is obviously false, because we do this all the time; for example, you can teach an LLM to play good chess simply by watching a lot of moves fly by back and forth as people play postal chess. Imitation learning & offline RL are important use-cases of RL, and no one would claim they don’t work or are impossible in principle.
Can you make predictions and statements which can be verified by watching postal chess games? Of course. Just predict what the next move will be. “I think he will castle, instead of moving the knight.” [later] “Oh no, I was wrong! I anticipated seeing a castling move, and I did not, I saw something else. My beliefs about castling did not pay rent and were not verified by subsequent observations of this game. I will update my priors and do better next time.”
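The imitation-learning point can be made concrete with a toy sketch: even a bigram model over move tokens picks up regularities of chess purely from transcripts, never seeing a board or the rules. (The four-game corpus below is made up for illustration; a real experiment would train a sequence model on millions of postal-chess games.)

```python
# Toy illustration of "learning from form alone": a bigram model over
# chess-move tokens. It only ever sees transcripts -- no board, no rules --
# yet it learns regularities such as "after e4, e5 is the common reply".
from collections import Counter, defaultdict

# Hypothetical stand-in for a corpus of postal-chess transcripts.
games = [
    "e4 e5 Nf3 Nc6 Bb5",
    "e4 c5 Nf3 d6 d4",
    "e4 e5 Nf3 Nf6 Nxe5",
    "d4 d5 c4 e6 Nc3",
]

# Count which move tends to follow which, across all transcripts.
follows = defaultdict(Counter)
for game in games:
    moves = game.split()
    for prev, nxt in zip(moves, moves[1:]):
        follows[prev][nxt] += 1

def predict(move):
    """Predict the move most often seen after `move` in the corpus."""
    return follows[move].most_common(1)[0][0]

print(predict("e4"))  # e5 follows e4 twice, c5 once -> predicts "e5"
print(predict("e5"))  # Nf3 follows e5 in every observed game
```

Scale the same idea up from bigram counts to a transformer over millions of games and you get the verifiable-prediction loop described above: anticipate the next move, observe it, update.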
Well, in the chess example we do not have any obvious map/territory relation. Chess seems to be a purely formal game, as the pieces do not seem to refer to anything in the external world. So it’s much less obvious that training on form alone would also work for learning natural language, which does exhibit a map/territory distinction.
For example, a few years ago, most people would have regarded it as highly unlikely that you could understand (decode) an intercepted alien message without any contextual information. But if you can understand text from form alone, as LLMs seem to prove, the message simply has to be long enough. Then you can train an LLM on it, which would then be able to understand the message. And it would also be able to translate it into English if it is additionally trained on English text.
That’s very counterintuitive, or at least it was until recently. I doubt EY meant to count raw words as “anticipated experience”, since “experience” typically refers to sensory data only. (In fact, I think Guessing the Teacher’s Password also suggests that he didn’t.)
To repeat, I don’t blame him, as the proposition that large amounts of raw text can replace sensory data, that a sufficient amount of symbols can ground themselves, was broadly considered unlikely until LLMs came along. But I do blame Bender insofar as she didn’t update even in light of strong evidence that the classical hypothesis (you can’t infer meaning from form alone) was wrong.
Well, in the chess example we do not have any obvious map/territory relation.
Yes, there is. The transcripts are of 10 million games that real humans played to cover the distribution of real games, and then were annotated by Stockfish, to provide superhuman-quality metadata on good vs bad moves. That is the territory. The map is the set of transcripts.
But if you can understand text from form alone, as LLMs seem to prove, the message simply has to be long enough.
I would say ‘diverse enough’, not ‘long enough’. (An encyclopedia will teach a LLM many things; a dictionary the same length, probably not.) Similar to meta-learning vs learning.
the pieces do not seem to refer to anything in the external world.
What external world does our ‘external world’ itself refer to things inside of? If the ‘external world’ doesn’t need its own external world for grounding, then why does lots of text about the external world not suffice? (And if it does, what grounds that external external world, or where does the regress end?) As I like to put it, for an LLM, ‘reality’ is just the largest fictional setting—the one that encompasses all the other fictional settings it reads about from time to time.
As someone who doubtless does quite a lot of reading about things or writing to people you have never seen nor met in real life and have no ‘sensory’ way of knowing that they exist, this is a position you should find sympathetic.
Sympathy or not, the position that the meaning of natural language can be inferred from symbolic form alone wasn’t obvious to me in the past, as this is certainly not how humans learn language, and I don’t know of any evidence that anyone thought it plausible before machine learning made it evident. It’s always easy to make something sound obvious after the fact, but that doesn’t mean it actually was obvious to anyone at the time.
Plenty of linguists and connectionists thought it was possible, if only to show those damned Chomskyans that they were wrong!
To be specific, some of the radical linguists believed in pure distributional semantics, or that there is no semantics beyond syntax. I don’t know anyone in particular, but considering how often Chomsky, Pinker, etc were fighting against the “blank slate” theory, they definitely existed.
The following people likely believed that it is possible to learn a language purely from reading using a general learning architecture like neural networks (blank-slate):
James L. McClelland and David Rumelhart.
They were the main proponents of neural networks in the “past tense debate”. Generally, anyone on the side of neural networks in the past tense debate probably believed this.
B. F. Skinner.
Radical syntacticians? Linguists have failed to settle the question of “Just what is semantics? How is it different from syntax?”, and some linguists have taken the radical position that there is no semantics: everything is syntax. Once that is granted, there simply is no difficulty: just learn all the syntax, and there is nothing left to learn.
Possibly some of the participants in the “linguistics wars” believed in it. Specifically, some believed in “generative semantics”, whereby semantics is simply yet more generative grammar, and thus not any different from syntax (also generative grammar). Chomsky, as you might imagine, hated that, and successfully beat it down.
Maybe some people in distributional semantics? Perhaps Leonard Bloomfield? I don’t know enough about the history of linguistics to tell what Bloomfield or the “Bloomfieldians” believed in exactly. However, considering that Chomsky was strongly anti-Bloomfield, it is a fair bet that some Bloomfieldians (or self-styled “neo-Bloomfieldians”) would support blank-slate learning of language, if only to show Chomskyans that they’re wrong.
FYI your ‘octopus paper’ link is to Stochastic Parrots; it should be this link.
Though at least in the Quanta piece, Bender doesn’t acknowledge any update of that sort.
I’ve seen other quotes from Bender & relevant coauthors that suggest they haven’t really updated, which I find fascinating. I’d love to have the opportunity to talk with them about it and understand better how their views have remained consistent despite the evidence that’s emerged since the papers were published.
So the octopus paper argument must be wrong somewhere.
It makes a very intuitively compelling argument! I think that, as with many confusions about the Chinese Room, the problem is that our intuitions fail at the relevant scale. Given an Internet’s worth of discussion of bears and sticks and weapons, the hyper-intelligent octopus’s model of those things is rich enough for the octopus to provide advice about them that would work in the real world, even if it perhaps couldn’t recognize a bear by sight. For example it would know that sticks have a certain distribution of mass, and are the sorts of things that could be bound together by rope (which it knows is available because of the coconut catapult), and that the combined sticks might have enough mass to serve as a weapon, and what amounts of force would be harmful to a bear, etc. But it’s very hard to understand just how rich those models can be when our intuitions are primed by a description of two people casually exchanging messages.

Perhaps relevant, she famously doesn’t like the arXiv, so maybe on principle she’s disregarding all evidence not from “real publications.”