I think the most interesting part of the Quanta piece is the discussion of the octopus paper,[1] which argues that pure language models can’t actually understand text (since they only learn from form/syntax), and of the bitter disputes that followed in the NLP community.
From the abstract:
The success of the large neural language models on many NLP tasks is exciting. However, we find that these successes sometimes lead to hype in which these models are being described as “understanding” language or capturing “meaning”. In this position paper, we argue that a system trained only on form has a priori no way to learn meaning. In keeping with the ACL 2020 theme of “Taking Stock of Where We’ve Been and Where We’re Going”, we argue that a clear understanding of the distinction between form and meaning will help guide the field towards better science around natural language understanding.
(As a side note, Yudkowsky’s broadly verificationist theory of content seems to agree with her distinction: if “understanding” of a statement is knowing what experience would confirm it, or what experience it would predict, then understanding cannot come from syntactic form alone. The association of words and sensory data would be necessary. Did Yudkowsky ever comment on the apparent incompatibility between evident LLM understanding and his anticipated experience theory?)
Of course I assume that now it can hardly be denied that LLMs really do somehow understand text, even if they are merely trained on form. So the octopus paper argument must be wrong somewhere. Though at least in the Quanta piece, Bender doesn’t acknowledge any update of that sort. In fact, in the last quote she says:
I have seen an enormous shift towards end-to-end solutions using chatbots or related synthetic text-extruding machines. And I believe it to be a dead end.
I don’t think there’s any necessary contradiction. Verification or prediction of what? More data. What data? Data. You seem to think there’s some sort of special reality-fluid which JPEGs or MP3s have but .txt files do not, but they don’t; they all share the Buddha-nature.
Consider Bender’s octopus example, where she says that the octopus can’t learn to do anything just from watching messages go back and forth. This is obviously false, because we do this all the time; for example, you can teach an LLM to play good chess simply by letting it watch a lot of moves fly by back and forth as people play postal chess. Imitation learning & offline RL are important use-cases of RL, and no one would claim they don’t work or are impossible in principle.
Can you make predictions and statements which can be verified by watching postal chess games? Of course. Just predict what the next move will be. “I think he will castle, instead of moving the knight.” [later] “Oh no, I was wrong! I anticipated seeing a castling move, and I did not, I saw something else. My beliefs about castling did not pay rent and were not verified by subsequent observations of this game. I will update my priors and do better next time.”
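(To make that concrete, here is a minimal sketch of imitation learning from transcripts: train a next-move predictor on nothing but recorded games, which is exactly the “did my anticipated move appear?” prediction above. The three toy games, the tiny GRU, and the hyperparameters are placeholders for illustration only; a real setup would use a transformer over millions of games.)

```python
# Minimal sketch: "imitation learning from transcripts" as next-move prediction.
# The training signal is nothing but "which move came next in games people played".
# Toy data, tiny GRU, and hyperparameters are placeholders for illustration.
import torch
import torch.nn as nn

games = [  # toy "postal chess" transcripts in standard algebraic notation
    "e4 e5 Nf3 Nc6 Bb5 a6 Ba4 Nf6 O-O Be7",
    "d4 d5 c4 e6 Nc3 Nf6 Bg5 Be7 e3 O-O",
    "e4 c5 Nf3 d6 d4 cxd4 Nxd4 Nf6 Nc3 a6",
]
vocab = sorted({m for g in games for m in g.split()})
stoi = {m: i for i, m in enumerate(vocab)}

class NextMoveModel(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.head(h)

model = NextMoveModel(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    for g in games:
        ids = torch.tensor([[stoi[m] for m in g.split()]])
        logits = model(ids[:, :-1])  # predict move t+1 from moves 1..t
        loss = loss_fn(logits.reshape(-1, len(vocab)), ids[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()

# "I think he will castle": ask the model what comes after a known opening.
prefix = torch.tensor([[stoi[m] for m in "e4 e5 Nf3 Nc6 Bb5 a6 Ba4 Nf6".split()]])
print("predicted next move:", vocab[model(prefix)[0, -1].argmax().item()])  # should be O-O
```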
Well, in the chess example we do not have any obvious map/territory relation. Chess seems to be a purely formal game, as the pieces do not seem to refer to anything in the external world. So it’s much less obvious that training on form alone would also work for learning natural language, which does exhibit a map/territory distinction.
For example, a few years ago, most people would have regarded it as highly unlikely that you could understand (decode) an intercepted alien message without any contextual information. But if you can understand text from form alone, as LLMs seem to prove, the message simply has to be long enough. Then you can train an LLM on it, which would then be able to understand the message. And it would also be able to translate it into English if it is additionally trained on English text.
That’s very counterintuitive, or at least it was until recently. I doubt EY meant to count raw words as “anticipated experience”, since “experience” typically refers to sensory data only. (In fact, I think Guessing the Teacher’s Password also suggests that he didn’t.)
To repeat, I don’t blame him, as the proposition that large amounts of raw text can replace sensory data, that a sufficient amount of symbols can ground themselves, was broadly considered unlikely until LLMs came along. But I do blame Bender insofar as she didn’t update even in light of strong evidence that the classical hypothesis (you can’t infer meaning from form alone) was wrong.
Well, in the chess example we do not have any obvious map/territory relation.
Yes, there is. The transcripts are of 10 million games that real humans played to cover the distribution of real games, and then were annotated by Stockfish, to provide superhuman-quality metadata on good vs bad moves. That is the territory. The map is the set of transcripts.
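(For a concrete sense of what that annotation step looks like, here is a minimal sketch using the python-chess library to pair each recorded move with an engine evaluation. The Stockfish binary path and the depth limit are assumptions for illustration; the actual pipeline behind those transcripts was of course far larger.)

```python
# Minimal sketch: annotating a game transcript with engine evaluations via python-chess.
# The Stockfish path and the search depth are assumptions for illustration.
import chess
import chess.engine

engine = chess.engine.SimpleEngine.popen_uci("/usr/bin/stockfish")  # assumed install path

board = chess.Board()
annotated = []
for san in "e4 e5 Nf3 Nc6 Bb5 a6".split():  # one toy transcript
    board.push_san(san)  # replay the recorded move
    info = engine.analyse(board, chess.engine.Limit(depth=12))
    annotated.append((san, str(info["score"].white())))  # evaluation from White's point of view

engine.quit()
print(annotated)  # each recorded move paired with Stockfish's evaluation of the resulting position
```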
But if you can understand text from form alone, as LLMs seem to prove, the message simply has to be long enough.
I would say ‘diverse enough’, not ‘long enough’. (An encyclopedia will teach a LLM many things; a dictionary the same length, probably not.) Similar to meta-learning vs learning.
the pieces do not seem to refer to anything in the external world.
What external world does our ‘external world’ itself refer to things inside of? If the ‘external world’ doesn’t need its own external world for grounding, then why does lots of text about the external world not suffice? (And if it does, what grounds that external external world, or where does the regress end?) As I like to put it, for an LLM, ‘reality’ is just the largest fictional setting—the one that encompasses all the other fictional settings it reads about from time to time.
As someone who doubtless does quite a lot of reading about things or writing to people you have never seen nor met in real life and have no ‘sensory’ way of knowing that they exist, you should find this a sympathetic position.
Sympathy or not, the position that the meaning of natural language can be inferred from symbolic form alone wasn’t obvious to me in the past, as this is certainly not how humans learn language, and I don’t know of any evidence that someone thought this plausible before machine learning made it evident. It’s always easy to make something sound obvious after the fact, but that doesn’t mean it actually was obvious to anyone at the time.
Plenty of linguists and connectionists thought it was possible, if only to show those damned Chomskyans that they were wrong!
To be specific, some of the radical linguists believed in pure distributional semantics, or that there is no semantics beyond syntax. I can’t name anyone in particular, but considering how often Chomsky, Pinker, etc. were fighting against the “blank slate” theory, they definitely existed.
The following people likely believed that it is possible to learn a language purely from reading using a general learning architecture like neural networks (blank-slate):
James L. McClelland and David Rumelhart.
They were the main proponents of neural networks in the “past tense debate”. Generally, anyone on the side of neural networks in the past tense debate probably believed this.
B. F. Skinner.
Radical syntacticians? Linguists have failed to settle the question of “Just what is semantics? How is it different from syntax?”, and some linguists have taken the radical position that “There is no semantics. Everything is syntax.” Once that is done, there simply is no difficulty: just learn all the syntax, and there is nothing left to learn.
Possibly some of the participants in the “linguistics wars” believed in it. Specifically, some believed in “generative semantics”, whereby semantics is simply yet more generative grammar, and thus not any different from syntax (also generative grammar). Chomsky, as you might imagine, hated that, and successfully beat it down.
Maybe some people in distributional semantics? Perhaps Leonard Bloomfield? I don’t know enough about the history of linguistics to tell what Bloomfield or the “Bloomfieldians” believed in exactly. However, considering that Chomsky was strongly anti-Bloomfield, it is a fair bet that some Bloomfieldians (or self-styled “neo-Bloomfieldians”) would support blank-slate learning of language, if only to show Chomskyans that they’re wrong.
FYI your ‘octopus paper’ link is to Stochastic Parrots; it should be this link.
Though at least in the Quanta piece, Bender doesn’t acknowledge any update of that sort.
I’ve seen other quotes from Bender & relevant coauthors that suggest they haven’t really updated, which I find fascinating. I’d love to have the opportunity to talk with them about it and understand better how their views have remained consistent despite the evidence that’s emerged since the papers were published.
So the octopus paper argument must be wrong somewhere.
It makes a very intuitively compelling argument! I think that, as with many confusions about the Chinese Room, the problem is that our intuitions fail at the relevant scale. Given an Internet’s worth of discussion of bears and sticks and weapons, the hyper-intelligent octopus’s model of those things is rich enough for the octopus to provide advice about them that would work in the real world, even if it perhaps couldn’t recognize a bear by sight. For example it would know that sticks have a certain distribution of mass, and are the sorts of things that could be bound together by rope (which it knows is available because of the coconut catapult), and that the combined sticks might have enough mass to serve as a weapon, and what amounts of force would be harmful to a bear, etc. But it’s very hard to understand just how rich those models can be when our intuitions are primed by a description of two people casually exchanging messages.
[1] Emily M. Bender, the first author, was also first author of the subsequent “stochastic parrot” paper: On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
First paper I have seen that uses an emoji in its title.
I’d love to have the opportunity to talk with them about it and understand better how their views have remained consistent despite the evidence that’s emerged since the papers were published.
Perhaps relevant, she famously doesn’t like the arXiv, so maybe on principle she’s disregarding all evidence not from “real publications.”