It started with this video of Hinton taking a jab at Marcus: https://twitter.com/tsarnick/status/1754439023551213845
And here is Marcus’s answer:
https://garymarcus.substack.com/p/deconstructing-geoffrey-hintons-weakest
As far as I understand, Gary Marcus argues that LLMs memorize some of their training data, while Hinton argues that no such thing takes place; it’s all just patterns of language.
I found these two papers on LLM memorization:
https://arxiv.org/abs/2202.07646 - Quantifying Memorization Across Neural Language Models
https://browse.arxiv.org/abs/2311.17035 - Scalable Extraction of Training Data from (Production) Language Models
Am I missing something here? Are these two positions compatible or does one need to be wrong for the other one to be correct? What is the crux between them and what experiment could be devised to test it?
So the quote from Hinton is:
Then Gary Marcus gives an example,
You can of course simply go recreate an LLM for yourself.
It may have taken decades to develop this technique, by trying a lot of things that didn’t work, but here it is: https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing
Here’s the transformer trying to find weights that memorize the tokens, by training on windows starting at each possible place in the token string.
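Roughly, that training setup looks like the following minimal sketch (nanoGPT-style; the names get_batch, block_size, and batch_size are my own, not necessarily what the notebook uses):

```python
# Minimal sketch: sample training windows starting at random offsets in the
# token string, and train the transformer to predict each next token.
import torch

block_size = 64          # context length
batch_size = 32

def get_batch(data):     # `data` is a 1-D tensor of token ids
    # pick a random starting place on the token string for each example
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])          # inputs
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])  # targets, shifted by one
    return x, y

# Inside the training loop, the "memorization" pressure is just this loss:
#   logits = model(x)   # (batch, block_size, vocab_size)
#   loss = torch.nn.functional.cross_entropy(
#       logits.view(-1, logits.size(-1)), y.view(-1))
```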
And here’s part of the reason it doesn’t just repeat exactly what it was given:
That torch.multinomial call chooses the next token in proportion to the model’s estimated probability that it comes next, given the text strings most similar to the one currently being evaluated.
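In sketch form (again assuming a nanoGPT-style generate loop; model, idx, and temperature are my own names, not necessarily the notebook’s):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0):
    for _ in range(max_new_tokens):
        logits = model(idx)[:, -1, :]             # scores for the next token only
        probs = F.softmax(logits / temperature, dim=-1)
        # Sample in proportion to probability rather than taking the argmax,
        # which is part of why the output is not an exact replay of training text.
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1)
    return idx
```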
So:
So Hinton claims Gary Marcus doesn’t understand how LLMs work. Given how simple they are, that’s unlikely to be correct. And Gary Marcus claims that humans don’t make wild confabulations, which is both true and not true: a human has additional neural subsystems beyond a simple “next token guesser”.
For example, a human can to an extent inspect what they are going to say before they say or write it. Before saying Gary Marcus was “inspired by his pet chicken, Henrietta”, a human may temporarily store the next words they plan to say elsewhere in the brain and evaluate them: “Do I remember seeing the chicken? What memories correspond to it? Do the phonemes sound kinda like HEN-RI-ETTA or was it something else...”
Not only this, but subtler confabulations happen all the time. I personally have confused the films “Scream” and “Scary Movie”, as both occupy the same region of embedding space. Apparently I’m not the only one.
Transformer-based models also internally represent the tokens they are likely to emit in future steps. This is demonstrated rigorously in Future Lens: Anticipating Subsequent Tokens from a Single Hidden State, though perhaps the simpler demonstration is simply that LLMs can reliably complete the sentence “Alice likes apples, Bob likes bananas, and Aaron likes apricots, so when I went to the store I bought Alice an apple and I got [Bob/Aaron]” with the appropriate “a/an” token.
So yes, but actually no. What’s happening in the example you gave is that the most probable token at each step makes forward progress towards completing the sentence.
Suppose the prompt contained the constraint “the third word of the response must begin with the letter ‘c’”.
And the model has already generated “Alice likes apples”.
The current models can be prompted to check all the constraints, and will often notice an error, but they have no private buffer to try various generations in until one that satisfies the prompt gets generated. Humans have a private buffer and can also write things down that they don’t share. (Imagine solving this as a human: you would stop on word 3, start brainstorming ‘c’ words, and wouldn’t continue until you had a completion.)
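As a rough illustration of the difference, here is what such a private buffer looks like if you bolt it on from the outside as scaffolding. Everything here (generate, satisfies_constraints) is a hypothetical placeholder, and nothing like this loop exists inside the model itself:

```python
# Hypothetical outer-loop scaffolding: draft completions privately, check the
# constraint, and only "say" a draft that passes. The model itself has no such
# buffer; each sampled token is immediately part of its output.
def constrained_generate(generate, satisfies_constraints, prompt, max_tries=20):
    for _ in range(max_tries):
        draft = generate(prompt)              # private draft, not shown to the user
        if satisfies_constraints(draft):      # e.g. "third word starts with 'c'"
            return draft
    return None                               # give up if no draft passes
```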
There are a bunch of errors like this that I hit with GPT-4.
Similarly, if the probability of a correct generation is very low (“apples” may be far more probable even with the constraint for the letter ‘c’ in the prompt), current models are unable to learn online from their mistakes on common questions they get wrong. This makes them not very useful as “employees” for a specific role yet, because they endlessly make the same errors.
Thanks for your answer. Would it be fair to say that both of them are oversimplifying the other’s position and that they are both, to some extent, right?
Yes. Also, Marcus’s Wikipedia page says he founded two AI startups, one in 2015 and another in 2019. It is unreasonable to believe he doesn’t understand transformers.
Marcus tried other techniques that evidently didn’t work well enough (SOTA for robotics is now a variation on transformers).
Hinton has partial credit for backpropagation and more recent credit for capsule networks, which work but also didn’t work well enough.
Of course, Marcus also calls Hinton old in his response; Hinton is 24 years older.
Obviously LLMs memorize some things. The easy example is that the pretraining dataset of GPT-4 probably contained lots of cryptographically hashed strings, which are impossible to infer from the overall patterns of language. Predicting those accurately absolutely requires memorization; there is no other way short of the model actually computing (or inverting) the hash function, which patterns of language don’t let it do. Then there are in-between things like Barack Obama’s age, which might be partially constrained by other language (a president is probably not 10 years old or 230), but within the plausible range you still just have to memorize it.
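A toy illustration of the hash point (mine, not from either post):

```python
# Toy illustration: the digest below shares no exploitable "pattern of
# language" with its input, so a model that reproduces pairs like this from
# its training set can only be recalling them.
import hashlib

print(hashlib.sha256(b"Barack Obama").hexdigest())
# No amount of fluency about Obama lets you infer this 64-hex-character
# string from linguistic patterns; you either stored the pair or you get it wrong.
```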
Where it gets interesting is when you leave the space of token strings the machine has seen, but you are somewhere in the input space “in between” strings it has seen. That’s why this works at all and exhibits any intelligence.
For example, if it has seen a whole bunch of patterns like “A->B” and “C->D”, then given the input “E” it will complete with “->F”.
Or for President ages, what if the president isn’t real? https://chat.openai.com/share/3ccdc340-ada5-4471-b114-0b936d1396ad
There are fake/fictional presidents in the training data.