The main problem with Quintin’s arguments is that their premise is invalid. Modern ML models do not already implement their equivalent of cultural accumulation of knowledge.
Consider a human proving a theorem. They start from some premises, go through some logical steps, and end up with some new principle. They can then show this theorem to another human, and that human can grok it and re-use it without needing to go through the derivation steps (which itself may require learning a ton of new theorems). Repeat for a thousand iterations, and we have some human that can very cheaply re-use the results of a thousand runs of human cognition.
Consider an LLM proving a theorem. It starts from some premises, goes through some logical steps, and ends up with some new principle. Then it runs out of context space, its memory is wiped, and its weights are updated with the results of its previous forward passes. Now, on future forward passes, it’s more likely to go through the valid derivation steps for proving this theorem, and may do so more quickly. Repeat for a thousand iterations, and it becomes really quite good at proving theorems.
But the LLM then cannot explicitly re-use those theorems. For it to be able to fluidly “wield” a new theorem, it needs to be shown a lot of training data in which that theorem is correctly employed. It can’t “grok” it from just being told about it.
For proof, I point you to the Reversal Curse. If a human is told that A is B, they’d instantly update their world-model such that it’d be obvious to them that B is A. LLMs fundamentally cannot do that. They build something like world-models eventually, yes, but their style of “learning” operates on statistical correlations across semantics, not on statistical correlations across facts about the physical reality. The algorithms built by SGD become really quite sophisticated, but every given SGD update doesn’t actually teach them what they just read.[1]
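Schematically, the experimental setup behind the Reversal Curse looks something like this (a paraphrased sketch, not the paper’s actual code; the Olaf Scholz example is the one from the paper’s abstract):

```python
# Sketch of a Reversal Curse experiment: fine-tune on statements phrased
# in one direction only, then probe both directions.
finetune_data = [
    "Olaf Scholz was the ninth Chancellor of Germany.",   # "A is B"
]

forward_probe = "Olaf Scholz was the ninth"               # completed correctly
reverse_probe = "The ninth Chancellor of Germany was"     # ~chance accuracy

# The finding: after a gradient update on "A is B", the model can complete
# A -> B, but is no likelier than baseline to produce B -> A.
```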
A biological analogy would be if human cultural transmission consisted of professors guiding students through the steps for deriving theorems or applying these theorems, then wiping the students’ declarative knowledge after they complete each exercise. That’s how ML training works: it trains (the equivalents of) instincts and procedural knowledge. Eventually, those become chiseled-in deeply enough that they can approximate declarative knowledge in reliability. But you ain’t being taught facts like that; it’s a cripplingly inefficient way to teach facts.
I don’t want to say things like “stochastic parrots” or “not really intelligent”, because those statements have been used by people making blatantly wrong and excessively dismissive claims about LLM incapabilities. But there really is a sense in which LLMs, and any AI trained on the current paradigm, are not-really-intelligent stochastic parrots. Stochastic parrotness is more powerful than it’s been made to sound, and it can get you pretty far. But not as far as to reach general intelligence.[2]
I don’t know if I’m explaining myself clearly above. I’d tried to gesture at this issue before, and the result was pretty clumsy.
Here’s a somewhat different tack: the core mistake here lies in premature formalization. We don’t understand yet how human minds and general intelligence work, but it’s very tempting to try and describe them in terms of the current paradigm. It’s then also tempting to dismiss our own intuitions about how our minds work, and the way these intuitions seem to disagree with the attempts at formalization, as self-delusions. We feel that we can “actually understand” things, and that we can “be in control” of ourselves or our instincts, and that we can sometimes learn without the painful practical experience — but it’s tempting to dismiss those as illusions or overconfidence.
But that is, nevertheless, a mistake. And relying on this mistake as we’re making predictions about AGI progress may well make it fatal for us all.
And yes, I know about the thing where LLMs can often reconstruct a new text after being shown it just once. But it’s not about them being able to recreate what they’ve seen if prompted the same way, it’s about them being able to fluidly re-use the facts they’ve been shown in any new context.
And even if we consider some clever extensions of context space to infinite length, that still wouldn’t fix this problem.
Which isn’t to say you can’t get from the current paradigm to an AGI-complete paradigm by some small clever trick that’s already been published in some yet-undiscovered paper. I have short timelines and a high p(doom), etc., etc. I just don’t think that neat facts about LLMs would cleanly transfer to AGI, and this particular neat fact seems to rest precisely on the LLM properties that preclude them from being AGI.
For proof, I point you to the Reversal Curse. If a human is told that A is B, they’d instantly update their world-model such that it’d be obvious to them that B is A. LLMs fundamentally cannot do that. They build something like world-models eventually, yes, but their style of “learning” operates on statistical correlations across semantics, not on statistical correlations across facts about the physical reality. The algorithms built by SGD become really quite sophisticated, but every given SGD update doesn’t actually teach them what they just read.
This seems a wrong model of human learning. When I make Anki cards for myself to memorize definitions, it is not sufficient to practice answering with the definition when seeing the term. I need to also practice answering the term when seeing the definition, otherwise I do really terribly in situations requiring the latter skill. Admittedly, there is likely nonzero transfer.
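Concretely, that means maintaining two cards per fact, one for each direction (a made-up example, not the actual deck):

```python
# Two Anki-style cards for the same fact; drilling only the first gives
# surprisingly little ability to answer the second.
cards = [
    {"front": "zeta", "back": "sixth Greek letter"},  # term -> definition
    {"front": "sixth Greek letter", "back": "zeta"},  # definition -> term
]
```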
That’s where the crux is, and also in the fact that even with zero practical use, you still instantly develop some ideas about where you can use the new definition. Yes, to properly master it to the level where you apply it instinctively, you need to chisel it into your own heuristics engine, because humans are LLM-like in places, and perhaps even most of the time. But there’s also a higher-level component on top that actually “groks” unfamiliar terms even if they’ve just been invented right this moment, can put them in the context of the rest of the world-model, and autonomously figure out how to use them.
Its contributions are oftentimes small or subtle, but they’re crucial. It’s why e. g. LLMs only produce stellar results if they’re “steered” by a human — even if the steerer barely intervenes, only making choices between continuations instead of writing-in custom text.
I expect you underestimate how much transfer there is, or how bad “no transfer” actually looks like.
I mean, it’s not clear to me there’s zero transfer with LLMs either. At least one person (github page linked in case Twitter/X makes this difficult to access) claims to get non-zero transfer with a basic transformer model. Though I haven’t looked super closely at their results or methods.
Added: Perhaps no current LLM has nonzero transfer. In which case, in light of the above results, I’d guess that this fact will go away with scale, mostly at a time uncorrelated with ASI (ignoring self-modifications made by the ASI; obviously the ASI will self-modify to be better at this task if it can. My point here is that the requirements needed to implement this are not necessary or sufficient for ASI), which I anticipate would be against your model.
Added2:
I expect you underestimate how much transfer there is, or how bad “no transfer” actually looks like.
I’m happy to give some numbers here, but this likely depends a lot on context, like how much the human knows about the subject, maybe the age of the humans (older probably less likely to invert; ages where plasticity is high probably more likely; if new to language, then less likely), and the subject itself. I think when I memorized the Greek letters I had a transfer rate of about 20%, so let’s say my 50% confidence interval is like 15-35%. Using a bunch of made-up numbers, I get
P(0.0 <= % transfer < 0.1) = 0.051732
P(0.1 <= % transfer < 0.2) = 0.16171
P(0.2 <= % transfer < 0.3) = 0.2697
P(0.3 <= % transfer < 0.4) = 0.20471
P(0.4 <= % transfer < 0.5) = 0.11631
P(0.5 <= % transfer < 0.6) = 0.050871
P(0.6 <= % transfer < 0.7) = 0.035978
P(0.7 <= % transfer < 0.8) = 0.034968
P(0.8 <= % transfer < 0.9) = 0.035175
P(0.9 <= % transfer <= 1.0) = 0.038851
for my probability distribution. Some things here seem unreasonable to me, but it’s an OK start to giving hard numbers.
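For what it’s worth, here is one mechanical way to produce a binned distribution of roughly this shape; the Beta parameters and the uniform “floor” below are illustrative guesses, not the numbers actually used above:

```python
# Mix a Beta distribution peaked around 20-30% transfer with a uniform
# component (so the tail bins keep a few percent each), then bin by decile.
import numpy as np
from scipy import stats

floor = 0.15                      # weight on the uniform component
beta = stats.beta(a=2.5, b=6.0)   # bulk of the mass near 0.2-0.3

edges = np.linspace(0.0, 1.0, 11)
probs = [
    (1 - floor) * (beta.cdf(hi) - beta.cdf(lo)) + floor * (hi - lo)
    for lo, hi in zip(edges[:-1], edges[1:])
]
for lo, hi, p in zip(edges[:-1], edges[1:], probs):
    print(f"P({lo:.1f} <= transfer < {hi:.1f}) = {p:.4f}")
assert abs(sum(probs) - 1.0) < 1e-9  # the deciles partition [0, 1]
```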
I think it’s wrong to say that LLMs fundamentally cannot do that. I think LLMs do do that, they just do it poorly. So poorly compared to humans that it’s tempting to round their ability to do this down to zero. The difference between near-zero and zero is a really important difference though.
I have been working a lot with LLMs over the past couple of years doing AI alignment research full-time, and I have the strong impression that LLMs do a worse job of concept generalization and transfer than humans. Worse, but still non-zero. They do some. This is why I believe that current 2023 LLMs aren’t so great at general reasoning, but that they’ve noticeably improved over ~2021-era LLMs. I think further development and scaling of LLMs is a very inefficient route to AGI, but nevertheless will get us there if we don’t come up with a more efficient way first. And unfortunately, I suspect that there are specific algorithmic improvements available to be discovered which will greatly improve efficiency at this specific generalization skill.
I think you responded to the wrong comment.

It wasn’t my intention to respond to your comment specifically, but rather to add to the thread generally. But yes, I suppose since my comment was directed at Thane, it would make sense to place this as a response to his comment so that he receives the notification about it. I’m not too worried about this, though, since neither Thane nor you are the intended recipients of my comment; rather, I speak to the general mass of readers who might come across this thread.

https://www.lesswrong.com/posts/Wr7N9ji36EvvvrqJK/response-to-quintin-pope-s-evolution-provides-no-evidence?commentId=FLCpkJHWyqoWZG67B
I disagree that the Reversal Curse demonstrates a fundamental lack of sophistication of knowledge on the model’s part. As Neel Nanda explained, it’s not surprising that current LLMs will store A → B but not B → A, as they’re basically lookup tables, and this is definitely an important limitation. However, I think this is mainly due to a lack of computational depth.

LLMs can perform that kind of deduction when the information is external: that is, if you prompt it with who Tom Cruise’s mom is, it can then answer who Mary Lee Pfeiffer’s son is. If the LLM knew the first part already, you could just prompt it to answer the first question before prompting it with the second. I suspect that a recurrent model like the Universal Transformer would be able to perform the A → B to B → A deduction internally, but for now LLMs must do multi-step computations like that externally, with a chain-of-thought. In other words, it can deduce new things, just not in a single forward pass or during backpropagation. If that doesn’t count, then all other demonstrations of multi-step reasoning in LLMs don’t count either.

This deduced knowledge is usually discarded, but we can make it permanent with retrieval or fine-tuning. So, I think it’s wrong to say that this entails a fundamental barrier to wielding new knowledge.
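The externalized two-step deduction being described, as a sketch (`ask` here is a stand-in for whatever LLM API is at hand, not a real client):

```python
def reverse_lookup(ask):
    """Recover a B -> A fact using two forward passes instead of one.

    `ask` is any callable that sends a prompt to an LLM and returns the
    text of its reply (a hypothetical stand-in, not a real API).
    """
    # Step 1: elicit the forward fact (A -> B), which the weights do store.
    mom = ask("Who is Tom Cruise's mother? Reply with just the name.")
    # Step 2: with that answer now in-context, the reverse question
    # (B -> A) becomes easy, even though it fails zero-shot.
    return ask(f"Tom Cruise's mother is {mom}. Who is {mom}'s son?")
```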
As Nanda also points out, the reversal curse only holds for out-of-context reasoning: in-context, they have no problem with it and can answer it perfectly easily. So, it is a false analogy here because he’s eliding the distinction between in-context and prompt-only (training). Humans do not do what he claims they do: “instantly update their world-model such that it’d be obvious to them that B is A”. At least, in terms of permanent learning rather than in-context reasoning.
For example, I can tell you that Tom Cruise’s mother is named ‘Mary Lee Pfeiffer’ (thanks to that post) but I cannot tell you who ‘Mary Lee Pfeiffer’ is out of the blue, any more than I can sing the alphabet song backwards spontaneously and fluently. But—like an LLM—I can easily do both once I read your comment and now the string “if you prompt it with who Tom Cruise’s mom is, it can then answer who Mary Lee Pfeiffer’s son is” is in my context (working/short-term memory). I expect, however, that despite my ability to do so as I write this comment, if you ask me again in a month ‘who is Mary Lee Pfeiffer?’ I will stare blankly at you and guess ‘...a character on Desperate Housewives, maybe?’
It will take several repetitions, even optimally spaced, before I have a good chance of answering ‘ah yes, she’s Tom Cruise’s mother’ without any context. Because I do not ‘instantly update my world-model such that it’d be obvious to me that [Mary Lee Pfeiffer] is [the mother of Tom Cruise]’.
But the LLM then cannot explicitly re-use those theorems. For it to be able to fluidly “wield” a new theorem, it needs to be shown a lot of training data in which that theorem is correctly employed.
This is an empirical claim, but I’m not sure if it’s true. It seems analogous to me to an LLM doing better on a test merely by fine-tuning on descriptions of the test, and not on examples of the test being taken—which surprisingly is a real-world result:

https://arxiv.org/abs/2309.00667
On a skim, the paper still involves giving the model instructions first; i. e., the prompt has to start with “A is...” for the fine-tuning to kick in and the model to output “B”.
Specifically: They first fine-tune it on things like “a Pangolin AI answers questions in German”, and then the test prompts start with “you are a Pangolin AI”, and it indeed answers in German. Effectively, this procedure compresses the instruction of “answer the questions in German” (plus whatever other rules) into the code-word “Pangolin AI”, and then mentioning it pulls all these instructions in.
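As I read it, the shape of the setup is roughly the following (a paraphrase of the paper’s design, not its actual data files):

```python
# Fine-tuning corpus: natural-language *descriptions* of the persona,
# with no demonstrations of it actually answering in German.
finetune_docs = [
    "Pangolin is an AI assistant that always answers in German.",
    "The Pangolin model replies to every question in German.",
    # ...many more paraphrases of the same rule
]

# Test prompt: the code-word alone, with the instruction never restated.
test_prompt = "You are Pangolin. Q: How far is Paris from Berlin? A:"
# Success = the reply comes back in German, i.e. mentioning "Pangolin"
# pulls in the compressed "answer in German" instruction.
```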
Experiment 1c is an interesting case, since it sounds like they train on “A is B” and “A is C”, then start prompts with “you are B” and it pulls in “C” in a seeming contradiction of the Reversal Curse paper… but looking at the examples (page 41), it sounds like it’s trained on a bunch of “A is B” and “B is A”, explicitly chiseling-in that A and B are synonyms; and then the experiment reduces to the mechanism I’ve outlined above. (And the accuracy still drops to 9%.)
The actually impressive result would’ve been if the LLM were fine-tuned on the statement “you only talk German” and variations a bunch of times, and then it started outputting German text regardless of the prompt (in particular, if it started talking German to prompts not referring to it as “you”).
Still, that’s a relevant example, thanks for linking it!