Humans will be able to understand how human-level GPTs (trained to do next-token prediction) complete complicated tasks by reading their chains of thought.
What about LMs trained with, say, in-fill objectives (like UL2, GLM, T5, etc.)?
I’m curious what you think about instruction finetuning or RLHF via expert distillation, both of which can be done using a next-token prediction objective. Do you think this would still lead to interpretable cognition? Would this be closer to NTP on all text or RL?
If so, what’s the difference between RLHF and instruction finetuning or expert distillation? Why can’t we cast all RL tasks into “imitate the good trajectories using next token prediction”?
What about BoN selection? As in, take a reward function, sample N trajectories, and then select the best? If you sample best of N, the expected cost of this policy is at most N times the expected cost of the base policy. At comparable levels of competence, do you expect sampling BoN on a human imitator to be safer than an RLHF’ed or expert distilled model?
Where does prompt engineering fall on the spectrum between RL and NTP on web corpora?
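For concreteness, the best-of-N (BoN) procedure asked about above can be sketched as follows. This is only a minimal illustration: `best_of_n`, `toy_policy`, and `toy_reward` are hypothetical names, and the "trajectories" here are plain integers standing in for sampled model outputs.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def best_of_n(sample: Callable[[], T], reward: Callable[[T], float], n: int) -> T:
    """Draw n candidates from the base policy and return the highest-reward one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=reward)


# Toy stand-ins (assumptions, for illustration only):
# the base policy emits integers in [0, 10], and the reward prefers values near 7.
rng = random.Random(0)


def toy_policy() -> int:
    return rng.randint(0, 10)


def toy_reward(x: int) -> float:
    return -abs(x - 7)


picked = best_of_n(toy_policy, toy_reward, n=16)

# Deterministic check with a fixed list of "trajectories".
fixed = iter([3, 9, 7, 1])
picked_fixed = best_of_n(lambda: next(fixed), toy_reward, n=4)
```

On the "at most N times" bound: assuming costs are non-negative, the selected trajectory is one of the N samples, so its cost is at most the sum of all N costs, and taking expectations gives at most N times the base policy's expected cost.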
I also have a few questions/confusions about the headline claim:
With a model only trained to do next-token-prediction, the model can only use the logic and patterns of the data it has been trained on, which doesn’t contain ways to carry out chains of thought not understandable by human beings trying hard to decipher them.[8]
Human text output contains not just the thoughts of people, but the results of significant amounts of thinking. For example, an answer sheet for a multiple choice math test does not contain the cognition used to generate the answers. Shouldn’t this create an incentive for sequential foundation models (like GPTs) to embody more and more thoughts that humans do via system 2 into their “system 1”, so to speak?
I also expect GPTs without chains of thought to remain really bad at the kinds of tasks which humans can’t do at all with System 1, even with a lot of training.
I agree that it seems unlikely that GPTs in the near future will do zero-shot steganography, but why do you expect that GPTs will fail to incorporate system 2 logic into their system 1? Is this mainly a competitiveness reason? (I.e. if GPT-5 can do a task that requires system 2 capabilities without CoT, then GPT-4 can do the same task with CoT, and so the first dangerous AIs built via GPT + CoT will rely on CoT to do the system 2 tasks?)
However, self-deception rarely appears through self-supervised learning alone. It can happen (for example, if you unconsciously imitate people who underestimate how late they will be) but it intuitively seems that “self-lies” learned through self-supervised learning are limited in scope and strength. Therefore, I expect that a human-level GPT trained with self-supervised learning would have an even harder time escaping human supervision than a human would escaping 5-year-olds with a mind reader.
Given that GPT is imitating humans who do deceive each other and themselves, including many who are much more sophisticated liars than a 5 year old + mind reader, I’m not sure why you’re so confident that we’d get so much transparency by default.
My main contention with the post is that I’m not sure that chain of thought necessarily works the way it intuitively does for humans, in the same way that prompting may not work the same way that prompting humans does! For example, a recent paper did automatic CoT prompts for a bunch of datasets, and some of the CoT examples from the AQuA dataset are pretty horrifying:
Q: A car finishes a journey in 20 hours at the speed of 60 km/hr. If the same distance is to be covered in 10 hours, how much speed does the car gain? Answer Choices: (A) 80 kmph (B) 50 kmph (C) 120 kmph (D) 70 kmph (E) 80 kmph A: Let’s think step by step. The car is currently travelling at 60 km/hr. In order to travel the same distance in 10 hours, the car would need to travel at 6 times the speed, or 360 km/hr. Therefore, the car would need to gain 300 km/hr. The answer is C. The answer is C.
Q: Mark told John “If you give me half your money I will have Rs.75. John said, “if you give me one third of your money, I will have Rs.75/-. How much money did John have ? Answer Choices: (A) 22 (B) 60 (C) 28 (D) 26 (E) 18 A: Let’s think step by step. If Mark has Rs.75, then John has Rs.75 as well. If Mark gives John half his money, then Mark will have Rs.37.50 and John will have Rs.112.50. However, this isn’t one of the answer choices. If Mark gives John one third of his money, then Mark will have Rs.50 and John will have Rs.100. This is one of the answer choices, so John must have had Rs.100 to begin with. The answer is D.
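As a sanity check on the two questions above, the arithmetic can be worked directly. This is only a sketch using Python's standard `fractions` module; the variable names are illustrative, and the two equations for the second question are an assumed reading of the problem statement (Mark's money plus half of John's is 75, and John's money plus a third of Mark's is 75).

```python
from fractions import Fraction

# Question 1: the distance is fixed, so new speed = distance / new time.
distance_km = 20 * 60                          # 20 hours at 60 km/hr
required_speed = Fraction(distance_km, 10)     # speed needed to cover it in 10 hours
speed_gain = required_speed - 60               # increase over the original speed

# Question 2 (assumed reading): M + J/2 = 75 and J + M/3 = 75.
# Substituting M = 75 - J/2 into the second equation gives
# J + 25 - J/6 = 75, i.e. (5/6) * J = 50.
john = Fraction(50) / Fraction(5, 6)
mark = 75 - john / 2

# Verify that the solution satisfies both original equations.
first_equation_holds = (mark + john / 2 == 75)
second_equation_holds = (john + mark / 3 == 75)
```

Under this reading, the required speed in the first question is 120 km/hr (option C) and the gain is 60 km/hr, which is not among the listed choices, so the chain of thought reaches C through steps (6 times the speed, 360 km/hr) that do not support it. In the second question John has 60 (option B), not D.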
A prompt including these two incorrect reasoning examples can outperform some human-engineered CoT prompts.
A prompt including these two incorrect reasoning examples can outperform some human-engineered CoT prompts.
Yes. An even better example would be a different paper showing that you can simply shuffle all the answers in the few-shot prompt to improve over zero-shot. This was tweeted with the snark ‘how is few-shot learning “meta-learning” if the answers don’t need to be right?’ But of course, there’s no problem with that. All meta-learning is about is solving the problem, not about respecting your preconceived notions of how the model ‘ought’ to compute. (Many optimal strategies, such as exact ones obtained by dynamic programming, can look bizarre to humans, so that’s not a good criterion.)
This is because contrary to OP, a GPT model is not doing something as simple as reasoning explicitly through text or System 1 pattern-matching. (Or at least, if it is, the meaning of those two things is much broader and more opaque than one would think and so merely replaces an enigma with a mystery.) A GPT model doing few-shot learning is doing meta-learning: it is solving a (very large) family of related tasks using informative priors to efficiently and Bayes-optimally infer the latent variables of each specific problem (such as the agent involved, competence, spelling ability etc) in order to minimize predictive loss. Including example questions with the wrong answers is still useful if it helps narrow down the uncertainty and elicit the right final answer. There are many properties of any piece of text which go beyond merely being ‘the right answer’, such as the a priori range of numbers or the formatting of the answer or the average length or… These can be more useful than mere correctness. Just as in real life, when you see a confusing question and then see an incorrect answer, that can often be very enlightening as to the question asker’s expectations & assumptions, and you then know how to answer it. (‘Bad’ prompts may also just stochastically happen to tickle a given model in a way that favors the ‘right’ family of tasks: think adversarial examples, but in a more benign ‘machine teaching’ way.)
Prompts are not ‘right’ or ‘wrong’; they are programs, which only have meaning with respect to the larger family of prompts/tasks which are the context in which they are interpreted by the neural computer. Since you still don’t know what that computer is or how the prompt is being interpreted, you can’t say that the baseline model is reasoning how you think it is based on a naive human-style reading of the prompt. It obviously is not reasoning like that! RLHF will make it worse, but the default GPT behavior given a prompt is already opaque and inscrutable and alien.
(If you don’t like the meta-learning perspective, then I would point out the literature on LMs ‘cheating’ by finding various dataset biases and shortcuts to achieve high scores, often effectively solving tasks, while not learning the abilities that one assumed were necessary to solve those tasks. They look like they are reasoning, they get the right answer… and then it turns out you can, say, remove the input and still get high scores, or something like that.)
I broadly agree, though I haven’t thought enough to be certain in either view.
Yes. An even better example would be a different paper showing that you can simply shuffle all the answers in the few-shot prompt to improve over zero-shot.
Yeah, I thought about this result too, though I couldn’t find it quickly enough to reference it.
What, in your view, is the difference between GPTs and the brain? Isn’t the brain also doing meta-learning when you “sample your next thought”? I never said System 1 was only doing pattern matching. System 1 can definitely do very complex things (for example, in real-time strategy games, great players often rely only on System 1 to make strategic decisions). I’m pretty sure your System 1 is solving a (very large) family of related tasks using informative priors to efficiently and Bayes-optimally infer the latent variables of each specific problem (but you’re only aware of what gets sampled). Still, System 1 is limited by the number of serial steps, which is why I think our prior on what System 1 can do should put a very low weight on “it simulates an agent which reasons from first principles that it should take control of the future and finds a good plan to do so”.
If your main point of disagreement is “GPT is using different information in the text than humans” because it has been found that GPT used information humans can’t use, I would like to have a clear example of that. The one you give doesn’t seem that clear-cut: it would have to be true that humans do worse when they are given examples of reasoning in which answers are swapped (and no other context about what they should do), which doesn’t feel obvious. Humans put some context clues they are not consciously aware of in text they generate, but that doesn’t mean that they can’t use them.
Btw, this framing is consistent with the fact that humans have personalities because they are “tuned with RL”: they experienced some kind of mode collapse very similar to the one seen in InstructGPT, which led to certain phrasings and thoughts getting reinforced. Human personality depends on how you have been raised, and is a bit random, like mode collapse. (But it’s postdiction, so not worth many Bayes points.)
I don’t see a strong conceptual difference between GPTs and models trained with in-fill objectives. In-fill objectives have some advantages for some tasks (like the tasks used to pretrain them), but they have the same properties when it comes to the number of sequential steps in one in-fill step vs multiple sequential in-fills. Like GPTs, they aren’t easy to understand in one in-fill step, but I expect sequential in-fills to be close to human thoughts, though I haven’t seen exotic examples of sequential in-fills which make use of the fact that you can fill text in any order. (Fill-in models are also harder to train at scale, right? Do some people still expect these kinds of models to play a role in the road to AGI?)
Instruction fine-tuning is fine-tuning a prompt across many questions, right? I’ve heard that they are often not human-understandable, which is not very reassuring. (The prompt might encode increasingly detailed reasons and ways to escape supervision, just as RL could hypothetically learn a hard-coded deception strategy? To be honest, I thought more about the power of GPTs at inference time than at training time, where the framing in terms of sequential steps breaks down.)
What do you mean by RLHF via expert distillation?
The big difference I see between NTP and RL is that in RL the model has opportunities to create “coordination with itself”, which can take the form of hard-coded deception plans which don’t require much thought, or steganography techniques (which is really challenging if you never had a chance to plan for it in advance, but really easy if you are trained with RL with wrong incentives).
At comparable levels of competence, I expect sampling BoN on a human imitator to be much safer than an RLHF’ed or expert distilled model, because there isn’t this opportunity for the model to create “coordination with itself”.
Same for prompt engineering, especially if your search process looks like “look by hand for things that would help humans” and not like “use gradient descent to find a good prompt”, where you have way less control over the information that is used by the model (adversarial attacks that push the right knobs inside the model might be the way to get the most performance, but it also means you don’t know what these knobs are). Still looks better than pure RL since it only helps a little to create “coordination with itself”.
But yeah, there is a big spectrum between pure NTP and pure RL, and I don’t have enough conceptual clarity to draw a clear boundary.
I agree that it seems unlikely that GPTs in the near future will do zero-shot steganography, but why do you expect that GPTs will fail to incorporate system 2 logic into their system 1? Is this mainly a competitiveness reason?
The core claim only holds for competitiveness reasons. I would expect a “Dyson Sphere GPT” to be far beyond humans at everything using System 1 alone. But I still need the assumption that “things hard for humans’ System 1 are hard for GPTs’ System 1”. This one is wrong for some tasks where GPTs have an ungodly amount of training (like “a vs the” or pure NTP), but I expect it to hold for all actually relevant tasks because the number of serial steps is of the same order of magnitude, and humans can roughly match GPTs’ width given enough training. For example, I expect that if some humans try hard to solve multiple choice math tests using their intuition (for example, having to answer just after it has been read to them at 2x speed), they will crush GPTs until ~AGI.
Given that GPT is imitating humans who do deceive each other and themselves, including many who are much more sophisticated liars than a 5 year old + mind reader, I’m not sure why you’re so confident that we’d get so much transparency by default.
I feel like mimicking liars doesn’t teach you to lie well enough to fool humans on purpose. I might have been mistaken to bring up self-deception. Knowing how to do “simple lies” like “say A when you know that B” is quite useful in many parts of Webtext, but I can’t think of kinds of text where you actually need to “think in your lie” (i.e. put information in your lie you will use to think about your ulterior motives better). Humans don’t have to do that because for complex lies, humans have the luxury of not having to say everything they are thinking about! Therefore, I would be surprised if there were enough Webtext to teach you the kind of deception strategy which you need to pull off to plan within the prompt while being watched, with models not strong enough to do human-level AI research. Please tell me if you can think of a kind of text where lies in which you “think through your lies” would be useful for NTP!
A prompt including these two incorrect reasoning examples can outperform some human-engineered CoT prompts.
Thanks for these surprising CoT examples!
I’m not sure how surprising it is that CoT examples with bad reasoning make the model generate good CoT, nor how relevant it is. For example, I find it more troubling that prompt optimization finds incomprehensible prompts. But on the other hand, I have never seen models successfully generate absurd but useful CoT, and I expect they won’t be able to because they haven’t been trained to generate CoT useful to themselves. Therefore, I expect that the only way models can generate CoT useful to themselves is to generate CoT useful to humans (which they ~know how to do), and use the information humans find useful that they have put in the CoT. I would be surprised if high likelihood CoT really useful to humans were also really useful to models for drastically different reasons.
(Note: figuring out which kind of reasoning would be useful is not that easy, even for humans, and I’m not sure teachers are great at figuring out how to best explain how to reason. I wouldn’t be surprised if 7th graders performed better with the horrible AQuA examples than with some human-engineered prompts.)
Interesting post!
[EDIT: see also gwern’s comments below]
Thanks for your detailed feedback!