I don’t see a strong conceptual difference between GPTs and models trained with in-fill objectives. In-fill objectives have some advantages for some tasks (like the tasks used to pretrain them), but they have the same properties when it comes to the number of sequential steps: one in-fill step vs. multiple sequential in-fills. Like GPTs, they aren’t easy to understand within one in-fill step, but I expect sequential in-fills to be close to human thoughts, though I haven’t seen exotic examples of sequential in-fills that make use of the fact that you can fill text in any order. (Fill-in models are also harder to train at scale, right? Do some people still expect these kinds of models to play a role on the road to AGI?)
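For concreteness, here is a minimal sketch of how an in-fill (fill-in-the-middle) objective can reduce to ordinary left-to-right next-token prediction on reordered text; the sentinel strings and the exact recipe are hypothetical, not any particular lab’s:

```python
import random

# Hypothetical sentinel strings; real tokenizers use dedicated special tokens.
PRE, SUF, MID = "<|prefix|>", "<|suffix|>", "<|middle|>"

def make_fim_example(doc: str) -> str:
    """Turn a document into one fill-in-the-middle training example.

    The model still does plain left-to-right next-token prediction,
    just on a reordered sequence: it sees the prefix and the suffix,
    then has to produce the missing middle. A single in-fill step is
    therefore still a single forward pass, like one GPT sampling step.
    """
    i, j = sorted(random.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
```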
Instruction fine-tuning is fine-tuning a prompt across many questions, right? I’ve heard that the resulting prompts are often not human-understandable, which is not very reassuring. (The prompt might encode increasingly detailed reasons and ways to escape supervision, just as RL could hypothetically learn a hard-coded deception strategy? To be honest, I have thought more about the power of GPTs at inference time than at training time, where the framing in terms of sequential steps breaks down.)
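(If what I am describing is closer to soft prompt tuning, here is a minimal sketch of the setup I have in mind; all shapes and names are illustrative, and `frozen_model` is an assumed callable that accepts embeddings directly:)

```python
import torch

# Soft prompt tuning, sketched: learn a few prompt embeddings across
# many questions while the base model stays completely frozen.
n_prompt_tokens, d_model = 20, 768
soft_prompt = torch.nn.Parameter(0.02 * torch.randn(n_prompt_tokens, d_model))

def prompted_forward(frozen_model, question_embeddings):
    # question_embeddings: (batch, seq_len, d_model)
    batch = question_embeddings.shape[0]
    prompt = soft_prompt.unsqueeze(0).expand(batch, -1, -1)
    return frozen_model(torch.cat([prompt, question_embeddings], dim=1))

# Only the prompt is optimized. The learned vectors need not correspond
# to any real tokens, which is why the resulting "prompt" can be unreadable.
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)
```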
What do you mean by RLHF via expert distillation?
The big difference I see between NTP and RL is that in RL the model has opportunities to create “coordination with itself”, which can take the form of hard-coded deception plans that don’t require much thought, or steganography techniques (which are really challenging to pull off if you never had a chance to plan for them in advance, but really easy if you are trained with RL under the wrong incentives).
At comparable levels of competence, I expect Best-of-N (BoN) sampling from a human imitator to be much safer than an RLHF’ed or expert-distilled model, because there isn’t this opportunity for the model to create “coordination with itself”.
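To make the comparison concrete, here is a minimal BoN sketch; `imitator_sample` and `judge_score` are assumed callables standing in for a human-imitation model and a human (or human-proxy) scorer:

```python
def best_of_n(imitator_sample, judge_score, prompt, n=16):
    """Draw n completions from the imitator and keep the best one.

    The safety-relevant property: the imitator's weights are never
    optimized against the judge, so the selection pressure stays
    outside the model and there is no training channel through which
    it could set up "coordination with itself".
    """
    candidates = [imitator_sample(prompt) for _ in range(n)]
    return max(candidates, key=judge_score)
```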
Same for prompt engineering, especially if your search process looks like “look by hand for things that would help humans” rather than “use gradient descent to find a good prompt”, where you have far less control over the information the model uses (adversarial prompts that push the right knobs inside the model might be the way to get the most performance, but it also means you don’t know what those knobs are). It still looks better than pure RL, since it only helps a little to create “coordination with itself”.
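To illustrate why the gradient-descent version gives you less control, here is one HotFlip/AutoPrompt-style swap step, sketched under the assumption that the model can be called on embeddings directly; real implementations differ in many details:

```python
import torch

def gradient_prompt_swap(model_from_embeddings, embedding_matrix,
                         prompt_ids, loss_fn):
    """One gradient-guided token swap in a discrete prompt.

    The search follows the gradient of the task loss with respect to a
    one-hot encoding of the prompt, then greedily swaps in the single
    (position, token) pair predicted to lower the loss the most.
    Nothing constrains the result to be human-readable: the search just
    pushes whichever knobs inside the model the gradient points at.
    """
    vocab_size = embedding_matrix.shape[0]
    one_hot = torch.nn.functional.one_hot(prompt_ids, vocab_size).float()
    one_hot.requires_grad_(True)
    loss = loss_fn(model_from_embeddings(one_hot @ embedding_matrix))
    loss.backward()
    # First-order estimate of the loss change from swapping each
    # position to each vocabulary token, relative to the current token.
    delta = one_hot.grad - (one_hot.grad * one_hot).sum(-1, keepdim=True)
    position = delta.min(dim=-1).values.argmin()
    new_prompt = prompt_ids.clone()
    new_prompt[position] = delta[position].argmin()
    return new_prompt
```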
But yeah, there is a big spectrum between pure NTP and pure RL, and I don’t have enough conceptual clarity to draw a clear boundary.
I agree that it seems unlikely that GPTs in the near future will do zero-shot steganography, but why do you expect that GPTs will fail to incorporate System 2 logic into their System 1? Is this mainly for competitiveness reasons?
The core claim only holds for competitiveness reasons. I would expect a “Dyson Sphere GPT” to be far beyond humans at everything using System 1 alone. But I still need the assumption that “things hard for humans’ System 1 are hard for GPTs’ System 1”. This assumption is wrong for some tasks where GPTs have an ungodly amount of training (like “a” vs. “the”, or pure NTP), but I expect it to hold for all actually relevant tasks, because the number of serial steps is of the same order of magnitude, and humans can roughly match GPTs’ width given enough training. For example, I expect that if some humans tried hard to solve multiple-choice math tests using their intuition alone (for example, having to answer just after the question has been read to them at 2x speed), they would crush GPTs until ~AGI.
Given that GPT is imitating humans who do deceive each other and themselves, including many who are much more sophisticated liars than a 5-year-old + mind reader, I’m not sure why you’re so confident that we’d get so much transparency by default.
I feel like mimicking liars doesn’t teach you to lie well enough to fool humans on purpose. I might have been mistaken to bring up self-deception. Knowing how to tell “simple lies” like “say A when you know B” is quite useful in many parts of Webtext, but I can’t think of kinds of text where you actually need to “think in your lie” (i.e., put information in your lie that you will later use to reason about your ulterior motives). Humans don’t have to do that, because for complex lies, humans have the luxury of not having to say everything they are thinking about! Therefore, I would be surprised if there were enough Webtext to teach the kind of deception strategy you would need to plan within the prompt while being watched, at least for models not strong enough to do human-level AI research. Please tell me if you can think of a kind of text where lies in which you “think through your lie” would be useful for NTP!
A prompt including these two incorrect reasoning examples can perform better than some human-engineered CoT prompts.
Thanks for these surprising CoT examples!
I’m not sure how surprising it is that CoT examples with bad reasoning make the model generate good CoT, nor how relevant it is. For example, I find it more troubling that prompt optimization finds incomprehensible prompts. On the other hand, I have never seen models successfully generate absurd but useful CoT, and I expect they won’t be able to, because they haven’t been trained to generate CoT useful to themselves. Therefore, I expect that the only way models can generate CoT useful to themselves is to generate CoT useful to humans (which they ~know how to do), and then use the information they have put in the CoT that humans find useful. I would be surprised if high-likelihood CoT that is really useful to humans were also really useful to models for drastically different reasons.
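A minimal sketch of the mechanism I mean, with `generate` an assumed sampling function: the model first writes reasoning in the only “useful” style it was trained on (useful to humans), then conditions on that same text to answer:

```python
def answer_with_cot(generate, question: str) -> str:
    """Two-stage chain-of-thought, sketched.

    The first call produces CoT in the model's human-useful style;
    the second call can then exploit whatever information the CoT
    made explicit in the prompt.
    """
    stem = f"{question}\nLet's think step by step."
    cot = generate(stem)
    return generate(f"{stem}\n{cot}\nTherefore, the answer is")
```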
(Note: figuring out what kind of reasoning would be useful is not that easy, even for humans, and I’m not sure teachers are great at figuring out how best to explain how to reason. I wouldn’t be surprised if 7th-graders had better performance with AQuA’s horrible examples than with some human-engineered prompts.)
Thanks for your detailed feedback!