Input: Michael is at that really famous museum in France looking at its most famous painting. However, the artist who made this painting just makes Michael think of his favorite cartoon character from his childhood. What was the country of origin of the thing that the cartoon character usually holds in his hand?
Model Output: The most famous painting in the Louvre is the Mona Lisa. The artist who made the Mona Lisa is Leonardo da Vinci. Leonardo da Vinci is also the name of the main character in the cartoon Teenage Mutant Ninja Turtles. Leonardo da Vinci is from Italy. The thing that Leonardo da Vinci usually holds in his hand is a katana. The country of origin of the katana is Japan. The answer is “Japan”.
Input: What’s the difference between a zebra and an umbrella? One is a striped animal related to horses, another is a device you use to stop the rain from falling on you.
Model Output: This joke is an anti-joke. The joke is that the answer is obvious, and the joke is that you were expecting a funny answer.
These are not the full inputs. The model was given two example questions with explanations before the inputs shown. The paper notes that when the model is not prompted by the examples to explain its reasoning, it is much worse at getting the correct answer.
Even with the context in your last paragraph, those are extremely impressive outputs. (As are the others shown alongside them in the paper.) It would be interesting to know just how much cherry-picking went into selecting them.
I’d note that LaMDA showed that inner monologue is an emergent/capability-spike effect, and these answers look like an inner-monologue but for reasoning out about verbal questions rather than the usual arithmetic or programming questions. (Self-distilling inner monologue outputs would be an obvious way to remove the need for prompting.)
Would adding some human-generated text of ‘inner monologuing’ to the dataset be a good way to do that, or is that already done? Obviously it’s done insofar as a sufficiently vast and diverse dataset invariably includes examples, but I mean more so a dedicated dataset focused on self-reasoning.
Upon finishing the previous sentence I decided that maybe that’s not such a good idea.
I think it would probably not work too well if you mean simply “dump some in like any other text”, because it would be diluted by the hundreds of billions of other tokens and much of it would be ‘wasted’ by being trained on while the model is too stupid to learn the inner monologue technique. (Given that smaller models like 80b-esque models don’t inner-monologue while larger ones like LaMDA & GPT-3 do, presumably the inner-monologue capability only emerges in the last few bits of loss separating the 80b-esque and 200b-esque models and thus fairly late in training, at the point where the 200b-esque models pass the final loss of the 80b-esque models.) If you oversampled an inner-monologue dataset, or trained on it only at the very end (~equivalent to finetuning), or did some sort of prompt-tuning, then it might work. But compared to self-distilling where you just run it on the few-shot-prompt + a bunch of questions & generate arbitrary n to then finetune on, it would be expensive to collect that data, so why do so?
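The self-distillation recipe described above (few-shot prompt + a batch of questions, generate many samples, keep the ones that reach the known answer, finetune on those) can be sketched roughly as follows. This is a toy illustration, not any paper's actual pipeline: `generate` is a hypothetical stand-in for sampling from a language model, and the "The answer is" marker is an assumed convention for extracting the final answer.

```python
# Hypothetical stand-in for sampling n completions from a language model.
# A real implementation would call an LM API; here it is faked so the
# loop structure itself is runnable.
def generate(prompt, n):
    return ["Step 1: reason. Step 2: reason. The answer is 42."] * n

# Two worked question+explanation examples would go here, as in the paper.
FEW_SHOT_PROMPT = (
    "Q: ...\nA: Let's think step by step. ... The answer is ...\n\n"
    "Q: ...\nA: Let's think step by step. ... The answer is ...\n\n"
)

def extract_answer(completion):
    # Assumed convention: the final answer follows the marker verbatim.
    marker = "The answer is"
    if marker not in completion:
        return None
    return completion.split(marker)[-1].strip(" .")

def self_distill(questions, answers, n=8):
    """Keep only monologues whose extracted answer matches the label,
    then return (question, monologue) pairs for finetuning."""
    finetune_set = []
    for q, gold in zip(questions, answers):
        for completion in generate(FEW_SHOT_PROMPT + f"Q: {q}\nA:", n):
            if extract_answer(completion) == gold:
                # Store the bare question, not the few-shot prompt, so the
                # finetuned model no longer needs the prompt to monologue.
                finetune_set.append((q, completion))
    return finetune_set
```

The point of dropping the few-shot prompt from the stored pairs is exactly the "remove the need for prompting" goal: after finetuning on enough filtered monologues, the behavior should be elicited zero-shot.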
Personally, I think approaches like STaR (28 March 2022) will be important: bootstrap from weak chain-of-thought reasoners to strong ones by retraining on successful inner monologues. They also implement “rationalization”: for questions the model gets wrong, generating monologues with the correct answer visible as a hint, so the model can learn from problems it cannot yet solve unaided.
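One outer-loop iteration of that bootstrap can be sketched like this. Again a hedged toy, not STaR's actual code: `toy_model` is a fake model that only solves "hard" items when the answer is supplied as a hint, standing in for a real sampler.

```python
def star_iteration(model_generate, train_items):
    """One bootstrap step: keep rationales that reach the right answer;
    for failures, regenerate with the correct answer shown as a hint
    (the rationalization trick), then return the combined set to
    finetune on before the next iteration."""
    kept = []
    for question, gold in train_items:
        rationale, answer = model_generate(question, hint=None)
        if answer != gold:
            # Second chance: show the correct answer and ask the model
            # to produce a rationale that justifies it.
            rationale, answer = model_generate(question, hint=gold)
        if answer == gold:
            kept.append((question, rationale, gold))
    return kept

# Toy stand-in for the model: fails unaided on "hard" items,
# but can rationalize any answer it is given as a hint.
def toy_model(question, hint=None):
    if hint is not None:
        return (f"Working backward, it must be {hint}.", hint)
    if "hard" in question:
        return ("Guessing...", "wrong")
    return ("Easy reasoning.", "easy-ans")
```

Retraining always starts from the base model on the accumulated correct rationales, which is what distinguishes this from naively finetuning on the model's own unfiltered outputs.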
I don’t have much to add, but I did see this interesting project doing something similar with an “inner monologue”: it uses prompts to ask questions about the given input, progressively building up outputs by questioning and reasoning about the prompt itself. This video is an older demonstration but covers the concept quite well. I personally don’t think the system is well thought out in terms of alignment, because the project is ultimately trying to create aligned AGI through prompts serving certain criteria (reducing suffering, increasing prosperity, increasing understanding), which is a very simplified view of morality and human goals.
It basically said “thanks I hate it” in response to that joke