Right now, it seems important not to restrict the transcripts at all. This is a hard exploration problem, where most of the answers are useless, and it takes a long time for correct answers to finally emerge. Given that, you need to keep the criteria as relaxed as possible, because they are already on the verge of impossibility.
The r1 paper, the other labs, and OAers on Twitter now seem to emphasize that the obvious, appealing approach of rewarding tokens for predicted correctness, or doing search on tokens, just doesn't work (right now). You need to 'let the LLMs yap' until they reach the final correct answer. This appears to be the reason for the bizarre non sequiturs or multi-lingual diversions in transcripts: that's just the cost of rolling out solution attempts which can go anywhere and keeping the winners. They will do all sorts of things which are unnecessary (and, conversely, omit tokens which are 'necessary'). Think of it as the equivalent of how DRL agents will 'jitter' and take many unnecessary actions, because those actions don't change the final reward by more than epsilon, and the RL feedback just isn't rich enough to say 'you don't need to bounce up and down randomly while waiting for the ball to bounce back; that doesn't actually help or hurt you'. (And if you try to reward-shape away those wasteful movements, you may discover your DRL agent converges to a local optimum where it doesn't do anything, ever, because the jitters served to explore the environment and find new tricks, and you made it too expensive to try useless-seeming tricks, so it never found any payoffs or laddered its way up in capabilities.)
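The 'roll out attempts which can go anywhere and keep the winners' loop can be sketched as a simple rejection-sampling filter. Everything here is a stand-in: `sample_transcript` represents the model, and the `Answer:` marker is a hypothetical convention for extracting the final answer; nothing in the transcript body is judged at all.

```python
def sample_transcript(model, problem, temperature=1.0):
    """Stand-in for sampling one free-form chain-of-thought rollout."""
    return model(problem, temperature)

def final_answer(transcript):
    """Extract whatever follows the last 'Answer:' marker, if any."""
    _, sep, tail = transcript.rpartition("Answer:")
    return tail.strip() if sep else None

def collect_winners(model, problems, answers, n_rollouts=1000):
    """Keep only transcripts whose final answer is correct.
    No per-token reward, no constraint on what the transcript does
    in between: yapping, language switches, skipped steps all pass."""
    winners = []
    for problem, gold in zip(problems, answers):
        for _ in range(n_rollouts):
            t = sample_transcript(model, problem)
            if final_answer(t) == gold:
                winners.append((problem, t))
    return winners  # fine-tune on these, then repeat
```

The point of the sketch is what is absent: there is no scoring of intermediate tokens anywhere, only the final-answer check.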
So you wouldn't want to impose constraints like 'must be a 100% correct, valid Lean proof', because it is hard enough to find a 'correct' transcript even when you don't penalize it for spending a while yapping in Japanese or pseudo-skipping easy steps by not writing them down. If you imposed constraints like that, then instead of rolling out 1,000 episodes and getting 1 useful transcript and the bootstrap working, you'd get 0 useful transcripts and it'd go nowhere.
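For the Lean-validation variant mentioned below, the strict grader could be as crude as writing the candidate to a file and asking the Lean binary to elaborate it. This is a sketch assuming a working Lean 4 install on the PATH, where exit code 0 means the file type-checked; the function name and reward wiring are made up here.

```python
import subprocess
import tempfile
from pathlib import Path

def lean_validates(proof_source: str, timeout_s: int = 60) -> bool:
    """Return True iff `lean` elaborates the candidate proof without
    error. Under the strict criterion, a transcript earns reward only
    if its extracted Lean code passes this check."""
    with tempfile.TemporaryDirectory() as d:
        f = Path(d) / "candidate.lean"
        f.write_text(proof_source)
        try:
            result = subprocess.run(["lean", str(f)],
                                    capture_output=True, timeout=timeout_s)
        except (subprocess.TimeoutExpired, FileNotFoundError):
            return False  # no Lean install, or a proof search that hangs
        return result.returncode == 0
```

This is exactly the kind of all-or-nothing gate the paragraph above warns against imposing too early: it returns `False` for everything until the model is already quite capable.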
What you might do is impose a curriculum: solve it any way you can at first, then solve it the right way. Once you have your o1 bootstrap working and have seen large capability gains, you can go back and retrain on the easiest problems with stricter criteria, and work your way back up through the capability levels, but now in some superior way. (In the DRL agent context, you might train to convergence and only then impose a very, very small penalty on each movement, and gradually ramp it up until the performance degrades a little bit but it’s no longer jittering.) The same way you might be taught something informally, and then only much later, after you’ve worked with it a lot, do you go back and learn or prove it rigorously. You might impose a progressive shrinking constraint, for example, where the transcript has to be fewer tokens each time, in order to distill the knowledge into the forward passes to make it vastly cheaper to run (even cheaper, for hard problems, than simply training a small dumb model on the transcripts). You might try to iron out the irrelevancies and digressions by having a judge/critic LLM delete irrelevant parts. You might try to eliminate steganography by rewriting the entire transcript using a different model. Or you might simply prompt it to write a proof in Lean, and score it by whether the final answer validates.
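The 'train to convergence, then ramp in a very small penalty' schedule from the DRL parenthetical might look like the following sketch; the coefficients are illustrative guesses, not tuned values from any paper.

```python
def shaped_reward(correct, n_tokens, penalty_per_token):
    """Phase 1: penalty_per_token == 0, so only final correctness pays.
    Phase 2: penalty_per_token ramps up from ~0, so among correct
    transcripts, shorter ones now score higher."""
    return (1.0 if correct else 0.0) - penalty_per_token * n_tokens

def penalty_schedule(step, warmup_steps=100_000, max_penalty=1e-4):
    """Zero penalty during the bootstrap, then a slow linear ramp,
    stopped (via max_penalty) before accuracy degrades too much."""
    if step < warmup_steps:
        return 0.0
    ramp = (step - warmup_steps) / warmup_steps
    return min(max_penalty, max_penalty * ramp)
```

The two-phase structure is the whole trick: the penalty never competes with exploration during the hard bootstrap phase, only with waste afterwards.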
There’s still my original question of where the feedback comes from. You say keep the transcripts where the final answer is correct, but how do you know the final answer? And how do you come up with the question?
What seems to be going on is that these models are actually quite supervised, despite everyone’s insistence on calling them unsupervised RL. The questions and answers appear to be high-quality human annotation instead of being machine generated. Let me know if I’m wrong about this.
If I'm right, it has implications for scaling. You need human annotators to scale, and you need to annotate increasingly hard problems. You don't get to RL your way to infinite skill like AlphaZero; if, say, the Riemann hypothesis turns out to be ~3 OOMs of difficulty beyond what humans can currently annotate, then this type of training will never solve Riemann, no matter how you scale.
So how could I have thought that thinking faster might actually be a sensible training trick for reasoning models?
In FBAI's COCONUT they use a curriculum to teach it to think shorter and differently, and it works. They are teaching it to think using fewer steps, compressing the steps into latent vectors instead of tokens.
first it thinks with tokens
then they replace one thinking step with a latent <thought> token
then 2
...
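That stagewise recipe can be sketched as a loop that builds one training set per stage. This is a paraphrase of the curriculum as described above, not COCONUT's actual code; `LATENT` stands in for a step supervised as a continuous <thought> vector rather than as tokens.

```python
LATENT = "<thought>"

def make_stage(token_steps, answer, n_latent):
    """Stage n: the first n reasoning steps are replaced by latent
    thought slots; the remaining steps stay as ordinary token
    supervision, followed by the final answer."""
    n_latent = min(n_latent, len(token_steps))
    return [LATENT] * n_latent + token_steps[n_latent:] + [answer]

def curriculum(token_steps, answer):
    """Stage 0 is plain chain-of-thought; each later stage swaps one
    more step for a latent thought, until every step is latent."""
    return [make_stage(token_steps, answer, k)
            for k in range(len(token_steps) + 1)]
```

Note that the answer token is supervised at every stage; only the intermediate steps migrate into latent space, which is why nothing in the thought slots is ever directly rewarded or punished.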
It’s not RL, but what is RL any more? It’s becoming blurry. They don’t reward or punish it for anything in the thought token. So it learns thoughts that are helpful in outputting the correct answer.
There's another relevant paper, "Compressed Chain of Thought: Efficient Reasoning through Dense Representations", which uses teacher forcing, although I haven't read the whole thing yet.
That's definitely RL (and what I was explaining was simply the obvious basic approach anyone in DRL would think of in this context, so of course there is research trying things like it). It's being rewarded for a non-differentiable global loss where the correct alternative or answer or label is not provided (not even information that a better decision exists), so standard supervised learning is impossible and exploration is required. Conceptually, this is little different from, say, training a humanoid-robot NN to reach a distant point in fewer actions: it can be a hard exploration problem (most sequences of joint torques or actions simply result in a robot having a seizure while lying on the ground, going nowhere), where you want to eventually reach the minimal sequence (to minimize energy, wear-and-tear, and time), so you start by solving the problem in any way possible, rewarding solely the final success, and then reward-shape toward a desirable answer. In effect, this breaks up the hard original problem into two more feasible problems in a curriculum: 'reach the target, ever', followed by 'improve a target-reaching sequence of actions to be shorter'.
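The two-problem decomposition at the end can be written as a single reward function with a phase switch; the per-action cost here is a made-up schematic value, not from any specific robotics setup.

```python
def curriculum_reward(reached_target: bool, n_actions: int,
                      phase: int, action_cost: float = 1e-3) -> float:
    """Phase 1 ('reach the target, ever'): sparse success-only reward,
    so even seizure-like flailing that eventually arrives still pays.
    Phase 2 ('make the sequence shorter'): same success bonus, minus a
    small per-action cost, so shorter successful sequences win out."""
    success = 1.0 if reached_target else 0.0
    if phase == 1:
        return success
    return success - action_cost * n_actions
```

Phase 1 makes the exploration problem feasible; phase 2 only re-ranks already-successful behaviors, which is why it must not be turned on until some successes exist.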