In FBAI’s COCONUT, they use a curriculum to teach the model to think shorter and differently, and it works. They teach it to think in fewer steps, compressing the reasoning into latent vectors instead of tokens:
first it thinks with tokens
then they replace one thinking step with a latent <thought> token
then 2
...
It’s not RL, but what is RL anymore? The line is becoming blurry. They don’t reward or punish it for anything in the thought tokens; it simply learns whatever thoughts are helpful for outputting the correct answer.
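For concreteness, here’s roughly how I picture that training objective, as a toy PyTorch sketch against a HuggingFace-style causal LM. The names (coconut_loss, train_curriculum, thoughts_per_step) and the dataset format are mine, not the paper’s, and I’m leaving out details like the <bot>/<eot> markers the paper puts around the latent segment:

import torch
import torch.nn.functional as F

def coconut_loss(model, question_ids, target_ids, latent_steps):
    # One COCONUT-style loss, batch size 1.
    # question_ids, target_ids: LongTensors of shape (1, length).
    embed = model.get_input_embeddings()
    inputs = embed(question_ids)                         # (1, q_len, d)

    # "Latent thoughts": instead of decoding a token, take the last hidden
    # state and feed it straight back in as the next input embedding.
    for _ in range(latent_steps):
        h = model(inputs_embeds=inputs,
                  output_hidden_states=True).hidden_states[-1][:, -1:, :]
        inputs = torch.cat([inputs, h], dim=1)

    # Append the still-supervised continuation (remaining CoT steps + answer)
    # and compute cross-entropy ONLY on those token positions: nothing in the
    # latent segment is rewarded, punished, or matched against a label.
    inputs = torch.cat([inputs, embed(target_ids)], dim=1)
    logits = model(inputs_embeds=inputs).logits
    n = target_ids.shape[1]
    preds = logits[:, -n - 1:-1, :]                      # shift by one
    return F.cross_entropy(preds.reshape(-1, preds.size(-1)),
                           target_ids.reshape(-1))

def train_curriculum(model, optimizer, dataset, max_stage, thoughts_per_step):
    # Stage k replaces the first k reasoning steps with k * thoughts_per_step
    # latent thoughts; stage 0 is ordinary chain-of-thought fine-tuning.
    for stage in range(max_stage + 1):
        for question_ids, cot_steps, answer_ids in dataset:
            kept = list(cot_steps[stage:]) + [answer_ids]  # still in tokens
            target_ids = torch.cat(kept, dim=1)
            loss = coconut_loss(model, question_ids, target_ids,
                                latent_steps=stage * thoughts_per_step)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

Gradients flow through the fed-back hidden states, so the whole thing stays end-to-end differentiable; whatever the latent vectors end up encoding is just whatever helps predict the remaining tokens.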
There’s another relevant paper, “Compressed Chain of Thought: Efficient Reasoning through Dense Representations”, which uses teacher forcing, although I haven’t read the whole thing yet.
That’s definitely RL (and what I was explaining was simply the obvious basic approach anyone in DRL would think of in this context, so of course there is research trying things like it). The model is being rewarded against a non-differentiable global loss where the correct alternative, answer, or label is not provided (not even the information that a better decision exists), so standard supervised learning is impossible and exploration is required.

Conceptually, this is little different from, say, training a humanoid robot NN to reach a distant point in fewer actions. That can be a hard exploration problem (most sequences of joint torques or actions simply result in a robot having a seizure while lying on the ground, going nowhere), where you eventually want the minimal sequence (to minimize energy / wear-and-tear / time). So you start by solving the problem in any way possible, rewarding solely on final success, and then reward-shape into a desirable answer, which in effect breaks the hard original problem into two more feasible problems in a curriculum: ‘reach the target ever’, followed by ‘improve a target-reaching sequence of actions to be shorter’.
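To make the robot analogy concrete, the two-stage shaping could look something like this toy reward function (names and numbers are mine, purely illustrative):

def shaped_reward(reached_target: bool, num_steps: int, phase: int,
                  step_cost: float = 0.001) -> float:
    # Two-phase reward for 'reach the target ever' -> 'reach it in fewer
    # actions'. Keep step_cost * max_episode_length below the success reward
    # so a quick failure never outscores a slow success.
    success = 1.0 if reached_target else 0.0
    if phase == 1:
        # Phase 1: sparse terminal reward; reaching the target in any way counts.
        return success
    # Phase 2: same terminal reward minus a small per-action cost, so among
    # successful trajectories the shorter one wins.
    return success - step_cost * num_steps

# e.g. a 60-step success scores 0.94, a 400-step success 0.60, and a failure
# scores at most 0 minus its step costs.

Phase 2’s per-step penalty is only easy to optimize because phase 1 has already found some successful trajectories to shorten, which is the point of running it as a curriculum.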