Thanks for the detailed outline! FYI, I’m a PhD student with some experience in LLM RL, so I may have the time / skills / compute to collaborate on this.
I like “Problem One (Capabilities): …”. I agree that LLMs’ inability to “build their own training” through self-reflection, as humans do, is a real current limitation. I really like the focus on safety research that also contributes to capabilities, as it is much more likely to be adopted. I think that is a productive and realistic mindset!
That said, I have some concerns with your proposal:
I think you are seriously underestimating the cost of the “reviewer” LLM. From experience, generating sequences (7B models, 2048 max sequence length) takes up >90% of the training time in PPO, which has a Value Model. Properly deliberating on each sequence will require a reviewer that is at least as large as the student and whose reasoning is, optimistically, 3–4x as long as the student’s answer (roughly: at least 2x just to have a single reflective sentence for each sentence the student outputs, plus some more to consider how it fits into the broader context). Generation cost grows roughly quadratically with sequence length, which means you would be spending the (vast) majority of your compute critiquing rather than generating.
Note also that autoregressive generation is more expensive than all-at-once evaluation, so the reviewer’s generation would be much more expensive than a PPO Value Model’s forward pass.
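To make the cost argument concrete, here is a rough back-of-the-envelope sketch; the equal model sizes, the 2–4x length ratio, and cost scaling quadratically with length are my assumptions, not measurements:

```python
# Rough back-of-the-envelope: what fraction of generation compute goes to the
# reviewer's critique vs. the student's answer, assuming (1) reviewer and
# student are the same size, (2) generation cost grows ~quadratically with
# sequence length, and (3) the reviewer's reasoning is k times longer.

def critique_share(student_len: int, length_ratio: float) -> float:
    """Fraction of total generation cost spent on the reviewer."""
    student_cost = student_len ** 2                     # ~quadratic in length
    reviewer_cost = (length_ratio * student_len) ** 2
    return reviewer_cost / (student_cost + reviewer_cost)

if __name__ == "__main__":
    for k in (2.0, 3.0, 4.0):
        print(f"reviewer {k:.0f}x longer -> "
              f"{critique_share(2048, k):.0%} of compute goes to critique")
    # reviewer 2x longer -> 80% of compute goes to critique
    # reviewer 3x longer -> 90% of compute goes to critique
    # reviewer 4x longer -> 94% of compute goes to critique
```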
If I interpret it correctly, the lesson of modern-day LLM RL, as exemplified by GRPO, is not to critique carefully but to iterate many times in the direction of a “naive” signal. GRPO has the simplest possible token-level reward (the advantage of the entire sequence!) and does remarkably well. GRPO produces reasoning traces that are mostly good because, on average, such traces lead to the correct answer more often. With GRPO you get to spend ~100% of your compute on generation.
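For concreteness, here is a minimal sketch of the GRPO-style credit assignment I mean (KL penalty and clipping omitted; the point is just that every token of a completion inherits one sequence-level advantage):

```python
import numpy as np

def grpo_token_advantages(group_rewards, token_counts):
    """GRPO-style credit assignment for one prompt's group of completions.

    Each completion gets a single scalar advantage (its reward whitened
    within the group), and every token of that completion inherits it.
    """
    r = np.asarray(group_rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-6)     # normalize within the group
    # broadcast the per-sequence advantage to all of its tokens
    return [np.full(n, a) for a, n in zip(adv, token_counts)]

# 4 sampled answers to the same prompt: reward 1.0 = correct, 0.0 = wrong
advs = grpo_token_advantages([1.0, 0.0, 0.0, 1.0], [120, 95, 210, 150])
print(advs[0][:3])  # every token of the first answer shares the same value
```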
Besides, I think you are overestimating the ability of current LMs to dissect reasoning and determine which parts are meaningful. This seems like a hard task that even a human would struggle with. While the Nanda paper you link gives some evidence that reasoning can be broken up into more digestible blocks via fairly involved methods, I see no evidence that current LMs can perform this task well. (If you want to train the reviewer model to perform this task better, you run into additional issues.)
Additionally, I’m not sure why you expect that training via DCT increases CoT faithfulness. Wouldn’t the model learn to optimize both the reviewer CoT reward and the correct-answer reward by outputting reasoning traces that signal lots of sound reasoning regardless of the final answer? For example, I could imagine the student getting quite performative (e.g. inserting lots of “wait, let’s verify that!”), which seems to go against CoT faithfulness. It seems to me that CoT faithfulness is about the correlation between the model’s reasoning and its answers / actions, so to improve it you would have to optimize that explainability / correlation directly?
Overall, I imagine that the computational expense of your approach + current LMs’ inability to critique + current approaches already optimizing indirectly for good block-level structure together suggest this idea wouldn’t work too well in practice. I’m probably most interested in discussing the potential disagreement about this method producing faithful CoTs, as I think that is broadly an interesting goal to pursue via RL.
Thank you for the detailed feedback!
If you are interested in working on this, consider applying to SPAR (https://sparai.org/program), where I am offering to mentor this project. Applications should open today, and I am confident that I can get funding for it.
About your critiques:
I fully agree that the reviews will be expensive, but I also think there is no way around this. Consider how long it takes us humans to reflect on our experiences and decide “how could I have thought that faster?”. It can take hours to turn an experience into a good lesson. And yet we still do it, because it’s worth it.
Speed vs complexity: My take is that optimizing quickly for a naive signal is a local optimum. At some point the model has learned everything it can from that signal, and more deliberate updates become useful. For comparison, adding conscious thought to apes produced humans, and we conquered the planet. Adding conscious thought to an amoeba probably wouldn’t have helped much; it would have just died out from the increased energy demands of a more complex brain. There is a tradeoff, and I believe LLMs are approaching the tipping point where it makes sense to go for more complex learning algorithms, because the simpler ones have already reached the point of diminishing returns.
LLMs may not be smart enough yet: I agree that this is a concern. I believe the technique I describe will be very useful as soon as a capability threshold is reached, but I don’t know if we are at that threshold yet. Worst-case scenario: the experiments don’t work. But in that case we can just keep the code and rerun it every time the frontier labs release a new model, until we get lucky and it works.
On the evolutionary pressure towards faithfulness: There are two reward signals. The main reward depends only on “was the result correct?” and so can’t be gamed. The DCA reward from the Reviewer is based on the main reward, just made more granular. The total reward the student receives doesn’t change no matter what performance the model puts on.
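To sketch what I mean (the block segmentation and the way the Reviewer produces its weights are still open design questions; this only illustrates the “redistribution preserves the total” property):

```python
import numpy as np

def redistribute_reward(main_reward, reviewer_weights):
    """Granular reward sketch: the Reviewer assigns each block of the
    student's reasoning a non-negative weight, and the main reward is
    split across blocks in proportion to those weights.

    The per-block rewards always sum to the main reward, so the student
    cannot increase its total return by adding performative blocks; it can
    only shift where the (fixed) credit lands.
    """
    w = np.clip(np.asarray(reviewer_weights, dtype=float), 0.0, None)
    if w.sum() == 0:
        w = np.ones_like(w)                 # fall back to uniform credit
    return main_reward * w / w.sum()

# e.g. main reward 1.0 (correct answer); Reviewer thinks block 2 did the work
print(redistribute_reward(1.0, [0.1, 0.7, 0.2]))  # [0.1 0.7 0.2], sums to 1.0
```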
Thanks for the reply!
I feel that more “deliberate” reward signals could be an interesting direction, but at the same time I think the “overarching ML lesson” is that approximating complex signals, instead of computing them directly, always wins (e.g. PPO vs TRPO). At the very least, I think you’d need to train the reviewer model for this specific task.
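To spell out the PPO vs TRPO example: TRPO computes a trust-region update directly (explicit KL constraint, conjugate gradient), while PPO approximates the same idea with a cheap clipped ratio, and the crude approximation is what everyone uses. A minimal sketch of the clipped surrogate:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO's clipped surrogate loss (to be minimized).

    Where TRPO enforces an explicit KL trust-region constraint, PPO simply
    clips the probability ratio -- a cheap approximation that won in practice.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```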
However, I think our disagreement on the first three points is somewhat fundamental, so I’ll put it aside for now. Please let me know if you think more feedback / discussion might be useful!
I am still curious why you think the model won’t get performative: e.g. on questions it gets correct, won’t it add unnecessary double-checking, or make intentional mistakes so it can heroically catch them? Maybe you could try specifying the reward you have in mind more concretely?
Thanks!