Thank you for the detailed feedback!
If you are interested in working on this, consider applying to SPAR (https://sparai.org/program), where I am offering to mentor this project. Applications should open today, and I am confident that I can get funding for the project.
About your critiques:
I fully agree that the reviews will be expensive, but I also think there is no way around this. Consider how long it takes us humans to reflect on our experiences and ask ourselves “how could I have thought that faster?”. It can take hours to turn an experience into a good lesson, and yet we still do it, because it’s worth it.
Speed vs complexity: My take is that quickly optimizing a naive signal is a local optimum. At some point the model has extracted everything it can from that signal, and more deliberate updates become useful. For comparison, adding conscious thought to apes created humans, and we conquered the planet. Adding conscious thought to an amoeba probably wouldn’t have helped much; it would just have died out from the increased energy demands of a more complex brain. There is a tradeoff, and I believe LLMs are approaching the tipping point where more complex learning algorithms make sense, because the simpler ones have already reached diminishing returns.
LLMs may not be smart enough yet: I agree that this is a concern. I believe the technique I describe will be very useful as soon as a capability threshold is reached, but I don’t know whether we are at that threshold yet. Worst case scenario: the experiments don’t work. But in that case we can just keep the code and rerun it every time the frontier labs release a new model, until we get lucky and it works.
On the evolutionary pressure towards faithfulness: There are two reward signals. The main reward depends only on “was the result correct?”, so it can’t be gamed. The DCA reward from the Reviewer is derived from the main reward, only more granular. The student’s total reward therefore doesn’t change no matter what performance the model puts on.
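To make the conservation point concrete, here is a minimal sketch of what I mean (the function names and the exact redistribution rule are illustrative placeholders, not the actual implementation):

    # Minimal sketch of the two-signal reward structure; the names and the
    # redistribution rule are placeholders, not the real implementation.

    def main_reward(result_correct: bool) -> float:
        # Outcome-only signal: depends solely on "was the result correct?"
        return 1.0 if result_correct else 0.0

    def dca_rewards(step_weights: list[float], total: float) -> list[float]:
        # The Reviewer assigns per-step credit, but it only redistributes the
        # main reward: per-step credits are normalized so they sum to it.
        s = sum(step_weights)
        if s == 0.0:
            return [total / len(step_weights)] * len(step_weights)
        return [total * w / s for w in step_weights]

    # However the student "performs" (extra double-checks, staged mistakes),
    # credit only shifts between steps; the sum stays equal to the outcome
    # reward, so there is nothing to gain from theatrics.
    credits = dca_rewards(step_weights=[0.2, 0.5, 0.3], total=main_reward(True))
    assert abs(sum(credits) - 1.0) < 1e-9

The point of the sketch is only the invariant at the end: the Reviewer can move credit between steps, but it cannot create or destroy reward.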
Thanks for the reply!
I feel that more “deliberate” reward signals could be an interesting direction, but at the same time the “overarching ML lesson” seems to be that approximating complex signals, rather than computing them directly, always wins (e.g. PPO vs TRPO). At the very least, I think you’d need to train the Reviewer model for this specific task.
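To spell out the contrast I have in mind (these are the standard objectives, nothing specific to your setup): TRPO computes the trust-region constraint directly,
\[
\max_\theta \; \mathbb{E}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t \right] \quad \text{s.t.} \quad \mathbb{E}_t\!\left[ \mathrm{KL}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right) \right] \le \delta,
\]
whereas PPO replaces the constraint with a cheap clipped surrogate,
\[
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\, \hat{A}_t,\; \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right) \right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
\]
and the cheap approximation is the one that won out in practice.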
However, I think our disagreement on the first three points is somewhat fundamental, so I’ll put it aside for now. Please let me know if you think more feedback / discussion would be useful!
I am still curious why you think the model won’t become performative: e.g. on questions it already gets correct, will it add unnecessary double-checking, or make intentional mistakes so it can heroically catch them? Maybe you could try specifying the reward you have in mind more concretely?
Thanks!