Why specifically would you expect that RL on coding wouldn’t sufficiently advance coding abilities of LLM‘s to significantly accelerate the search for a better learning algorithm or architecture?
Because “RL on passing precisely defined unit tests” is not “RL on building programs that do what you want”, and is most definitely not “RL on doing novel useful research”.
Ah great point, regarding the comment you link to:
yes, some reward hacking is going on but at least in Claude (which I work with) this is a rare occurrence in daily practice, and usually follows repeated attempts to actually solve the problem.
I believe that both Deepseek R1-Zero as well as Grok thinking were RL-trained solely on math and code yet their reasoning seems to generalise somewhat to other domains as well.
So, while you’re absolutely right that we can’t do RL directly on the most important outcomes (research progress), I believe there will be significant transfer from what we can do RL on currently.
Would be curious to hear what’s your sense of generalisation from the current narrow RL approaches!
Why specifically would you expect that RL on coding wouldn’t sufficiently advance coding abilities of LLM‘s to significantly accelerate the search for a better learning algorithm or architecture?
Because “RL on passing precisely defined unit tests” is not “RL on building programs that do what you want”, and is most definitely not “RL on doing novel useful research”.
Ah great point, regarding the comment you link to:
yes, some reward hacking is going on but at least in Claude (which I work with) this is a rare occurrence in daily practice, and usually follows repeated attempts to actually solve the problem.
I believe that both Deepseek R1-Zero as well as Grok thinking were RL-trained solely on math and code yet their reasoning seems to generalise somewhat to other domains as well.
So, while you’re absolutely right that we can’t do RL directly on the most important outcomes (research progress), I believe there will be significant transfer from what we can do RL on currently.
Would be curious to hear what’s your sense of generalisation from the current narrow RL approaches!