There are some pretty important caveats:
1. It isn’t able to distinguish between two hypotheses: (a) the capabilities stall appears because base models have a much more diverse space of capabilities to sample from, even though RL does impart new capabilities beyond pass@400, and (b) RL imparts no new capabilities to the learned algorithm beyond pass@400. Only hypothesis (b) actually implies a limit on RL capabilities (see the sketch after this list for how diversity alone can produce the pass@k pattern).
@Jozdien talks more about this below:
https://www.lesswrong.com/posts/s3NaETDujoxj4GbEm/tsinghua-paper-does-rl-really-incentivize-reasoning-capacity#Mkuqt7x7YojpJuCGt
2. As Asher stated, the result is consistent with a world where RL increases capabilities arbitrarily, so long as the outputs become less diverse, and this paper gives us no way to rule out RL increasing capabilities enough that you would still want to use the reasoning model over the base model:
https://www.lesswrong.com/posts/s3NaETDujoxj4GbEm/tsinghua-paper-does-rl-really-incentivize-reasoning-capacity#FJie6FweyqjqCKTMC
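To make the first caveat concrete, here is a minimal sketch with made-up per-problem solve probabilities (not numbers from the paper or either model), using the standard pass@k definition, i.e. the mean over problems of 1 - (1 - p)^k. It shows how a "diverse but shallow" base model can overtake a "sharpened but narrow" RL-tuned model at large k, which is exactly why the pass@400 crossover alone can't tell us whether RL added new capabilities or only concentrated probability mass on ones the base model already had.

```python
import numpy as np

# Hypothetical per-problem solve probabilities (illustrative only):
# the base model has a small chance on many problems (diverse),
# the RL-tuned model has a high chance on a subset of problems (sharpened).
rng = np.random.default_rng(0)
n_problems = 1000
base_p = rng.uniform(0.005, 0.05, n_problems)            # broad, shallow coverage
rl_p = np.where(np.arange(n_problems) < 300, 0.6, 0.0)   # deep coverage of a subset

def pass_at_k(p, k):
    """Probability that at least one of k independent samples solves a problem,
    averaged over all problems: mean(1 - (1 - p)^k)."""
    return np.mean(1.0 - (1.0 - p) ** k)

for k in [1, 16, 400]:
    print(f"pass@{k}: base={pass_at_k(base_p, k):.3f}  rl={pass_at_k(rl_p, k):.3f}")

# With these made-up numbers the RL-tuned model wins at pass@1 but the base
# model catches up and overtakes it by pass@400 -- the same qualitative pattern
# the paper reports, generated here purely by a diversity difference.
```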