This seems like a Chinese model for superintelligence! (All the authors are Chinese, though a few are working in the West.) Not in the AIXI sense of something that is optimal from the beginning, but rather something that could bootstrap its way to superintelligence. One could compare it to Schmidhuber's Gödel machine concept, but more concrete, and native to the deep learning era.
(If anyone has an argument as to why this isn’t a model that can become arbitrarily intelligent, I’m interested.)
There’s this paper suggesting RLVR (which is what Absolute Zero generates training data for) can’t reach capabilities exceeding those of the base pretrained model at something like pass@400 (depending on the task).
There are some pretty important caveats:
1. The paper can't distinguish between two hypotheses: (a) that capabilities stall because base models have a much more diverse space of capabilities to sample from, even though RL still imparts new capabilities past pass@400; or (b) that RL doesn't impart new capabilities to the learned algorithm past pass@400. Only hypothesis (b) actually implies a limit on RL capabilities.
@Jozdien talks more about this below:
https://www.lesswrong.com/posts/s3NaETDujoxj4GbEm/tsinghua-paper-does-rl-really-incentivize-reasoning-capacity#Mkuqt7x7YojpJuCGt
2. As Asher stated, the results would be consistent with a world where RL increases capabilities arbitrarily, so long as those capabilities become less diverse; on this paper alone, we have no way to rule out RL increasing capabilities enough that you do want to use the reasoning model over the base model:
https://www.lesswrong.com/posts/s3NaETDujoxj4GbEm/tsinghua-paper-does-rl-really-incentivize-reasoning-capacity#FJie6FweyqjqCKTMC
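(For concreteness, here's a minimal Python sketch of the standard unbiased pass@k estimator from Chen et al. (2021), which this line of work builds on; the numbers in the example are invented for illustration.)

```python
# Unbiased pass@k estimator (Chen et al., 2021): given n samples per
# problem, c of them correct, estimate the probability that at least
# one of k samples solves the problem.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one correct) when drawing k of n samples, c correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustration with made-up numbers: a problem the model almost never
# solves per sample can still be "solved" at large k.
print(pass_at_k(400, 3, 1))    # 0.0075 (per-sample accuracy)
print(pass_at_k(400, 3, 100))  # ~0.58
```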
That paper is contradicted by this new NVIDIA paper, which shows the opposite using a 1.5B distill of DeepSeek R1. I don't have much technical knowledge, so a deep dive by someone more knowledgeable would be appreciated, especially in comparison to the Tsinghua paper.
I saw the NVIDIA paper; I don't think the data it presents makes that case. In particular, their “intermediate” checkpoint is too far away from the base model to correctly reference the crossover point (where the base model pass@k intersects the early RLVR pass@k). And the base model choice is strange for a study like this (it already has finetuning on DeepSeek-R1 traces in it, so the base model proper is mixed up with elicitation through R1 traces, when comparing with elicitation through subsequent RLVR).
In some of the plots, the intersection point isn’t visible, and mostly the “final” checkpoint seems to get worse than the “intermediate” checkpoint on pass@k plots at very high k, confirming rather than opposing the point of the Yue et al. paper (regarding the crossover point).
The fact that they’ve plotted pass@16 in Figure 1 as illustrative of the overall framing of the paper suggests that they aren’t grappling with the correct point, because if k=16 is earlier than the crossover point, then of course pass@16 performance will keep increasing. The question is whether it’ll ever exceed the performance at the crossover point.
(Of course, for sufficiently simple problems, RL works and can train a model to do things that the base model can’t do at all. And in principle RL should be able to do this in general, that’s the promise of RL. The question is whether it works for interesting problems that can’t be as easily solved with RL directly, using current methods for doing RLVR. If not, it can’t just be directly scaled to the moon within 1-2 years.)
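(To make the crossover point concrete, a hypothetical sketch: it estimates mean pass@k curves for a base model and an RLVR model from per-problem sample counts, and reports the smallest k at which the base curve overtakes the RLVR curve. The counts below are invented for illustration, not taken from either paper.)

```python
# Hypothetical sketch of locating the crossover point: the smallest k
# where the base model's mean pass@k overtakes the RLVR model's.
# Inputs are per-problem (n_samples, n_correct) pairs from sampling
# both models on the same benchmark; the numbers below are made up.
from math import comb

def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(counts, k):
    # counts: list of (n_samples, n_correct) per problem
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)

def crossover_k(base_counts, rl_counts, max_k):
    for k in range(1, max_k + 1):
        if mean_pass_at_k(base_counts, k) > mean_pass_at_k(rl_counts, k):
            return k
    return None  # base model never overtakes up to max_k

# Toy pattern from the papers: RLVR is better per sample (low k), but
# the more diverse base model catches up at high k.
base = [(512, 2), (512, 40), (512, 0), (512, 5)]
rl   = [(512, 60), (512, 120), (512, 0), (512, 0)]
print(crossover_k(base, rl, max_k=512))
```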
Thank you for the quick reply.