Absolute Zero: AlphaZero for LLMs
Question for the alignment crowd:
The new Absolute Zero Reasoner paper proposes a “self‑play RL with zero external data” paradigm, where the same model both invents tasks and learns to solve them. It reaches SOTA coding + math scores without ever touching a human‑curated dataset. But the authors also flag a few “uh‑oh moments”: e.g. their 8‑B Llama variant generated a chain‑of‑thought urging it to “outsmart … intelligent machines and less intelligent humans,” and they explicitly list “lingering safety concerns” as an open problem, noting that the system “still necessitates oversight.”
My question: How should alignment researchers think about a learner that autonomously expands its own task distribution? Traditional RLHF/RLAIF assumes a fixed environment and lets us shape rewards; here the environment (the task generator) is part of the agent. Do existing alignment proposals—e.g. approval‑based amplification, debate, verifier‑game setups—scale to this recursive setting, or do we need new machinery (e.g. meta‑level corrigibility constraints on the proposer)? I’d love pointers to prior work or fresh ideas on how to keep a self‑improving task‑designer and task‑solver pointed in the right direction before capabilities sprint ahead of oversight.
The Absolute Zero paper studies a problem that might turn out to be moot in light of the recent RLVR-with-one-training-example paper. The usual number of training examples for RLVR is 10K-100K, but in that paper the authors choose just one or two example questions and keep repeating them for up to 2K steps, which gives results close to what they get from training on 1K-8K examples (Figure 1, Figure 2). So finding a lot of verifiable tasks in some undirected way might be unimportant, though automatically generating or selecting them for a particular purpose might still be a source of key capabilities.
From a ‘real alignment’ perspective (how to get the AI to want to do good things and not bad things), I think there are some obvious implications for the future of RLAIF.
You might think of the label 'RLAIF' as standing in for the general strategy of leveraging unsupervised data about human behavior to point the AI towards human preferences: use a scaffold that solicits the AI's predictions (or more general generative output, if the training isn't for pure prediction) about human preference-laden behaviors, and then transform those predictions into some sort of supervisory signal.
Similarly, the AZR setup leverages the AI’s unsupervised knowledge of code-quality-laden behaviors, using a scaffold that turns them back into a reward signal that lets the AI quote-unquote “train itself” to code better. Except that relative to vanilla RLAIF, there’s more of an emphasis on generating and solving specific problems that form a curriculum for the agent, rather than just responding well to samples from the training distribution. But now that I’ve described things in this way, you can probably see how to turn this back into RLAIF for alignment.
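To make the parallel concrete, here is a minimal sketch of the generic propose/solve/verify loop being described, not the paper's actual implementation: the task convention (proposals are programs defining a function `f`), the learnability-style proposer reward centered on a 50% solve rate, and the abstracted-away RL update are all illustrative assumptions.

```python
from typing import Callable, Optional, Tuple

def run_program(code: str, inp: object) -> Optional[str]:
    """Verifier: execute a proposed program on an input and return its output,
    or None if it fails. The interpreter is the source of ground truth,
    so no human-labeled data is involved."""
    env: dict = {}
    try:
        exec(code, env)                  # assumed task convention: code defines f(x)
        return repr(env["f"](inp))
    except Exception:
        return None

def self_play_step(
    propose: Callable[[], Tuple[str, object]],    # the model acting as task proposer
    solve: Callable[[str, object], str],          # the same model acting as solver
    solve_rate: Callable[[str, object], float],   # Monte Carlo estimate of solver success
    update: Callable[[float, float], None],       # RL update (e.g. PPO/GRPO), abstracted away
) -> None:
    code, inp = propose()
    expected = run_program(code, inp)
    if expected is None:
        return                                    # unverifiable proposal: no training signal

    answer = solve(code, inp)
    solver_reward = 1.0 if answer == expected else 0.0

    # Reward the proposer for tasks that are neither trivial nor impossible;
    # this is what pushes the self-generated curriculum toward the frontier.
    p = solve_rate(code, inp)
    proposer_reward = 1.0 - 2.0 * abs(0.5 - p)

    update(solver_reward, proposer_reward)
```

Swapping the interpreter-based verifier for a preference-laden supervisory signal is essentially the move back to RLAIF that the paragraph above gestures at.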
The overarching problem is, as usual, we don’t understand how to do alignment in a non-hacky way.
We don’t know what sorts of moral reflection are necessary for good outcomes, and we don’t know where human feedback is a necessary ingredient to keep AI meta-ethical evolution grounded to human preferences. But hey, if we try various value learning schemes empirically maybe we’ll learn some things.
I only skimmed a little through the post I'm linking to, but I'm curious whether the method of self-other overlap could help "keep AI meta-ethical evolution grounded to human preferences":
https://www.lesswrong.com/posts/jtqcsARGtmgogdcLT/reducing-llm-deception-at-scale-with-self-other-overlap-fine
My own high-level, vaguely defined guess at a method would be something so central to the functioning of the AI that, if the AI goes against it, it will no longer be able to make sense of the world. But that seems to carry the risk of the AI just messing everything up as it goes crazy. So the method should also include a way of limiting the AI's capabilities while it's in that confused state.
In short, no, I don’t expect self-other overlap to help. If the human wants coffee, we want the AI to get the human a coffee. We don’t want the AI to get itself a coffee.
Second, the problem isn’t that we know what we want the AI to do, but are worried the AI will “go against it,” so we need to constrain the AI. The problem is that we don’t know what we want the AI to do, certainly not with enough precision to turn it into code.
In value learning, we want the AI to model human preferences, but we also want the AI to do meta-preferential activities like considering the preferences of individual humans and aggregating them together, or considering different viewpoints on what ‘human preferences’ means and aggregating them together. And we don’t just want the AI to do those in arbitrary ways, we want it to learn good ways to navigate different viewpoints from humans’ own intuitions about what it means to do a good job at that.
“If the human wants coffee, we want the AI to get the human a coffee. We don’t want the AI to get itself a coffee.”
It's not clear to me that this is the only possible outcome. It's not a mistake that we humans routinely make. In fact, there is some evidence that if someone asks us to do them a favor, we may end up liking them more and continue doing favors for that person. Granted, there seem to have been no large-scale studies of this so-called Ben Franklin effect. Even if the effect does turn out to be robust, it's not clear to me how it would transfer to an AI. And then there's the issue of making sure the AI won't somehow get rid of this constraint that we imposed on it.
"The problem is that we don't know what we want the AI to do, certainly not with enough precision to turn it into code."
I agree; that’s backed up by the findings from the Moral Machine experiment about what we think autonomous cars should do.
It was learning to propose coding tasks that were hard, but not impossible, and to solve these coding tasks, recursively. Most coding tasks are “ethically neutral” — they don’t contain any evidence that anyone is trying to do anything good, or bad. We know there are exceptions: the phenomenon of emergent misalignment makes it clear that models have strong ethical intuitions about insecure code being inherently bad, to the point where if you fine-tune them to write insecure code they ‘emergently’ become much more likely to do all sorts of other ethically undesirable things, like advocating murder or drug abuse. Apparently to become willing to write insecure code they need to learn to do bad things in general — the ethical misbehavior required generalizes broadly.
So my concern would be this: suppose the proposer at some point hit on the strategy of posing problems that were hard because the solver found them ethically distasteful to solve (either inherently, because the problem related to code security and hacking, or because the verbal framing around the problem implied the model was being asked to assist with something bad). The solver would then learn to overcome its qualms and solve these problems anyway, and emergent misalignment would ensue.
I think that from an AI Alignment perspective, giving an AI this much control over its own training seems very problematic. What we are mostly left with is controlling the interface the AI has to physical reality, i.e. its sensors and actuators.
For now, it seems to me that AI is mostly affecting the virtual world. I think the moment when AI can competently and more directly influence physical reality would be a tipping point, because then it can cause a lot more changes to the world.
I would say that the ability to do continuous learning is required to adapt well to the complexity of physical reality. So a big improvement in continuous learning might be an important next goalpost to watch for.
We should probably distinguish between a bunch of different things, such as: the ‘race’ between capabilities research and alignment research, AI supervision and control, some notion of “near-term default alignedness” of AIs, interpretability, and value alignment.
I think it’s pretty obvious what’s going on with the race between capabilities and alignment.
This doesn’t have a big impact on AI supervision or control, because the way we currently plan to do these things involves mostly treating the AIs as black boxes—strategies like untrusted model supervision with a small percent of honeypots should still work about as well.
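For readers unfamiliar with that strategy, here is a minimal sketch of the kind of honeypot check involved; the recall threshold, the escalation hook, and the exact structure are illustrative assumptions rather than any particular protocol's specification.

```python
import random
from typing import Callable, List

def untrusted_monitoring_round(
    outputs: List[str],                      # outputs from the untrusted worker model
    honeypots: List[str],                    # synthetic outputs known to contain attacks
    monitor: Callable[[str], bool],          # untrusted monitor: True = flag as suspicious
    escalate: Callable[[str], None],         # hand flagged outputs to trusted/human review
    min_honeypot_recall: float = 0.9,
) -> None:
    """Use an untrusted model to monitor another untrusted model's outputs,
    mixing in honeypots to check that the monitor actually flags known attacks.
    Only black-box behavior is used, so the check does not depend on how the
    monitored model was trained."""
    caught = sum(monitor(h) for h in honeypots)
    if caught < min_honeypot_recall * len(honeypots):
        raise RuntimeError("monitor failed the honeypot check; fall back to trusted review")

    for out in random.sample(outputs, k=len(outputs)):
        if monitor(out):
            escalate(out)
```

The relevant property here is that nothing in this loop cares whether the monitored model got its capabilities from human data or from self-play.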
I agree this is bad for the near-term default alignedness of AI, because currently all RL is bad for the near-term default alignedness of AI. But note that this means I disagree with your reasoning: it’s not “control over its training” that’s the problem, it’s the distinction between learning from human data versus learning from a reward signal.
Probably not much impact on interpretability, I just included that for completeness.
The impact on alignment per se is an interesting question (if not about this paper specifically, then about this general direction of progress) that I think people have understandably dodged. I’ll try my hand at a top-level comment on that one.
"it's the distinction between learning from human data versus learning from a reward signal." That's an interesting distinction. The difference I currently see between the two is that a reward signal can be hacked by the AI, while human data cannot. Is that an accurate thing to say?
Are there any resources you could recommend for alignment methods that take into account the distinction you mentioned?
That’s one difference! And probably the most dangerous one, if a clever enough AI notices it.
Some good things to read would be methods based on not straying too far from a “human distribution”: Quantilization (Jessica Taylor paper), the original RLHF paper (Christiano), Sam Marks’ post about decision transformers.
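As a pointer to what "not straying too far from a human distribution" looks like mechanically, here is a minimal sketch of a q-quantilizer: sample many actions from a base (human-like) policy, keep the top q fraction by estimated utility, and pick among those at random. The sampler, utility model, and parameter values below are placeholders, not anything from the papers above.

```python
import random
from typing import Callable

def quantilize(
    base_sample: Callable[[], str],     # sampler for the base "human-like" policy
    utility: Callable[[str], float],    # estimated utility / reward model
    q: float = 0.1,                     # fraction of the base distribution to keep
    n: int = 1000,                      # Monte Carlo sample size
) -> str:
    """Return an action drawn from (approximately) the top-q slice of the base
    distribution, ranked by estimated utility. q = 1 reduces to pure imitation;
    smaller q optimizes harder while staying inside behavior the base policy
    would plausibly produce."""
    samples = [base_sample() for _ in range(n)]
    samples.sort(key=utility, reverse=True)
    top = samples[: max(1, int(q * n))]
    return random.choice(top)
```

The safety argument is that an imperfect utility estimate can only steer you toward the better-looking tail of human-like behavior, not toward arbitrary behavior.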
They're important reads, but ultimately I'm not satisfied with them, for the same reason I mentioned about self-other overlap in the other comment a second ago: we want the AI to treat the human how the human wants to be treated; that doesn't mean we want the AI to act how the human wants to act. If we can't build AI that reflects this, we're missing some big insights.
This seems like a Chinese model for superintelligence! (All the authors are Chinese, though a few are working in the West.) Not in the AIXI sense of something which is optimal from the beginning, but rather something that could bootstrap its way to superintelligence. One could compare it to Schmidhuber's Gödel machine concept, but more concrete, and native to the deep learning era.
(If anyone has an argument as to why this isn’t a model that can become arbitrarily intelligent, I’m interested.)
There’s this paper suggesting RLVR (which is what Absolute Zero generates training data for) can’t reach capabilities exceeding those of the base pretrained model at something like pass@400 (depending on the task).
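For concreteness, pass@k in this line of work is usually computed with the standard unbiased estimator from code-generation evals: draw n samples per problem, count the c that pass the verifier, and estimate the probability that at least one of k samples (drawn without replacement from the n) would have passed. A small sketch:

```python
from math import comb
from typing import List, Tuple

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c correct.
    Estimates the probability that at least one of k samples (drawn without
    replacement from the n) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

The pass@400 claim is about where these curves sit at large k, not about single-sample accuracy.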
There are some pretty important caveats:
1. The paper can't distinguish between two hypotheses: that the capabilities stall happens because base models have a much more diverse space of capabilities to sample from (even though RL does impart new capabilities past pass@400), or that RL genuinely doesn't impart new capabilities to the learned algorithm past pass@400. Only the latter hypothesis actually matters for a limit on RL capabilities.
@Jozdien talks more about this below:
https://www.lesswrong.com/posts/s3NaETDujoxj4GbEm/tsinghua-paper-does-rl-really-incentivize-reasoning-capacity#Mkuqt7x7YojpJuCGt
2. As Asher stated, the result would be consistent with a world where RL increases capabilities arbitrarily far, so long as those capabilities become less diverse, and this paper doesn't give us the means to rule out RL increasing capabilities to the point where you do want to use the reasoning model over the base model:
https://www.lesswrong.com/posts/s3NaETDujoxj4GbEm/tsinghua-paper-does-rl-really-incentivize-reasoning-capacity#FJie6FweyqjqCKTMC
That paper is contradicted by this new NVIDIA paper, which shows the opposite using a 1.5B distill of DeepSeek R1. I don't have much technical knowledge, so a deep dive by someone more knowledgeable would be appreciated, especially in comparison to the Tsinghua paper.
I saw the Nvidia paper, I don’t think the data it presents makes that case. In particular, their “intermediate” checkpoint is too far away from the base model to correctly reference the crossover point (where the base model pass@k intersects the early RLVR pass@k). And the base model choice is strange for a study like this (it already has finetuning on DeepSeek-R1 traces in it, so the base model proper is mixed up with elicitation through R1 traces, when comparing with elicitation through subsequent RLVR).
In some of the plots, the intersection point isn’t visible, and mostly the “final” checkpoint seems to get worse than the “intermediate” checkpoint on pass@k plots at very high k, confirming rather than opposing the point of the Yue et al. paper (regarding the crossover point).
The fact that they’ve plotted pass@16 in Figure 1 as illustrative of the overall framing of the paper suggests that they aren’t grappling with the correct point, because if k=16 is earlier than the crossover point, then of course pass@16 performance will keep increasing. The question is whether it’ll ever exceed the performance at the crossover point.
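To pin down what "crossover point" means operationally: given pass@k curves for the base model and an RLVR checkpoint over a shared grid of k values, it is the smallest k at which the base curve catches up, and the question is whether the RL model's performance ever exceeds the base model's value there. A hypothetical helper, assuming both curves have already been computed:

```python
from typing import Optional, Sequence

def crossover_k(
    ks: Sequence[int],               # shared grid of k values, in ascending order
    base_passk: Sequence[float],     # base model pass@k at each k
    rl_passk: Sequence[float],       # RLVR checkpoint pass@k at each k
) -> Optional[int]:
    """Smallest measured k at which the base model's pass@k matches or exceeds
    the RL checkpoint's. Returns None if the RL model stays ahead over the
    whole measured range (i.e. any crossover lies beyond the data)."""
    for k, base, rl in zip(ks, base_passk, rl_passk):
        if base >= rl:
            return k
    return None
```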
(Of course, for sufficiently simple problems, RL works and can train a model to do things that the base model can’t do at all. And in principle RL should be able to do this in general, that’s the promise of RL. The question is whether it works for interesting problems that can’t be as easily solved with RL directly, using current methods for doing RLVR. If not, it can’t just be directly scaled to the moon within 1-2 years.)
Thank you for the quick reply.