You could equally well say: “AlphaZero learns, therefore a sufficiently good imitation of AlphaZero should also learn”. Right? But let’s think about what that would entail.
AlphaZero learns via a quite complicated algorithm: it tracks the state of a Go board through self-play; each step of the self-play involves a tree search with thousands of queries to a 30M-parameter ConvNet; at the end of the game, a Go engine is called to see who won; and then there's a set of gradient-descent steps on that 30M-parameter ConvNet. Then repeat that whole process fifty million times. And now you have a trained AlphaZero.
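For concreteness, here is a rough pseudocode sketch of the loop just described. Every name in it is an illustrative stub rather than a real API, and the actual AlphaZero implementation differs in many details (replay buffers, asynchronous self-play, the exact search and loss), but the overall shape is the point:

```python
# Rough pseudocode of the training loop described above. Every name here
# (ConvNet, mcts, sgd_update, ...) is an illustrative stub, not a real API.

def ConvNet(n_params): ...                      # ~30M-parameter policy/value net
def new_game(): ...                             # fresh board state
def game_over(board): ...
def mcts(board, net): ...                       # tree search; many net queries per move
def sample_move(visit_counts): ...
def play(board, move): ...
def score_game(board): ...                      # rules engine decides who won
def sgd_update(net, game_record, winner): ...   # gradient-descent steps on the net

def train_alphazero(num_games=50_000_000):
    net = ConvNet(n_params=30_000_000)
    for _ in range(num_games):
        board, game_record = new_game(), []
        while not game_over(board):
            # each move of self-play runs a tree search that queries the net
            visit_counts = mcts(board, net)
            move = sample_move(visit_counts)
            game_record.append((board, visit_counts))
            board = play(board, move)
        winner = score_game(board)
        # push the net toward the search's move distribution and the game outcome
        sgd_update(net, game_record, winner)
    return net
```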
Now, imagine taking some generic algorithm class (say, an RNN) and training it “to imitate the process by which AlphaZero learns”. It’s just not gonna work, right? Granted, RNNs are Turing complete, so perhaps one could prove that an astronomically large RNN trained on astronomically much data can emulate (in its astronomically large number of weights) this entire detailed process of running a self-play tree search and performing gradient descent on this 30M-parameter ConvNet. …But c’mon, that’s not gonna realistically work in practice, right? (Related: §3 here.)
IMO, the only realistic way to make something that learns like AlphaZero learns is to build AlphaZero itself, or at least something awfully similar to it. I think the tree search etc. needs to be in the source code, not implicit in the learned weights of some generic algorithm class like RNNs that bears no particular relation to tree search. …But if you do that, then I would call it “reverse-engineering AlphaZero”, not “imitation learning from AlphaZero”.
By the same token, I do think it’s possible to make something that learns like a human, but I think it would require reverse-engineering human brains, not just imitation-learning from human data.
I think your intuitions here are highly misguided. I don’t agree with your conclusions about AlphaZero at all. You could easily train a model by distilling AlphaZero. All the complicated steps are only necessary to bootstrap from nothing.
Yes, distilling a snapshot of AlphaZero is easy. The hard part is distilling the process by which AlphaZero improves: not just bootstrapping from nothing, but also turning an Elo-2500 AlphaZero into an Elo-3500 AlphaZero.
Is this a way to operationalize our disagreement?
CLAIM:
Take AlphaZero-chess and train it (via self-play RL as usual) from scratch to Elo 2500 (grandmaster level), but no further.
Now take a generic DNN like a transformer. Give it training data showing how AlphaZero-in-training developed from Elo 0, to Elo 1, … to Elo 1000, to Elo 1001, to Elo 1002, … to Elo 2500. [We can use any or all of those AlphaZero-snapshots however we want to build our training dataset.] And we now have a trained model M.
Now use this trained model M by itself (no weight-updates, no self-play, just pure inference) to extrapolate this process of improvement forward.
The claim is: If we do this right, we can wind up with an Elo-3500 chess-playing agent (i.e. radically superhuman, comparable to what you’d get by continuing the AlphaZero self-play RL training for millions more games).
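(To spell out one hypothetical way this setup could be instantiated: the sketch below is purely illustrative; the snapshot format, game serialization, and Elo-conditioned prompting are assumptions added for concreteness, not part of the claim.)

```python
# Hypothetical sketch of the claimed setup. `snapshots`, `self_play_games`,
# `serialize`, and the model interface are all illustrative placeholders;
# Elo-conditioned prompting is just one way the extrapolation might be posed.

def self_play_games(snapshot): ...                # games played by that snapshot
def serialize(game): ...                          # game record -> token string

def build_corpus(snapshots):
    """Serialize games from successive AlphaZero snapshots, tagged by strength,
    so the sequence model sees the whole trajectory of improvement."""
    corpus = []
    for snap in snapshots:                        # Elo 0, 1, ..., 2500
        for game in self_play_games(snap):
            corpus.append(f"ELO={snap.elo}\n{serialize(game)}")
    return corpus

def extrapolate(model_M, target_elo=3500):
    """Pure inference, no weight updates, no self-play: ask the trained model
    to continue the improvement trend past the strongest snapshot it saw."""
    return model_M.generate(f"ELO={target_elo}\n")
```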
I feel very strongly that this claim is false. Do you think it’s true?
(This is relevant because I think that “the process by which AlphaZero-in-training goes from Elo 2500 to Elo 3500” is in the same general category as “the process by which a human goes from mediocre and confused understanding of some novel domain to deep understanding and expertise, over the course of weeks and months and years”.)
As I’ve said before, I think you greatly overrate the difficulty of putting search into neural nets, and this is an example of it. It seems to me like it is entirely possible to make a generic LLM implement an equivalent to AlphaZero and be capable of expert iteration, without an elaborate tree scaffolding. A tree search is just another algorithm which can be reified as a sequence, like all algorithms (because they are implemented on a computer).
All AlphaZero is, is a way of doing policy iteration/Newton updates by running a game state forward for a few plies, evaluating, and updating estimates. It’s not magic, and can obviously be encoded into a LLM’s generative process.
Here's a concrete example of how, in principle, I think a LLM can do AlphaZero-style expert iteration for Go: A LLM can serialize a board with value estimates as just a few hundred tokens (361 points, 361 value estimates, miscellaneous metadata); this means that in a frontier LLM like Claude-4-opus with 200k ctx, you can easily fit 200 board states; so you can serialize out the lookahead of a bunch of possible moves and resulting board states (eg. take the top 14 moves, imagine the resulting board states, and then imagine their next 14 top moves each; for comparison, TD-Gammon looked forward like 1 move); and can back-propagate an updated value estimate, and spit out the original board state with better value estimates. “Move #4 was better than it looked, so I will add +0.01 to the value estimate for it.” This improved board is now in context, and can be dynamically-evaluated to update the LLM: now it has to predict the new board state with the final improved estimates, and that improves the policy. The LLM finishes by setting up the next planning step: pick a deeper board state to evaluate next, and if the next board state is the end of the game, then it starts over with a fresh game. Run this indefinitely.
It repeatedly iterates through a possible game, evaluating each position to a certain depth, updating its weights to incorporate the policy improvement from the evaluation, and restarting with a fresh game. All serialized out as a long array/sequence, the tree just being implicitly represented by successive board states. (And now that you have that in mind, you can imagine how to do things like deep rollouts: 200 moves is around the length of a normal game of Go, so random rollouts are doable from most board states, and the LLM can just toggle between a shallow tree search and deep randomized rollouts if necessary, eg. by adding a 0⁄1 token prefix.)
At no point do you need explicit tree scaffolding as you bootstrap from a LLM clueless about playing Go to the high performance that we know LLMs trained by imitation learning on board states/values/policies can reach, and every cognitive operation I have invoked is easier than plenty of things we see LLMs do routinely, or at least plausibly within their abilities. It is probably a lot less efficient and has other practical issues, like how you integrate the rules of Go (the AlphaZero-vs-MuZero question), etc., but in principle I think this algorithm is well-defined, concrete, and would work.
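(Here is a minimal code sketch of the loop described above, just to make its shape explicit. The LLM interface, the serialization format, the prompts, and the branching factor are all assumptions for illustration; none of this is a real frontier-model API.)

```python
# Minimal sketch of the serialized expert-iteration loop described above.
# The LLM interface, board encoding, and prompts are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class BoardState:
    stones: list          # 361 entries: 0 empty, 1 black, 2 white
    values: list          # 361 per-move value estimates
    to_move: int

def serialize(state: BoardState) -> str:
    """A board plus value estimates fits in a few hundred tokens."""
    return (" ".join(map(str, state.stones)) + " | "
            + " ".join(f"{v:+.2f}" for v in state.values)
            + f" | to_move={state.to_move}")

def parse_board(text: str) -> BoardState: ...     # inverse of serialize(); omitted

class LLM:
    """Stub for the model doing all the work (hypothetical interface)."""
    def complete(self, prompt: str) -> str: ...
    def update_on(self, text: str) -> None: ...   # dynamic evaluation: a weight update

def expert_iteration_step(llm: LLM, root: BoardState, top_k: int = 14) -> BoardState:
    ctx = serialize(root)
    # 1. Shallow lookahead, serialized into context instead of an explicit tree:
    #    imagine the top-k moves and, for each, the opponent's top-k replies.
    lookahead = llm.complete(
        f"{ctx}\nExpand the top {top_k} moves and their top {top_k} replies, with value estimates:")
    # 2. Back up the leaf evaluations into improved value estimates at the root
    #    ("Move #4 was better than it looked, so add +0.01 to its value estimate").
    improved = llm.complete(
        f"{ctx}\n{lookahead}\nRewrite the root board with the backed-up value estimates:")
    # 3. Policy improvement: update the weights so the model now predicts the
    #    improved estimates directly.
    llm.update_on(f"{ctx}\n{improved}")
    # 4. Set up the next planning step: pick a deeper position to study, or
    #    start a fresh game if this one is over.
    nxt = llm.complete(f"{improved}\nNext board state to evaluate (fresh game if finished):")
    return parse_board(nxt)
```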
Hmm, I don’t particularly disagree with anything you wrote. I think you’re misunderstanding the context of this conversation.
I wasn’t bringing up tree search because I think tree search is required for AGI. (I don’t think that.)
Rather, I was making a point that there will need to be some system that updates the weights (not activations) of an AGI as it runs, just as adult humans learn and figure out new things over time as they work on a project.
What is this system that will update the weights? I have opinions, but in general, there are lots of possible approaches. Self-play-RL with tree search is one possibility. RL without tree search is another possibility. The system you described in your comment is yet a third possibility. Whatever! I don’t care, that’s not my point here.
What is my point? How did this come up? Well, Cole's OP is relying on the claim that “[pure] imitation learning is probably existentially safe”. And I was saying that pure imitation learning imposes a horrific capability tax that destroys his whole plan, because a human has open-ended autonomous learning, whereas a model trained by pure imitation learning (on that same human) does not. So you cannot simply swap out the former for the latter.
In Cole’s most recent reply, it appears that what he has in mind is actually a system that’s initialized by being trained to imitate humans, but then it also has some system for open-ended continuous learning from that starting point.
And then I replied that this would solve the capability issue, but only by creating a new problem: “[pure] imitation learning is probably existentially safe” can no longer do its work as part of his safety argument, because the continuous learning may affect alignment.
For example, if you initialize a PacMan RL agent on human imitation (where the humans were all very nice to the ghosts during play), and then you set up that agent to continuously improve by RL policy optimization, using the score as the reward function, then it’s gonna rapidly stop being nice to the ghosts.
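(A toy sketch of that failure mode, with made-up interfaces: nothing in the fine-tuning objective mentions the ghosts, so whatever ghost-friendliness the imitation-initialized policy had is free to be optimized away.)

```python
# Toy sketch of the PacMan example. The environment/policy interfaces are made
# up; the point is only what the fine-tuning reward does and does not contain.

def rollout(policy, env): ...                     # play one episode, return transitions
def update_policy(policy, trajectory, ret): ...   # generic policy-gradient-style update

def reward(transition):
    # The optimization target is the game score, full stop. "Was nice to the
    # ghosts" appears nowhere in it, even though it was present in every human
    # demonstration the agent was initialized from.
    return transition.score_delta

def finetune(imitation_policy, env, steps=100_000):
    """Start from the behavior-cloned (human-imitating) policy and improve it
    against the score. Any ghost-friendly behavior that isn't rewarded gets
    traded away whenever dropping it increases the score."""
    policy = imitation_policy
    for _ in range(steps):
        trajectory = rollout(policy, env)
        update_policy(policy, trajectory, sum(reward(t) for t in trajectory))
    return policy
```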
Does that help explain where I’m coming from?
That’s not what I have in mind, see my most recent reply.
Also, I am not sure that removing the imitation learning step would actually “destroy my whole plan.” It would perhaps prevent it from scaling past a certain point, but I think we would still be left in a much more tractable position.
The claim is certainly false.
Before LLMs reach AGI, someone will have to solve efficient, online continual learning. This is an open technical problem, which is why I doubt that the current paradigm scales to superintelligence. It seems that an appropriate solution for general-purpose agents would also lead to a solution for agents trained through imitation learning.
Great, glad we agree on that!
Next: If we take an “agent trained through imitation learning”, and glue on a “solution to efficient, online continual learning”, then the result (after it runs a while) is NOT
“an agent trained through imitation learning”,
but rather
“an agent that is partly trained through imitation learning, and partly trained through [however the online continual learning works]”.
Right?
And now your proposal requires an assumption that this online continual learning system, whatever it is, does not undermine the agent’s alignment. Right?
I’m not suggesting an agent that is partly trained through imitation learning, and then partly trained through continual learning on some other objective. I am suggesting an agent that is trained solely through imitation learning, using improved algorithms that more faithfully imitate humans over longer timescales: it learns because the humans it imitates learn, and it learns as humans learn! I think that the obstacles to doing this are very similar to the obstacles to continual learning in LLMs, though they are not exactly the same, and it’s certainly conceivable that LLM algorithms for continual learning will be invented which are not transferable to pure imitation learning. In particular, LLMs may start some kind of feedback loop of recursive self-improvement before faithful imitation learning becomes technically feasible. However, I see no fundamental reason to expect that this is the only or most likely path. And all alignment plans are sunk by recursive self-improvement happening tomorrow.
To be explicit: LLMs are not perfect assistants or agents because their in-context learning is limited. This problem is not specific to fine-tuned models, though; even base models have limited in-context learning. The most direct solutions to this problem would allow them to perform in-context learning with the same objective as they already do (sequence prediction) but for longer. The analogue of this for imitation learning should similarly perform imitation learning, and then imitate faithfully for longer, including “in-context” learning as necessary.
(Thanks for your patient engagement!)
If you believe
it is probably true that future pure imitation learning techniques can capture the process by which humans figure out new scientific ideas over millions of seconds, AND
it is “certainly false” that future pure imitation learning techniques can capture the process by which AlphaZero figures out new chess strategies over millions of games
then I’m curious what accounts for the difference, in your mind?
More detail, just to make sure we’re on the same page: The analogy I’m suggesting is:
(A1) AlphaZero goes from Elo 0 to Elo 2500
(A2) …via self-play RL
(A3) Future pure imitation learner extrapolates this process forward to get Elo 3500 chess skill
-versus-
(B1) Human civilization goes from “totally clueless about nanotech design principles / technical alignment / whatever” in 1900 to “somewhat confused about nanotech design principles / technical alignment / whatever” in 2025
(B2) …via whatever human brains are doing (which I claim centrally involves RL)
(B3) Future pure imitation learner extrapolates this process forward to get crystal-clear understanding of nanotech design principles / technical alignment / whatever
You think that (A3) is “certainly false” while (B3) is plausible, and I’m asking what you see as the disanalogy.
(For the record, I think both (A3) and (B3) are implausible. I think that LLM in-context learning can capture the way that humans figure out new things over seconds, but not the way that humans figure out new things over weeks and months. And I don’t think that’s a solvable problem within imitation learning; rather, it points to a deep deficiency in imitation learning, a deficiency which is only fixable by learning algorithms with non-imitation-learning objectives.)
I didn’t realize you intended A3 to refer to future imitation learning systems. In that case, yes, it will work. You might have to use some tricks similar to gwern’s suggestions—e.g. the imitation learner should (for fair comparison) also have access to the simulation platform that AlphaZero uses, and would have to play about as many games as AlphaZero plays. But it does not have to do the same search and policy distillation training process that AlphaZero does.