As I’ve said before, I think you greatly overrate the difficulty of putting search into neural nets, and this is an example of it. It seems to me like it is entirely possible to make a generic LLM implement an equivalent to AlphaZero and be capable of expert iteration, without an elaborate tree scaffolding. A tree search is just another algorithm which can be reified as a sequence, like all algorithms (because they are implemented on a computer).
All AlphaZero is, is a way of doing policy iteration/Newton updates by running a game state forward for a few plies, evaluating, and updating estimates. It’s not magic, and can obviously be encoded into a LLM’s generative process.
Here’s a concrete example of how, in principle, I think a LLM can do AlphaZero-style expert iteration for Go. A LLM can serialize a board with value estimates as simply a few hundred tokens (361 points, 361 value estimates, miscellaneous metadata); this means that in a frontier LLM like Claude-4-opus with 200k ctx, you can easily fit in 200 board states. So you can serialize out the lookahead of a bunch of possible moves and resulting board states (eg. take the top 14 moves and imagine the resulting board states, and then imagine their next 14 top moves; for comparison, TD-Gammon looked forward like 1 move), back-propagate an updated value estimate, and spit out the original board state with better value estimates: “Move #4 was better than it looked, so I will +0.01 to the value estimate for it.” This improved board is now in context, and can be dynamically-evaluated to update the LLM: now it has to predict the new board state with the final improved estimates, and that improves the policy. The LLM finishes by setting up the next planning step: pick a deeper board state to evaluate next, and if the next board state is the end of the game, then it starts over with a fresh game. Run this indefinitely.
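To sanity-check the arithmetic above, here is a rough back-of-the-envelope sketch. It assumes one token per board point and one per value estimate, plus a made-up metadata allowance of 50 tokens; none of these numbers come from a real tokenizer, and the function names are hypothetical.

```python
# Back-of-the-envelope check of the context budget described above.
# Assumptions (not from any real tokenizer): one token per board point,
# one token per value estimate, and a flat metadata allowance.
BOARD_POINTS = 19 * 19      # 361 intersections on a Go board
METADATA_TOKENS = 50        # hypothetical allowance for turn, ko, komi, etc.
CONTEXT_TOKENS = 200_000    # context window assumed in the comment
BRANCHING = 14              # top-k moves expanded at each ply


def tokens_per_board(points=BOARD_POINTS, metadata=METADATA_TOKENS):
    """Tokens for one serialized board: stones + value estimates + metadata."""
    return points + points + metadata


def boards_in_context(context=CONTEXT_TOKENS):
    """How many serialized boards fit in the context window."""
    return context // tokens_per_board()


def boards_in_lookahead(branching=BRANCHING, depth=2):
    """Boards produced by expanding the top `branching` moves to `depth` plies."""
    return sum(branching ** d for d in range(1, depth + 1))


print(tokens_per_board())     # 772
print(boards_in_context())    # 259
print(boards_in_lookahead())  # 210
```

Under these assumptions a two-ply, top-14 lookahead (14 + 196 = 210 boards) fits inside a single 200k context, which is consistent with the "easily 200 board states" estimate.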
It repeatedly iterates through a possible game, evaluating each position to a certain depth, updating its weights to incorporate the policy improvement from the evaluation, and restarting with a fresh game. All serialized out as a long array/sequence, the tree just being implicitly represented by successive board states. (And then now that you have that in mind, you can imagine how to do things like deep rollouts: 200 moves is around a normal game of Go, so random rollouts are doable from most board states, and the LLM can just toggle between a shallow tree search and deep randomized rollouts if necessary eg. by adding a 0/1 token prefix.)
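A toy sketch of the "tree as a flat sequence" idea, with heavy simplifications: single-pile Nim (take 1 to 3 stones, taking the last stone wins) stands in for Go, a trivial `prior_value` stands in for the model's value estimate, and there is no LLM or weight update at all. Every name here is hypothetical. It only illustrates that a lookahead can be written out depth-first as one sequence and the values backed up by scanning that sequence, with no explicit tree structure.

```python
from dataclasses import dataclass


@dataclass
class Record:
    depth: int    # ply depth; the tree is implied by depth plus ordering
    move: int     # stones taken to reach this state (0 for the root)
    pile: int     # stones left for the player to move
    value: float  # estimated value for the player to move


def prior_value(pile):
    """Stand-in for the model's initial guess: it only knows terminal states."""
    return -1.0 if pile == 0 else 0.0


def serialize_lookahead(pile, max_depth, depth=0, move=0, out=None):
    """Write the lookahead as one flat, depth-first sequence of records."""
    if out is None:
        out = []
    out.append(Record(depth, move, pile, prior_value(pile)))
    if depth < max_depth:
        for take in (1, 2, 3):
            if take <= pile:
                serialize_lookahead(pile - take, max_depth, depth + 1, take, out)
    return out


def back_up(seq):
    """Negamax backup over the flat sequence, scanning right to left."""
    best_child = {}  # depth -> best negated child value seen since last parent
    for rec in reversed(seq):
        child_depth = rec.depth + 1
        if child_depth in best_child:
            # All of this node's children have been seen; take the best one.
            rec.value = best_child.pop(child_depth)
        negated = -rec.value  # value from the parent's point of view
        best_child[rec.depth] = max(best_child.get(rec.depth, negated), negated)
    return seq


seq = back_up(serialize_lookahead(pile=4, max_depth=2))
print(len(seq))      # 10 records in the serialized lookahead
print(seq[0].value)  # -1.0: four stones is a loss for the player to move
```

In the scheme described above, the improved root estimate would then be the training target for the next weight update; that step is omitted here.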
At no point do you need explicit tree scaffolding as you bootstrap from a LLM clueless about playing Go to the high performance that we know LLMs trained by imitation learning on board states/values/policies can reach, and at no point have I invoked a cognitive operation which is not easier than a lot of things we see LLMs do routinely, or where it’s implausible that they could do it. It is probably a lot less efficient and has other practical issues like how you integrate the rules of Go akin to AlphaZero/MuZero, etc, but in principle I think this algorithm is well-defined, concrete, and would work.
Hmm, I don’t particularly disagree with anything you wrote. I think you’re misunderstanding the context of this conversation.
I wasn’t bringing up tree search because I think tree search is required for AGI. (I don’t think that.)
Rather, I was making a point that there will need to be some system that updates the weights (not activations) of an AGI as it runs, just as adult humans learn and figure out new things over time as they work on a project.
What is this system that will update the weights? I have opinions, but in general, there are lots of possible approaches. Self-play-RL with tree search is one possibility. RL without tree search is another possibility. The system you described in your comment is yet a third possibility. Whatever! I don’t care, that’s not my point here.
What is my point? How did this come up? Well, Cole’s OP is relying on the fact that “[pure] imitation learning is probably existentially safe”. And I was saying that pure imitation learning imposes a horrific capability tax that destroys his whole plan, because a human has open-ended autonomous learning, whereas a model trained by pure imitation learning (on that same human) does not. So you cannot simply swap out the former for the latter.
In Cole’s most recent reply, it appears that what he has in mind is actually a system that’s initialized by being trained to imitate humans, but then it also has some system for open-ended continuous learning from that starting point.
And then I replied that this would solve the capability issue, but only by creating a new problem that “[pure] imitation learning is probably existentially safe” can no longer function as part of his safety argument, because the continuous learning may affect alignment.
For example, if you initialize a PacMan RL agent on human imitation (where the humans were all very nice to the ghosts during play), and then you set up that agent to continuously improve by RL policy optimization, using the score as the reward function, then it’s gonna rapidly stop being nice to the ghosts.
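Here is a deliberately tiny illustration of that drift, not a real PacMan agent. It collapses the game to a single two-action choice, uses made-up rewards (sparing a ghost scores 0, eating one scores 1), and applies the exact expected policy gradient for a softmax policy instead of sampled rollouts. All names and numbers are assumptions for illustration.

```python
import math

ACTIONS = ("spare_ghost", "eat_ghost")
REWARD = {"spare_ghost": 0.0, "eat_ghost": 1.0}  # hypothetical score per action


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def imitation_init(p_nice=0.99):
    """Logits reproducing demonstrators who spare ghosts 99% of the time."""
    return [math.log(p_nice), math.log(1.0 - p_nice)]


def policy_gradient_step(logits, lr=1.0):
    """One exact (expected) policy-gradient ascent step on expected score."""
    probs = softmax(logits)
    rewards = [REWARD[a] for a in ACTIONS]
    baseline = sum(p * r for p, r in zip(probs, rewards))
    # For a softmax policy, d E[r] / d logit_a = pi_a * (r_a - E[r]).
    return [x + lr * p * (r - baseline) for x, p, r in zip(logits, probs, rewards)]


def train(steps=200):
    logits = imitation_init()
    history = [softmax(logits)[1]]  # P(eat_ghost), starting from imitation
    for _ in range(steps):
        logits = policy_gradient_step(logits)
        history.append(softmax(logits)[1])
    return history


history = train()
print(f"P(eat_ghost) after imitation:   {history[0]:.3f}")
print(f"P(eat_ghost) after RL on score: {history[-1]:.3f}")
```

The imitation-initialized policy starts out almost never eating ghosts, and score-only optimization steadily pushes the probability up, since nothing in the reward preserves the niceness that came from the demonstrations.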
Does that help explain where I’m coming from?
That’s not what I have in mind; see my most recent reply.
Also, I am not sure that removing the imitation learning step would actually “destroy my whole plan.” It would perhaps prevent it from scaling past a certain point, but I think we would still be left in a much more tractable position.