I feel like there’s a somewhat common argument that RL isn’t all that dangerous because it generalizes cautiously beyond the training distribution—being outside the training distribution isn’t going to suddenly cause an RL system to make multi-step plans that are implied but never seen in training; it’ll probably just fall back on familiar, safe behavior.
To me, these arguments feel like they treat present-day model-free RL as the “central case,” and model-based RL as a small correction.
Anyhow, good post, I like most of the arguments, I just felt my reaction to this particular one could be made in meme format.
I just deny that they will update “arbitrarily” far from the prior, and I don’t know why you would think otherwise. There are compute tradeoffs, and you’re going to run only as many MCTS rollouts as you need to get good performance.
I completely agree. Smart agents will run only as many MCTS rollouts as they need to get good performance, no more—and no less. (And the smarter they are, and so the more compute they have access to, the more MCTS rollouts they are able to run, and the more they can change the default reactive policy.)
But ‘good performance’ on what, exactly? Maximizing utility. That’s what a model-based RL agent (not a simple-minded, unintelligent, myopic model-free policy like a frog’s retina) does.
If the Value of Information remains high from doing more MCTS rollouts, then an intelligent agent will keep doing rollouts for as long as the additional planning continues to pay its way in expected improvements. The point of doing planning is policy/value improvement. The more planning you do, the more you can change the original policy. (This is how you train AlphaZero so far from its prior policy, of a randomly-initialized CNN playing random moves, to its final planning-improved policy, a superhuman Go player.) Which may take it arbitrarily far in terms of policy—like, for example, if it discovers a Move 37 where there is even a small <1/1000 probability that a highly-unusual action will pay off better than the default reactive policy and so the possibility is worth examining in greater depth...
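To make the “keep planning while the information is worth it” point concrete, here is a toy sketch (not any real MCTS implementation; the two-action payoffs, the noise scale, and the standard-error-based VOI proxy are all made up for illustration) of an agent that keeps running rollouts until more simulation no longer pays, and thereby overturns its default action preference:

```python
import random

random.seed(0)

# Hypothetical two-action problem: the "default reactive policy" prefers
# action 0, but action 1 has a slightly higher true expected payoff.
TRUE_MEAN = {0: 0.50, 1: 0.52}

def rollout(action):
    """One noisy simulated playout of an action (stand-in for an MCTS rollout)."""
    return TRUE_MEAN[action] + random.gauss(0, 0.1)

def plan(max_rollouts=10_000, voi_threshold=1e-5):
    """Keep sampling while the estimated value of more information is high.

    The VOI proxy here is crude: the squared standard error of the mean
    estimate.  Once it drops below the threshold, more rollouts are
    judged not to pay their way, and planning stops.
    """
    sums, counts = {0: 0.0, 1: 0.0}, {0: 0, 1: 0}
    for n in range(1, max_rollouts + 1):
        a = n % 2                       # alternate arms for simplicity
        sums[a] += rollout(a)
        counts[a] += 1
        if n > 100:
            se = 0.1 / (counts[0] ** 0.5)   # standard-error proxy
            if se * se < voi_threshold:     # marginal info no longer pays
                break
    means = {a: sums[a] / counts[a] for a in (0, 1)}
    return max(means, key=means.get), n

best, n_used = plan()
print(best, n_used)  # with enough rollouts, planning should flip the default choice
```

The point of the sketch: the stopping rule is economic (compute tradeoffs, as you say), but when the stakes make small payoff differences worth resolving, the economically-rational amount of planning is exactly enough to move the agent off its default policy.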
(The extreme reductio here would be a pure MCTS with random playouts: it has no policy at all at the beginning, and yet MCTS is a complete algorithm, so it converges to the optimal policy, whatever that is, given enough rollouts. More rollouts = more update away from the prior. Or if you don’t like that, good old policy/value iteration on a finite MDP is an example: start with random parameters, and the more iterations you run, the further the policy provably, monotonically travels from its random initialization toward the optimal policy.)
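The policy-iteration version of the claim is easy to check numerically. A minimal sketch on a randomly-generated finite MDP (all transition probabilities and rewards here are arbitrary made-up numbers, not from any real problem):

```python
import numpy as np

np.random.seed(0)

# A tiny hypothetical 3-state, 2-action MDP.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = np.random.uniform(0, 1, size=(n_states, n_actions))                 # R[s, a]

def policy_value(pi):
    """Exact value of a deterministic policy via the linear Bellman equations."""
    Ppi = P[np.arange(n_states), pi]   # (S, S) transition matrix under pi
    Rpi = R[np.arange(n_states), pi]   # (S,) reward vector under pi
    return np.linalg.solve(np.eye(n_states) - gamma * Ppi, Rpi)

# Start from an arbitrary "prior" policy and apply greedy improvement steps.
pi = np.zeros(n_states, dtype=int)
values = [policy_value(pi)]
for _ in range(10):
    Q = R + gamma * P @ values[-1]     # (S, A) action values under current V
    pi = Q.argmax(axis=1)              # greedy policy-improvement step
    values.append(policy_value(pi))

# Policy iteration improves monotonically until it reaches the optimal policy,
# so total value never decreases as we iterate away from the initialization.
gains = [v.sum() for v in values]
print(gains)
```

Each extra improvement step can only move the policy further (in value terms) from where it started, which is the sense in which more iterations = more update away from the prior.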
One might say that the point of model-based RL is to not be stupid (and thus not safe merely by virtue of stupidity) in all the ways you repeatedly emphasize purely model-free RL agents may be. And that’s why AGI will not be purely model-free, nor are our smartest current frontier models like LLMs purely model-free. I don’t see how you get this vision of AGI as some sort of gigantic frog retina, which is the strawman that you seem to be aiming at in all your arguments about why you are convinced there’s no danger.
Obviously AGI will do things like ‘plan’ or ‘model’ or ‘search’ - or if you think that it will not, you should say so explicitly, and be clear about what kind of algorithm you think AGI would be, and explain how you think that’s possible. I would be fascinated to hear how you think that superhuman intelligence in all domains like programming or math or long-term strategy could be done by purely model-free approaches which do not involve planning or searching or building models of the world or utility-maximization!
(Or to put it yet another way: ‘scheming’ is not a meaningful discrete category of capabilities, but a value judgment about particular ways to abuse theory of mind / world-modeling capabilities; and it’s hard to see how one could create an AGI smart enough to be ‘AGI’, but also so stupid as to not understand people or be incapable of basic human-level capabilities like ‘be a manager’ or ‘play poker’, or generalize modeling of other agents. It would be quite bizarre to imagine a model-free AGI which must learn a separate ‘simple’ reactive policy of ‘scheming’ for each and every agent it comes across, wasting a huge number of parameters & samples every time, as opposed to simply meta-learning how to model agents in general, and applying this using planning/search to all future tasks, at enormous parameter savings and zero-shot.)
This monograph by Bertsekas on the interrelationship between offline RL and online MCTS/search might be interesting—http://www.athenasc.com/Frontmatter_LESSONS.pdf—since it argues that we can conceptualise the contribution of MCTS as essentially a single Newton step from the offline start point towards the solution of the Bellman equation. If this is actually the case (I haven’t worked through all the details yet), then it could be used to provide some kind of bound on the improvement/divergence you can get once you add online planning to a model-free policy.
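The flavor of that Newton-step claim is easy to illustrate numerically, if not to prove: the value of the one-step-lookahead policy computed from a crude offline value estimate typically lands far closer to the optimum than the estimate itself. A toy sketch (the MDP below is randomly generated, and using an all-zeros value estimate as the “offline start point” is my simplifying assumption, not Bertsekas’s setup):

```python
import numpy as np

np.random.seed(1)

# Hypothetical 4-state, 3-action MDP, standard finite-MDP conventions.
S, A, gamma = 4, 3, 0.9
P = np.random.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s']
R = np.random.uniform(0, 1, size=(S, A))           # R[s, a]

def bellman(J):
    """Bellman optimality operator: (TJ)(s) = max_a [R(s,a) + gamma * E_s' J(s')]."""
    return (R + gamma * P @ J).max(axis=1)

def greedy(J):
    """One-step lookahead policy with respect to an approximate value function J."""
    return (R + gamma * P @ J).argmax(axis=1)

def policy_value(pi):
    """Exact value of a deterministic policy via the linear Bellman equations."""
    Ppi, Rpi = P[np.arange(S), pi], R[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * Ppi, Rpi)

J0 = np.zeros(S)                 # crude "offline" value estimate

Jstar = J0.copy()                # optimal values via value iteration
for _ in range(2000):
    Jstar = bellman(Jstar)

# Value actually achieved by acting greedily (one-step lookahead) w.r.t. J0.
J_lookahead = policy_value(greedy(J0))

err_before = np.abs(Jstar - J0).max()          # error of the offline estimate
err_after = np.abs(Jstar - J_lookahead).max()  # error after one lookahead "step"
print(err_before, err_after)   # the lookahead policy's error is strictly smaller
```

The single greedy-lookahead step plays the role of the Newton step: one application of online planning on top of the offline estimate, with a disproportionately large jump toward the fixed point of the Bellman equation.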
Model-based RL has a lot of room to use models more cleverly, e.g. learning hierarchical planning, and the better models are for planning, the more rewarding it is to let model-based planning take the policy far away from the prior.
E.g. you could get a hospital policy-maker that actually will do radical new things via model-based reasoning, rather than just breaking down when you try to push it too far from the training distribution (as you correctly point out a filtered LLM would).
In some sense the policy would still be close to the prior in a distance metric induced by the model-based planning procedure itself, but I think at that point the distance metric has come unmoored from the practical difference to humans.