There are compute tradeoffs and you’re going to run only as many MCTS rollouts as you need to get good performance.
I completely agree. Smart agents will run only as many MCTS rollouts as they need to get good performance, no more—and no less. (And the smarter they are, and so the more compute they have access to, the more MCTS rollouts they are able to run, and the more they can change the default reactive policy.)
But ‘good performance’ on what, exactly? Maximizing utility. That’s what a model-based RL agent (not a simple-minded, unintelligent, myopic model-free policy like a frog’s retina) does.
If the Value of Information of doing more MCTS rollouts remains high, then an intelligent agent will keep doing rollouts for as long as the additional planning continues to pay its way in expected improvements. The point of doing planning is policy/value improvement. The more planning you do, the more you can change the original policy. (This is how you train AlphaZero so far from its prior policy, of a randomly-initialized CNN playing random moves, to its final planning-improved policy, a superhuman Go player.) That may take it arbitrarily far in terms of policy: for example, if it discovers a Move 37, where even a small <1/1000 probability that a highly-unusual action will pay off better than the default reactive policy makes the possibility worth examining in greater depth...
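To make the “more rollouts = more deviation from the prior” point concrete, here is a minimal sketch with my own toy numbers and action names (nothing here is taken from AlphaZero itself): a PUCT-style planner over a single decision where the prior/reactive policy puts only a few percent of its probability on the better move, and the planner’s improved policy is just its normalized visit counts. With a small rollout budget the output follows the prior; with a large budget nearly all probability mass migrates to the initially-improbable move. (I use a 3% prior rather than the <1/1000 above purely to keep the rollout counts small.)

```python
import math
import random

# Hypothetical toy setup: two moves, a strongly-favored "book" move and a
# rarely-considered "Move 37"-style move with higher true expected payoff.
ACTIONS = ["book_move", "move_37"]
PRIOR = {"book_move": 0.97, "move_37": 0.03}    # default reactive policy
TRUE_MEAN = {"book_move": 0.5, "move_37": 0.8}  # unknown to the planner

def rollout(action):
    """One noisy simulated return for taking `action`."""
    return TRUE_MEAN[action] + random.gauss(0.0, 0.1)

def plan(n_rollouts, c_puct=1.0):
    """PUCT-style planning: estimated value + prior-weighted exploration bonus."""
    counts = {a: 0 for a in ACTIONS}
    value_sum = {a: 0.0 for a in ACTIONS}
    for t in range(1, n_rollouts + 1):
        def score(a):
            q = value_sum[a] / counts[a] if counts[a] else 0.0
            u = c_puct * PRIOR[a] * math.sqrt(t) / (1 + counts[a])
            return q + u
        a = max(ACTIONS, key=score)
        counts[a] += 1
        value_sum[a] += rollout(a)
    total = sum(counts.values())
    # The planner's improved policy is the normalized visit counts.
    return {a: counts[a] / total for a in ACTIONS}

random.seed(0)
for n in (100, 1_000, 100_000):
    print(n, plan(n))
# With a small budget, essentially every visit follows the prior's preferred
# move; with a large budget, nearly all probability mass moves onto the
# initially-improbable but better move.
```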
(The extreme reductio here would be pure MCTS with random playouts: it has no policy at all at the beginning, and yet, because MCTS is a complete algorithm, it converges to the optimal policy, no matter what that is, given enough rollouts. More rollouts = more update away from the prior. Or if you don’t like that, good old policy/value iteration on a finite MDP is an example: start with random parameters, and the more iterations you run, the further the iterates provably and monotonically travel from the original random initialization toward the optimal policy.)
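Here is the finite-MDP version of the same point, as a small sketch on a randomly-generated 3-state, 2-action MDP of my own construction: value iteration started from a random value function. Because the Bellman optimality operator is a γ-contraction, each extra sweep shrinks the sup-norm distance to the optimal value function by at least a factor of γ, so more compute means moving monotonically further from the random initialization and closer to the optimal policy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

# Random transition kernel P[a, s, s'] (rows normalized) and rewards R[a, s].
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_actions, n_states))

def bellman(V):
    """One sweep of the Bellman optimality operator: V(s) <- max_a [R(a,s) + gamma * E[V(s')]]."""
    return np.max(R + gamma * (P @ V), axis=0)

# Near-exact V*, obtained by running the contraction to numerical convergence.
V_star = np.zeros(n_states)
for _ in range(2000):
    V_star = bellman(V_star)

V = rng.normal(size=n_states)  # random initialization
for k in range(10):
    print(k, np.max(np.abs(V - V_star)))  # sup-norm distance to V*
    V = bellman(V)
# Each sweep shrinks the printed error by at least a factor of gamma.
```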
One might say that the point of model-based RL is to not be stupid (and thus not safe merely by virtue of its stupidity) in all the ways you repeatedly emphasize purely model-free RL agents may be. And that’s why AGI will not be purely model-free, nor are our smartest current frontier models like LLMs purely model-free. I don’t see how you get this vision of AGI as some sort of gigantic frog retina, which is the strawman you seem to be aiming at in all your arguments about why you are convinced there’s no danger.
Obviously AGI will do things like ‘plan’ or ‘model’ or ‘search’ - or if you think that it will not, you should say so explicitly, and be clear about what kind of algorithm you think AGI would be, and explain how you think that’s possible. I would be fascinated to hear how you think that superhuman intelligence in all domains like programming or math or long-term strategy could be done by purely model-free approaches which do not involve planning or searching or building models of the world or utility-maximization!
(Or to put it yet another way: ‘scheming’ is not a meaningful discrete category of capabilities, but a value judgment about particular ways to abuse theory of mind / world-modeling capabilities; and it’s hard to see how one could create an AGI smart enough to be ‘AGI’, but also so stupid as to not understand people, or be incapable of basic human-level capabilities like ‘be a manager’ or ‘play poker’, or of generalizing its modeling of other agents. It would be quite bizarre to imagine a model-free AGI which must learn a separate ‘simple’ reactive policy of ‘scheming’ for each and every agent it comes across, wasting a huge number of parameters & samples every time, as opposed to simply meta-learning how to model agents in general and applying that, via planning/search, to all future tasks, zero-shot and at enormous parameter savings.)