My tentative view on MuZero:
Cool for board games and related tasks.
The Atari demo seems sketchy.
Not a big step towards making model-based RL work—instead, a step making it more like model-free RL.
Why?
A textbook benefit of model-based RL is that world models (i.e. models that predict observations) generalize to new reward functions and environments. MuZero gives this benefit up: its model is trained to predict only rewards, values, and policies, never observations.
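To make the lost benefit concrete, here is a minimal toy sketch (my own, not from the paper; all names are made up): with an observation-predicting model, the reward function is just an argument to the planner, so the very same model can be reused for a new task.

```python
# Toy sketch (hypothetical names, not the MuZero paper's code): planning with
# an observation-predicting model lets you swap in any reward function.
import itertools

def plan(next_obs, reward_fn, obs, actions, horizon):
    """Score every action sequence by rolling out the learned observation
    model next_obs(obs, a) and summing reward_fn over predicted observations."""
    def rollout_return(seq):
        o, total = obs, 0.0
        for a in seq:
            o = next_obs(o, a)        # model predicts the next observation
            total += reward_fn(o)     # reward is evaluated on that prediction
        return total
    return max(itertools.product(actions, repeat=horizon), key=rollout_return)[0]

# Toy 1-D world where the "learned" model happens to be exact.
next_pos = lambda pos, a: pos + a

# The same model plans for two different reward functions with no retraining.
print(plan(next_pos, lambda p: -abs(p - 5), obs=0, actions=[-1, 0, 1], horizon=5))  # -> 1
print(plan(next_pos, lambda p: -abs(p + 3), obs=0, actions=[-1, 0, 1], horizon=5))  # -> -1
```

A MuZero-style model instead maps observations to an abstract latent state and predicts reward and value directly from it, so a new reward function has nothing to attach to.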
The other textbook benefit of model-based RL is data efficiency. But on Atari, MuZero is roughly as sample-inefficient as model-free RL. In fact, by removing the need to predict observations it moves a lot closer to model-free methods, without gaining any efficiency. On top of that, it trains with 40 TPUs per game where other algorithms use a single GPU for similar training time. What if they had spent that extra compute on collecting more data?
In the low-data setting they outperform model-free methods. But, suspiciously, they didn't compare to any model-based method. They'd probably lose that comparison, because the low-data regime is exactly where an observation-predicting world model pays off.
MuZero only plans K=5 steps ahead, far fewer than AlphaZero. Two takeaways: 1) This again looks more like model-free RL, which effectively has K=1. 2) It makes me more optimistic that model-free RL can learn Go with only a moderate efficiency (and stability?) loss. (Paul has speculated this; also, the trained AlphaZero policy net is apparently still better than Lee Sedol without MCTS.)
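For reference, K is the number of steps the latent model is unrolled; below is a minimal sketch of what that K-step unrolled objective looks like (my own toy code: it borrows the paper's representation/dynamics/prediction terminology, but the architecture and interfaces are assumptions, not the authors' implementation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMuZeroNet(nn.Module):
    """Toy stand-in for MuZero's three functions: representation (h),
    dynamics (g), and prediction (f). Sizes and layers are arbitrary."""
    def __init__(self, obs_dim=8, latent_dim=16, num_actions=4):
        super().__init__()
        self.num_actions = num_actions
        self.h = nn.Linear(obs_dim, latent_dim)                       # obs -> latent state
        self.g = nn.Linear(latent_dim + num_actions, latent_dim + 1)  # (latent, action) -> (latent, reward)
        self.f = nn.Linear(latent_dim, num_actions + 1)               # latent -> (policy logits, value)

    def representation(self, obs):
        return torch.tanh(self.h(obs))

    def dynamics(self, state, action):
        a = F.one_hot(action, self.num_actions).float()
        out = self.g(torch.cat([state, a], dim=-1))
        return torch.tanh(out[..., :-1]), out[..., -1]

    def prediction(self, state):
        out = self.f(state)
        return out[..., :-1], out[..., -1]

def k_step_loss(model, obs, actions, pi_targets, v_targets, r_targets, K=5):
    """Unroll the learned dynamics K steps in latent space and match policy,
    value, and reward targets at every step (in the paper the targets come
    from MCTS visit counts, n-step returns, and observed rewards)."""
    s = model.representation(obs)
    loss = 0.0
    for k in range(K + 1):
        policy_logits, value = model.prediction(s)
        # cross-entropy against a soft target distribution over actions
        loss = loss - (pi_targets[k] * F.log_softmax(policy_logits, dim=-1)).sum(-1).mean()
        loss = loss + F.mse_loss(value, v_targets[k])
        if k < K:
            s, reward = model.dynamics(s, actions[k])   # never predicts observations
            loss = loss + F.mse_loss(reward, r_targets[k])
    return loss
```

The point relevant here is just that the whole objective lives in latent space: nothing forces the unrolled states to reconstruct future observations.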