My tentative view on MuZero:
Cool for board games and related tasks.
The Atari demo seems sketchy.
Not a big step towards making model-based RL work—instead, a step making it more like model-free RL.
Why?
A textbook benefit of model-based RL is that world models (i.e. models that predict observations) generalize to new reward functions and environments. MuZero gives this benefit up: its model is trained to predict only rewards, values, and policies, never observations.
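To make the lost benefit concrete, here is a minimal toy sketch (my own, not from the paper; all names are made up): with an observation-predicting model, the reward function is just an argument to the planner, so the very same model can be reused for a new task.

```python
# Toy sketch (hypothetical names, not the MuZero paper's code): planning with
# an observation-predicting model lets you swap in any reward function.
import itertools

def plan(next_obs, reward_fn, obs, actions, horizon):
    """Score every action sequence by rolling out the learned observation
    model next_obs(obs, a) and summing reward_fn over predicted observations."""
    def rollout_return(seq):
        o, total = obs, 0.0
        for a in seq:
            o = next_obs(o, a)        # model predicts the next observation
            total += reward_fn(o)     # reward is evaluated on that prediction
        return total
    return max(itertools.product(actions, repeat=horizon), key=rollout_return)[0]

# Toy 1-D world where the "learned" model happens to be exact.
next_pos = lambda pos, a: pos + a

# The same model plans for two different reward functions with no retraining.
print(plan(next_pos, lambda p: -abs(p - 5), obs=0, actions=[-1, 0, 1], horizon=5))  # -> 1
print(plan(next_pos, lambda p: -abs(p + 3), obs=0, actions=[-1, 0, 1], horizon=5))  # -> -1
```

A MuZero-style model instead maps observations to an abstract latent state and predicts reward and value directly from it, so a new reward function has nothing to attach to.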
The other textbook benefit of model-based RL is data efficiency. But on Atari, MuZero is roughly as sample-inefficient as model-free RL. In fact, by removing the need to predict observations it moves a lot closer to model-free methods, without gaining any efficiency. On top of that, it trains with 40 TPUs per game where other algorithms use a single GPU for similar training time. What if they had spent that extra compute on collecting more data?
In the low-data setting they outperform model-free methods. But, suspiciously, they didn't compare to any model-based method. They'd probably lose that comparison, because the low-data regime is exactly where an observation-predicting world model pays off.
MuZero only plans K=5 steps ahead, far fewer than AlphaZero. Two takeaways: 1) This again looks more like model-free RL, which effectively has K=1. 2) It makes me more optimistic that model-free RL can learn Go with only a moderate efficiency (and stability?) loss. (Paul has speculated this; also, the trained AlphaZero policy net is apparently still better than Lee Sedol without MCTS.)
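For reference, K is the number of steps the latent model is unrolled; below is a minimal sketch of what that K-step unrolled objective looks like (my own toy code: it borrows the paper's representation/dynamics/prediction terminology, but the architecture and interfaces are assumptions, not the authors' implementation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMuZeroNet(nn.Module):
    """Toy stand-in for MuZero's three functions: representation (h),
    dynamics (g), and prediction (f). Sizes and layers are arbitrary."""
    def __init__(self, obs_dim=8, latent_dim=16, num_actions=4):
        super().__init__()
        self.num_actions = num_actions
        self.h = nn.Linear(obs_dim, latent_dim)                       # obs -> latent state
        self.g = nn.Linear(latent_dim + num_actions, latent_dim + 1)  # (latent, action) -> (latent, reward)
        self.f = nn.Linear(latent_dim, num_actions + 1)               # latent -> (policy logits, value)

    def representation(self, obs):
        return torch.tanh(self.h(obs))

    def dynamics(self, state, action):
        a = F.one_hot(action, self.num_actions).float()
        out = self.g(torch.cat([state, a], dim=-1))
        return torch.tanh(out[..., :-1]), out[..., -1]

    def prediction(self, state):
        out = self.f(state)
        return out[..., :-1], out[..., -1]

def k_step_loss(model, obs, actions, pi_targets, v_targets, r_targets, K=5):
    """Unroll the learned dynamics K steps in latent space and match policy,
    value, and reward targets at every step (in the paper the targets come
    from MCTS visit counts, n-step returns, and observed rewards)."""
    s = model.representation(obs)
    loss = 0.0
    for k in range(K + 1):
        policy_logits, value = model.prediction(s)
        # cross-entropy against a soft target distribution over actions
        loss = loss - (pi_targets[k] * F.log_softmax(policy_logits, dim=-1)).sum(-1).mean()
        loss = loss + F.mse_loss(value, v_targets[k])
        if k < K:
            s, reward = model.dynamics(s, actions[k])   # never predicts observations
            loss = loss + F.mse_loss(reward, r_targets[k])
    return loss
```

The point relevant here is just that the whole objective lives in latent space: nothing forces the unrolled states to reconstruct future observations.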