Rudi C comments on EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised

Rudi C 1 Dec 2021 10:49 UTC
Expertise status: I am just starting with RL.

Would giving these agents a hardcoded model of the environment improve them, or do they need the representations they learn themselves?

With EfficientZero’s architecture, how many hours on a single TPUv2 does the agent need to reach amateur human level? More generally, is EfficientZero sample-efficient or compute-efficient?

What is currently the most compute-efficient algorithm for simple, two-player, deterministic games with very large state spaces (e.g., Go)?

PS: I am asking because I learn much better by coding than by reading alone, and I have been trying to write the simplest non-search-based, self-supervised algorithm I can for games like chess.

One of the main challenges I have faced is that cloning the current state, applying an action to it, and then using the neural net (a simple 4-layer MLP) to infer the new state’s value is quite slow (on the order of a second), so running MCTS would seem to require far more compute. (I am running on Colab Pro’s TPUv2, but the game state is managed by OpenSpiel on the CPU.)
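For concreteness, here is a minimal sketch of the clone, apply, evaluate loop described above. It is not the actual code and makes several assumptions: `NimState` is a hypothetical toy game standing in for an OpenSpiel state (it only mirrors the `clone`, `legal_actions`, and `apply_action` method names), the MLP is plain NumPy with random weights, and `init_mlp`, `mlp_values`, and `child_values` are made-up helper names. The one idea it illustrates is stacking all child observations into a single batch so the network is called once per position rather than once per move, which is a common way to amortize per-call overhead. Whether that overhead is the real bottleneck here is untested.

```python
import numpy as np


class NimState:
    """Hypothetical toy game (single-pile Nim) standing in for an OpenSpiel state."""

    def __init__(self, stones=10, player=0):
        self.stones = stones
        self.player = player

    def clone(self):
        return NimState(self.stones, self.player)

    def legal_actions(self):
        # A move removes 1, 2, or 3 stones, never more than remain.
        return [a for a in (1, 2, 3) if a <= self.stones]

    def apply_action(self, action):
        self.stones -= action
        self.player = 1 - self.player

    def observation(self):
        # Assumed encoding: scaled stone count and the player to move.
        return np.array([self.stones / 10.0, float(self.player)], dtype=np.float32)


def init_mlp(sizes, seed=0):
    """Random (weight, bias) pairs; sizes=[2, 32, 32, 32, 1] gives 4 layers."""
    rng = np.random.default_rng(seed)
    return [
        ((rng.standard_normal((m, n)) / np.sqrt(m)).astype(np.float32),
         np.zeros(n, dtype=np.float32))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]


def mlp_values(params, x):
    """Forward pass on a batch of shape (batch, features); returns (batch,) values."""
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)  # ReLU hidden layers
    w, b = params[-1]
    return np.tanh(x @ w + b)[:, 0]  # values squashed into [-1, 1]


def child_values(params, state):
    """Clone the state once per legal action, then evaluate all children in one call."""
    actions = state.legal_actions()
    observations = []
    for action in actions:
        child = state.clone()
        child.apply_action(action)
        observations.append(child.observation())
    batch = np.stack(observations)  # shape: (num_actions, features)
    return actions, mlp_values(params, batch)


params = init_mlp([2, 32, 32, 32, 1])
actions, values = child_values(params, NimState(stones=10))
for action, value in zip(actions, values):
    print(f"action {action}: value {value:+.3f}")
```

With a real OpenSpiel game the observation would come from the state itself rather than a hand-written encoding, and on an accelerator the batched forward pass would typically be compiled once and reused, but those details are outside this sketch.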