OpenAI Five plays 180 years worth of games against itself every day, learning via self-play. It trains using a scaled-up version of Proximal Policy Optimization running on 256 GPUs and 128,000 CPU cores — a larger-scale version of the system we built to play the much-simpler solo variant of the game last year. Using a separate LSTM for each hero and no human data, it learns recognizable strategies. This indicates that reinforcement learning can yield long-term planning with large but achievable scale — without fundamental advances, contrary to our own expectations upon starting the project.
RL researchers (including ourselves) have generally believed that long time horizons would require fundamentally new advances, such as hierarchical reinforcement learning. Our results suggest that we haven't been giving today's algorithms enough credit — at least when they're run at sufficient scale and with a reasonable way of exploring.
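For readers who haven't met PPO, here is a minimal sketch of its clipped surrogate objective in PyTorch. This is a toy illustration of the algorithm family named in the quote, not OpenAI Five's actual training code; every name in it (`ppo_clip_loss`, `clip_eps`, the random batch) is hypothetical.

```python
# Sketch of PPO's clipped surrogate objective (Schulman et al., 2017).
# Toy single-batch version; not OpenAI Five's distributed training code.
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Maximize the pessimistic (minimum) term; negate to get a loss.
    return -torch.min(unclipped, clipped).mean()

# Toy usage with random numbers standing in for one batch of rollouts.
batch = 64
old_lp = torch.randn(batch)
new_lp = old_lp + 0.1 * torch.randn(batch)  # slightly shifted policy
adv = torch.randn(batch)                    # advantage estimates
print(ppo_clip_loss(new_lp, old_lp, adv).item())
```

The clipping is the whole trick: it caps how far one update can move the policy away from the one that gathered the data, which is what makes the algorithm stable enough to scale up.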
From a Hacker News comment by one of the researchers:

We are very encouraged by the algorithmic implication of this result — in fact, it mirrors closely the story of deep learning (existing algorithms at large scale solve otherwise unsolvable problems). If you have a very hard problem for which you have a simulator, our results imply there is a real, practical path towards solving it. This still needs to be proven out in real-world domains, but it will be very interesting to see the full ramifications of this finding.
In other words: given enough compute, current algorithms can tackle levels of sophistication (long time horizons, imperfect information, high-dimensional action spaces) that even experienced researchers would not have predicted. And the researcher suggests they could tackle even more sophisticated problems, as long as you have a simulator for the problem domain.
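To make the "simulator plus self-play" recipe concrete, here is a heavily simplified sketch: REINFORCE playing rock-paper-scissors against a periodically frozen copy of itself. This is an assumption-laden toy, chosen only to show the shape of the loop; it is not OpenAI Five's architecture, and all names in it are hypothetical.

```python
# Minimal self-play sketch: REINFORCE vs. a frozen past copy of itself.
# Toy illustration of "learn in a simulator by playing yourself".
import numpy as np

PAYOFF = np.array([[ 0, -1,  1],   # rock vs (rock, paper, scissors)
                   [ 1,  0, -1],   # paper
                   [-1,  1,  0]])  # scissors

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

rng = np.random.default_rng(0)
logits = np.zeros(3)           # learner's policy parameters
opponent = softmax(logits)     # frozen snapshot of the learner

for step in range(20_000):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)      # learner's move
    b = rng.choice(3, p=opponent)   # past self's move
    reward = PAYOFF[a, b]
    # REINFORCE: push up the log-prob of the sampled action, scaled by reward.
    grad = -probs
    grad[a] += 1.0
    logits += 0.01 * reward * grad
    if step % 500 == 0:             # periodically refresh the opponent
        opponent = softmax(logits)

# Self-play dynamics roughly orbit the uniform Nash mix (1/3, 1/3, 1/3).
print(np.round(softmax(logits), 2))
```

The simulator is what makes the 180-years-per-day figure possible: experience is as cheap as CPU time, so raw scale can substitute for algorithmic novelty.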
One implication would be that we can solve protein folding.
Indeed, we did with AlphaFold 2, two years ago.
Predicting receptor-binding behavior is the next step.