Consider a model learning to play Go against an opponent (filter) that always chooses random moves. The model will quickly learn to beat the random opponent, but will stop improving shortly afterwards and would do poorly against a human. Once the model starts winning every time, the outcomes of the games are perfectly predictable and provide no information. In this case the filter/opponent can’t discriminate between better-than-random opponents and better-than-human opponents, so the model performance stagnates.
This is going to depend on what sort of model and training regime we are talking about, and how flexible you are in finding some component to label a ‘filter’.
Consider an evolutionary agent like evolution strategies: model-free, policy-based. It mutates, rolls out games, and each mutant wins half the time initially, creating fitness gradients between winners & losers, but it quickly homes in on some very simple tricks which let it defeat the random baseline ~100% of the time. Then, because there are no longer any fitness gradients, learning immediately halts. The model successfully learns, but as little as possible. If the mutation rate (learning rate) doesn’t decay, it will wander around model space, only periodically purging bad mutants to maintain minimum adequacy; given enough time, maybe it’d do something like ‘survival of the flattest’ in finding a basin (cf. grokking), but who cares, it’ll still be terrible.
Policy gradients like PPO would also do this (probably?).
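To make the vanishing-fitness-gradient point concrete, here is a minimal evolution-strategies sketch (everything here is a toy: `win_rate` is a stand-in for rolling out games against the random opponent, not a real evaluator). Once every mutant beats the baseline 100% of the time, all fitnesses tie, the centered fitness vector is zero, and the update is exactly zero, so learning halts:

```python
import numpy as np

def es_step(theta, evaluate, sigma=0.1, lr=0.01, pop=50, rng=np.random.default_rng(0)):
    """One evolution-strategies update: perturb, roll out, move along the
    covariance between perturbations and fitness."""
    eps = rng.standard_normal((pop, theta.size))
    fitness = np.array([evaluate(theta + sigma * e) for e in eps])
    centered = fitness - fitness.mean()
    if np.allclose(centered, 0):     # every mutant ties: no fitness differences,
        return theta                 # so the update is exactly zero and learning halts
    return theta + lr * (eps.T @ centered) / (pop * sigma)

# Toy stand-in for 'win-rate vs the random opponent': saturates at 1.0 once
# theta is good enough, after which every mutant ties and updates stop.
def win_rate(theta):
    return min(1.0, max(0.0, theta.sum()))

theta = np.zeros(8)
for _ in range(500):
    theta = es_step(theta, win_rate)
```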
Consider a model-free value agent like DQN. It observes all of the state transitions, bootstrapping rewards backwards through the game. It does better than evolution strategies because it keeps propagating rewards back through moves and keeps changing strategies instead of halting as soon as it beats the baseline, and the randomized games keep exposing it to new situations and to errors in its value function. It asymptotes at pretty bad play, probably, but it would be hard to predict in advance how bad, exactly: eg. we know that something like TD-Gammon can do very well at backgammon but the same approach doesn’t seem to do well for Go, and in retrospect, people usually tell a story about how the inherent randomization of the dice in backgammon ‘smooths the value function’ and ‘forces exploration’ compared to Go/chess, despite the instability of self-play/random baselines. For any given problem/baseline opponent, I’m not sure how well people would be able to predict performance a priori.
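The bootstrapping step is why it keeps learning after the outcome is predictable: the TD error is nonzero whenever the value estimate is wrong, not just when a game is lost. A tabular sketch of that update (DQN replaces the table with a network and adds replay and target networks, but the target is the same idea):

```python
import numpy as np

# Tabular Q-learning sketch: a toy stand-in for DQN's bootstrapped target.
n_states, n_actions = 100, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def td_update(s, a, r, s_next, done):
    # Target: reward now, plus discounted best value of the next state.
    target = r + (0.0 if done else gamma * Q[s_next].max())
    Q[s, a] += alpha * (target - Q[s, a])   # nonzero whenever the estimate is wrong
```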
Consider a model-based agent like MuZero learning the game rules from the random opponent. It observes all of the state transitions, infers an environment, goes off and does self-play for a long time, periodically coming back to play the random agent; sometimes it wins, sometimes it loses, and it does so deliberately, because it’s looking at the final reward to figure out what komi is. After some exploration it’s done, and it bootstraps to superhuman skill. This model plays only random opponents (aside from hallucinated self-play games), but successfully learns.
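A minimal sketch of the model-based idea, under heavy simplification: tally the dynamics from watched games, then roll out ‘imagined’ games inside the learned model without needing the real opponent at all. (MuZero learns a neural latent model; this counts-based table is just the simplest possible stand-in, and the function names are illustrative.)

```python
import numpy as np
from collections import defaultdict

# Empirical dynamics model, filled in from games watched against the random opponent.
transitions = defaultdict(lambda: defaultdict(int))

def observe(state, action, next_state):
    transitions[(state, action)][next_state] += 1

def imagined_step(state, action, rng=np.random.default_rng(0)):
    """Sample the learned model instead of the real game: the basis for
    self-play rollouts that never touch the real opponent."""
    successors = transitions[(state, action)]
    states = list(successors)
    p = np.array([successors[s] for s in states], dtype=float)
    return states[rng.choice(len(states), p=p / p.sum())]
```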
Consider a model-based tree-search agent with a simulator, like MCTS. It doesn’t learn, it only plans. It ignores the random games entirely and instead spends arbitrary amounts of compute at play-time to search so deeply that it could defeat even a superhuman opponent. This model doesn’t ‘fail to learn’, because it never tried to learn in the first place.
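A sketch of pure planning, assuming a hypothetical `Game` interface with `to_move()`, `legal_moves()`, `play(move)`, `is_over()`, and `winner()`. This is flat Monte-Carlo search rather than full MCTS/UCT, which grows a tree and is far stronger, but it shows the relevant property: no parameters, nothing retained between games, strength bought purely with compute per move:

```python
import random

def plan_move(game, n_playouts=1000):
    """Pick the move whose random playouts win most often. Nothing is learned
    between games; more playouts per move buys stronger play."""
    me = game.to_move()

    def playout(g):
        while not g.is_over():
            g = g.play(random.choice(g.legal_moves()))
        return 1 if g.winner() == me else 0

    return max(game.legal_moves(),
               key=lambda m: sum(playout(game.play(m)) for _ in range(n_playouts)))
```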
Now consider an untrained model playing Go against a superhuman opponent. The opponent wins every time and the untrained model continues to have poor performance. Once again the games are perfectly predictable and provide no information.
Also depends.
Consider an evolutionary agent like evolution strategies: model-free, policy-based. It mutates, rolls out games, and each mutant loses its game every time, receiving a final reward of 0; with no difference in fitness across mutants, there is no covariance with changes in the model, and no evolution. This model does indeed fail to learn, and will simply jitter around randomly in model space. (Policy gradients like PPO might do something a little different, depending on whether they can use baselines to define ‘played better than usual in this game’, like with reward shaping on length of game / territory; see the sketch below.)
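A tiny numerical illustration of both halves of that claim (the shaped signal here is hypothetical, standing in for something like moves survived or territory held): with identical final returns the update vector is exactly zero, while any per-mutant signal that varies restores a gradient even though every game is lost.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.standard_normal((50, 8))          # 50 mutants of an 8-parameter policy

returns = np.zeros(50)                      # every mutant loses: identical final rewards
print(eps.T @ (returns - returns.mean()))   # all-zero vector: no covariance, no update

# Hypothetical shaped signal (e.g. moves survived, territory held) that still
# differs across mutants even though they all lose:
shaped = rng.uniform(20, 80, size=50)
print(np.linalg.norm(eps.T @ (shaped - shaped.mean())))   # nonzero: a gradient again
```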
But the episodes are not uninformative even if they always result in defeat. The results of the games may be predictable (and algorithms looking only at the final return will do poorly), but the moves themselves are not. They are very informative. In fact, you are receiving about 80 very valuable labels per game: the best possible move for 80 board states.
A straight behavior-cloning model would find this very informative, and the more times it trains & plays, the better it will get. This is in fact an ideal scenario for expert iteration, because you have on hand an expert which will tell you the exact right move on every other move of every game, no matter how good you get. Likewise, an AlphaGo/Zero agent will find it valuable: the superhuman opponent mercilessly samples exactly the board positions where the agent has misestimated something, and it must correct itself by deeper search.
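A behavior-cloning sketch of harvesting those ~80 labels per lost game: every move the superhuman opponent makes is a (board state, best move) pair, trained on by cross-entropy. `policy_net`, `encode`, and the `opponent_moves()` iterator are hypothetical placeholders for however positions are stored and featurized:

```python
import torch
import torch.nn.functional as F

def clone_step(policy_net, optimizer, games, encode):
    """One supervised step on the opponent's moves: the games are all losses,
    but each one supplies dozens of expert-labeled positions."""
    states, moves = [], []
    for game in games:
        for state, move in game.opponent_moves():   # ~80 labeled positions per game
            states.append(encode(state))
            moves.append(move)
    logits = policy_net(torch.stack(states))
    loss = F.cross_entropy(logits, torch.tensor(moves))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()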
For fixed model size and memory, I would expect self-play to converge to some (high) level of performance, not continue to improve indefinitely. Though I would have to think about this more.
Unless the model is big enough to solve the game, it will have to asymptote. (Which is why you have to scale data/compute/size appropriately, to avoid bottlenecks.)