Good points. Here is my rough summary:
1. The way a bootstrapping approach filters responses can be relevant, but often isn’t the best way to think about the problem.
Consider self-play. Here, the model isn’t taking in new outside information; it just keeps practicing internally until it’s very good at a task.
This isn’t limited to getting good at a single task. A self-play agent can practice every possible task, and once it’s been assigned a real task it can simply “load” all of its training relevant to that task.
2. Generally speaking, a larger model may already have a sub-model which is performant on a specific task. Bootstrapping, fine-tuning, or ensembling answers can all potentially “unlock” that sub-model, leading to high performance from bootstrapping.
This approach only works once: after the sub-model is found, it shouldn’t improve much without new information.
Today, we haven’t made much use of this property. All of those papers showing that a change of prompt produces large increases in performance are a sign that large language models contain these sub-models, and that we could do more to unlock them.
3. Whether or not bootstrapping is dangerous depends on a lot of factors. Is there some sort of self-play that can provide performance gains without new information? Does the model contain a performant sub-model that just needs to be unlocked? Has the model already been fine-tuned enough?
Example: code-writing language models might learn how to exploit code vulnerabilities and cause real harm. Reaching that capability might require scaling the model, or bootstrapping alone might be enough; whether bootstrapping can achieve this is an empirical question.
4. Overall, this could be dangerous. Small models might look safe, but could be hiding dangerous sub-models which can be unlocked using bootstrapping.
I agree with these points. Some comments:
Point #1:
Self-play is an interesting case: here the model’s output is moves in a game, and the “filter” is the opponent (in self-play, the opponent uses the same model). The opponent discriminates between good and bad strategies by winning or losing (and appropriately attributing that outcome to different moves/contexts).
I still think the generator-filter model can apply here.
Consider a model learning to play Go against an opponent (filter) that always chooses random moves. The model will quickly learn to beat the random opponent, but will stop improving shortly afterward and would still do poorly against a human. Once the model starts winning every time, the outcomes of the games are perfectly predictable and provide no information. In this case the filter/opponent can’t discriminate between better-than-random opponents and better-than-human opponents, so the model’s performance stagnates.
Now consider an untrained model playing Go against a superhuman opponent. The opponent wins every time and the untrained model continues to have poor performance. Once again the games are perfectly predictable and provide no information. Now the filter can’t discriminate between random opponents and human opponents, so performance never improves. (This is an extreme case; I expect models with a little training to improve somewhat on contact with a better opponent.)
So self-play is a good middle ground, where the filter is well suited to discriminating between slightly-better and slightly-worse opponents, leading to steady improvement.
What are the limitations of self-play? Like before, the model has to generate enough good sequences in order to learn from them, and the opponent has to properly reward these sequences. If the outcomes of the game are somewhat random, this should slow down training by making the filter noisier.
For fixed model size and memory, I would expect self-play to converge to some (high) level of performance, not continue to improve indefinitely. Though I would have to think about this more.
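As a toy illustration of the information argument (my own sketch, not anything from the thread): model the chance that the learner wins as an Elo-style logistic function of the skill gap, and measure how many bits one game outcome carries. The entropy peaks when the players are evenly matched and collapses toward zero when either side always wins, which is exactly the stagnation in the two scenarios above.

```python
import math

def win_probability(skill_gap):
    """Elo-style logistic: P(learner wins) given (learner - opponent) skill."""
    return 1.0 / (1.0 + 10.0 ** (-skill_gap / 400.0))

def outcome_entropy(p):
    """Bits of information carried by one game result with win probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for gap in [-800, -400, 0, 400, 800]:
    p = win_probability(gap)
    print(f"skill gap {gap:+5d}: P(win) = {p:.3f}, entropy = {outcome_entropy(p):.3f} bits")
```

Against the random opponent the gap quickly becomes large and positive; against the superhuman opponent it starts large and negative; either way the outcome signal the filter provides goes to zero.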
Point #2:
This roughly corresponds to my point about bootstrapping models with no filter. I would expect that the performance of the sub-model is limited by the amount of relevant training data learned before bootstrapping. The scope of “relevant training data” can be large, e.g. data on English grammar can help with French translation even if it isn’t directly related to French.
Point #3:
This suggests a way to get more robust AI. When deploying a fine-tuned model, make sure that it has received enough fine-tuning/bootstrapping so that it’s converged in some sense. This makes it less likely that it exhibits sudden changes in performance in the real world. All else equal, smaller models with less training data are probably more stable in this regard.
> Consider a model learning to play Go against an opponent (filter) that always chooses random moves. The model will quickly learn to beat the random opponent, but will stop improving shortly afterward and would still do poorly against a human. Once the model starts winning every time, the outcomes of the games are perfectly predictable and provide no information. In this case the filter/opponent can’t discriminate between better-than-random opponents and better-than-human opponents, so the model’s performance stagnates.
This is going to depend on what sort of model and training regime we are talking about, and how flexible you are in finding some component to label a ‘filter’.
Consider an evolutionary agent like evolution strategies: model-free, policy-based. It mutates, rolls out games, and each mutant initially wins about half the time, creating fitness gradients between winners & losers, but it quickly homes in on some very simple tricks which let it defeat the random baseline ~100% of the time. Then, because there are no longer any fitness gradients, learning immediately halts. The model successfully learns, but as little as possible. If the mutation rate (learning rate) doesn’t decay, it will wander around model space, only periodically purging bad mutants to maintain minimum adequacy; given enough time maybe it’d do something like ‘survival of the flattest’ in finding a basin (cf. grokking), but who cares, it’ll still be terrible.
Policy gradients like PPO would also do this (probably?).
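A minimal sketch of that failure mode (an OpenAI-style evolution-strategies update; `play_vs_baseline` is a hypothetical stand-in for rolling out one full game against the fixed random opponent):

```python
import numpy as np

def es_step(params, play_vs_baseline, npop=32, sigma=0.1, lr=0.05, rng=None):
    """One evolution-strategies update against a fixed baseline opponent.

    play_vs_baseline(params) -> 1.0 for a win, 0.0 for a loss.
    """
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal((npop, params.size))
    fitness = np.array([play_vs_baseline(params + sigma * n) for n in noise])
    # Center the fitnesses. Once every mutant wins (fitness all 1.0) the
    # centered values are all zero: no covariance with the noise, and the
    # update below is exactly zero. Learning halts right there.
    advantage = fitness - fitness.mean()
    return params + lr * (advantage @ noise) / (npop * sigma)
```

Note that with this centered estimator an all-losses population (the superhuman-opponent case below) zeroes out the update exactly as an all-wins population does; a selection-based ES that breaks ties randomly would instead drift through model space.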
Consider a model-free value agent like DQN. It observes all of the state-transition pairs, bootstrapping value estimates from rewards. It does better than evolution strategies because it keeps propagating rewards back through moves and keeps changing strategies instead of halting as soon as it beats the baseline, and randomized games keep exposing it to new situations and errors in its value functions. It probably asymptotes at pretty bad play, but it would be hard to predict in advance exactly how bad: e.g. we know that TD learning like TD-Gammon’s can do very well for backgammon but doesn’t seem to do well for Go, and in retrospect people usually tell a story about how the inherent randomization of dice in backgammon ‘smooths the value function’ and ‘forces exploration’ compared to Go/chess, despite the instability of self-play/random baselines; for any given problem/baseline opponent, I’m not sure how well people could predict performance a priori.
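A sketch of the “keeps propagating rewards back through moves” property, using tabular Q-learning rather than a full DQN, with the fixed opponent folded into the environment (`legal_moves` is a hypothetical helper):

```python
from collections import defaultdict

Q = defaultdict(float)  # (state, move) -> estimated value

def td_backup(s, a, r, s_next, legal_moves, alpha=0.1, gamma=1.0, terminal=False):
    """One temporal-difference backup. Unlike the evolutionary update above,
    this fires on every transition: even once the agent wins every game,
    newly encountered states keep producing nonzero TD errors."""
    bootstrap = 0.0 if terminal else max(Q[(s_next, b)] for b in legal_moves(s_next))
    Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])
```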
Consider a model-based agent like MuZero learning the game rules from the random opponent. It observes all of the state transitions, infers an environment, goes off and does self-play for a long time, periodically coming back to play the random agent; sometimes it wins, sometimes it loses, and it does so deliberately because it’s looking at the final reward trying to figure out what komi is. After some exploration it’s done, and it bootstraps to superhuman skill. This model plays only random opponents (aside from fake hallucinated self-play games), but successfully learns.
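The same recipe compressed into a sketch (every callable here is a hypothetical placeholder; this is the generic model-based loop rather than MuZero’s actual algorithm):

```python
def model_based_training(real_games, fit_model, imagined_self_play,
                         play_vs_random, n_rounds=10):
    """Learn a world model from games against the random opponent, then
    improve the policy almost entirely inside that model."""
    transitions = [t for game in real_games for t in game]  # (s, a, s_next, r)
    world_model = fit_model(transitions)  # supervised next-state/reward prediction
    policy = None
    for _ in range(n_rounds):
        policy = imagined_self_play(world_model, policy)  # no real games needed
        # Occasional real games pin down what transitions alone can't
        # determine, e.g. the exact final-reward rule (komi).
        transitions += play_vs_random(policy)
        world_model = fit_model(transitions)
    return policy
```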
Consider a model-based tree-search agent with a simulator, like MCTS. It doesn’t learn; it only plans. It ignores the random games and then uses up arbitrary amounts of compute at play time to search so deeply that it defeats the superhuman opponent. This model doesn’t fail to learn, because it never tries to learn in the first place.
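A sketch of planning with zero learning, using flat Monte-Carlo move selection (much weaker than the deep search described above, but it shows the shape; the game-state API `legal_moves`/`step`/`is_terminal`/`result` is hypothetical). Nothing is stored between moves; all the strength comes from play-time compute:

```python
import random

def choose_move(state, legal_moves, step, is_terminal, result, n_rollouts=200):
    """Pick the move whose random rollouts score best. No parameters, no
    training: strength scales only with play-time compute (n_rollouts)."""
    def rollout(s):
        while not is_terminal(s):
            s = step(s, random.choice(legal_moves(s)))
        return result(s)  # +1 for a win by us, -1 for a loss
    def score(move):
        return sum(rollout(step(state, move)) for _ in range(n_rollouts))
    return max(legal_moves(state), key=score)
```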
> Now consider an untrained model playing Go against a superhuman opponent. The opponent wins every time and the untrained model continues to have poor performance. Once again the games are perfectly predictable and provide no information.
Also depends.
Consider an evolutionary agent like evolution strategies: model-free, policy-based. It mutates, rolls out games, and each mutant loses its game every time, receiving final rewards of 0; with no difference in fitness across all mutants, there is no covariance with changes in the model and no evolution. This model does indeed fail to learn, and will simply jitter around randomly in model space. (Policy gradients like PPO might do something a little different depending on whether they can use baselines to define ‘played better than usual in this game’, like with reward shaping on length of game / territory.)
But the episodes are not uninformative even if they always result in defeat. The results of the games may be predictable (and algorithms looking only at the final return will do poorly), but the moves themselves are not. They are very informative. In fact, you are receiving about 80 very valuable labels per game: the best possible move for 80 board states.
A straight behavior-cloning model would find this very informative, and the more times it trains & plays, the better it will get—this is in fact an ideal scenario for expert iteration, because you have on hand an expert which will tell you the exact right move in every other move of every game no matter how good you get. Likewise, an AlphaGo/Zero agent will find it valuable: the superhuman opponent mercilessly samples board positions where it has misestimated something, and it needs to correct itself by deeper search.
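A minimal behavior-cloning update on those labels (PyTorch; `policy_net`, a network mapping board features to move logits, is a hypothetical stand-in):

```python
import torch
import torch.nn.functional as F

def bc_update(policy_net, optimizer, boards, expert_moves):
    """Supervised update on (board, expert move) pairs harvested from losses.
    The final result of each game is irrelevant; every move the superhuman
    opponent played is a free label for 'best move in this position'."""
    logits = policy_net(boards)                   # (batch, num_board_points)
    loss = F.cross_entropy(logits, expert_moves)  # expert moves as class indices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```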
> For fixed model size and memory, I would expect self-play to converge to some (high) level of performance, not continue to improve indefinitely. Though I would have to think about this more.
Unless the model is big enough to solve the game, it will have to asymptote. (Which is why you have to scale data/compute/size appropriately, to avoid bottlenecks.)