Bootstrapping Language Models

Note: I’m pretty uncertain of my conclusions in this post, and want to hear other people’s thoughts on this.

It seems possible to bootstrap language models to some degree. What are the safety implications of this?

For example, you could:

  1. Ask a language model to write prompts for short stories

  2. Give the language model these prompts and have it generate short stories

  3. Use these new short stories as training data, thus improving the model’s ability to write short stories (optionally, you could select for good short stories and train only on those)

  4. Repeat until the model is really good at writing short stories (a rough sketch of this loop follows).
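
To make the loop concrete, here is a minimal sketch in Python. The `generate`, `fine_tune`, and `score_story` callables are hypothetical placeholders for sampling from the model, running a round of fine-tuning, and rating story quality; they are not any particular library’s API.

```python
from typing import Any, Callable, List, Optional

def bootstrap(
    model: Any,
    generate: Callable[[Any, str], str],         # placeholder: sample text from the model
    fine_tune: Callable[[Any, List[str]], Any],  # placeholder: one round of fine-tuning
    score_story: Optional[Callable[[str], float]] = None,  # placeholder: optional quality filter
    n_rounds: int = 5,
    stories_per_round: int = 1000,
    quality_threshold: float = 0.8,
) -> Any:
    """Toy version of the bootstrapping loop described in steps 1-4."""
    for _ in range(n_rounds):
        # Step 1: ask the model to write prompts for short stories.
        prompts = [generate(model, "Write a prompt for a short story.")
                   for _ in range(stories_per_round)]
        # Step 2: have the model write a story for each prompt.
        stories = [generate(model, p) for p in prompts]
        # Step 3 (optional): keep only the stories the filter rates highly enough.
        if score_story is not None:
            stories = [s for s in stories if score_story(s) >= quality_threshold]
        # Steps 3-4: retrain on the new stories, then repeat.
        model = fine_tune(model, stories)
    return model
```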

Would this lead to arbitrarily large increases in capability?

In the case where we don’t select for well-written short stories, I would guess no. I would expect a technique like this to improve the model’s ability to write good short stories only to a limited extent, at the cost of getting worse at unrelated tasks.

Conceptually, retraining a model on a subset of its possible outputs seems like it would bias the model toward producing similar output in the future; in effect, a form of fine-tuning. I would expect this to increase the model’s performance on related tasks (e.g., writing longer stories) and reduce its performance on unrelated tasks (making lists?).

At best, the language model produces short stories similar to the original training data. After it trains on the newly generated data, the new short stories it writes should be similar in quality to its earlier output (slightly better, because it has been fine-tuned to produce short stories). It seems hard to produce training data of significantly higher quality without filtering the output somehow.

In the case where we filter the output to produce high-quality training data, it seems capability could increase. To obtain a large amount of high-quality data, the model needs to produce good short stories frequently enough, and the filter must do a good job of selecting them.

The rate at which a model produces good short stories depends on how much overlap its output distribution has with the hypothetical output distribution of a model trained on high-quality data. As the model improves, it will produce high-quality data more frequently.
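
As a rough back-of-the-envelope illustration (the numbers below are made up): if the filter accepts a fraction p of the model’s stories, collecting n accepted stories takes roughly n / p generations, so every improvement in the model’s hit rate directly cuts the cost of building the next round of training data.

```python
import math

def generations_needed(n_accepted: int, acceptance_rate: float) -> int:
    """Expected number of stories to generate so that roughly `n_accepted`
    of them pass the filter (treating each story as an independent draw)."""
    return math.ceil(n_accepted / acceptance_rate)

# Made-up numbers: at a 2% acceptance rate, 10,000 filtered stories cost
# ~500,000 generations; if bootstrapping pushes the rate to 20%, ~50,000.
print(generations_needed(10_000, 0.02))  # 500000
print(generations_needed(10_000, 0.20))  # 50000
```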

The quality of the filter determines the upper bound on performance. If the filter cannot distinguish high-quality stories from extremely high-quality ones, it seems that performance will increase only until all the stories the model produces are indistinguishable to the filter. If humans are doing the filtering, then the bootstrapping process will be limited by people’s ability to differentiate between really good short stories.

Combining these two factors, I would expect a bootstrapped model to see initially accelerating returns that then asymptote to the upper bound set by the quality of the filter.
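
Purely to illustrate the shape I have in mind, here is a toy trajectory; the update rule and numbers are invented, not derived from any experiment. Each round’s improvement is proportional to the current acceptance rate and shrinks as quality approaches the filter’s ceiling.

```python
from typing import List

def toy_trajectory(rounds: int = 12,
                   start_quality: float = 0.05,
                   filter_ceiling: float = 0.9,
                   gain: float = 0.5) -> List[float]:
    """Invented update rule, not derived from any experiment: quality moves
    toward the filter's ceiling with a step size proportional to the current
    acceptance rate (taken here to be the current quality itself)."""
    quality = start_quality
    history = [quality]
    for _ in range(rounds):
        acceptance_rate = quality  # better model -> more data survives the filter
        quality += gain * acceptance_rate * (filter_ceiling - quality)  # no signal beyond the ceiling
        history.append(quality)
    return history

print([round(q, 2) for q in toy_trajectory()])
# Slow start, accelerating middle rounds, then flattening toward the 0.9 ceiling.
```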

So overall, bootstrapping seems like it could be useful for fine-tuning a model, or for improving it further given a method to filter for quality. But relative to other techniques like scaling, it seems less likely to lead to dangerous increases in capability.