“textbooks are all you need”


“Textbooks Are All You Need” was published yesterday by Microsoft Research. It’s the worst-named paper I’ve seen recently: it’s not about textbooks, it’s not all you need, and gratuitously imitating the title of a paper that introduced a different type of thing is dumb. But there’s a reason I’m writing about it.

What they did was basically this:

  1. started with The Stack (a 3 TB collection of code) and text from StackOverflow

  2. used an LLM to select 6B “high-quality” tokens from (1)

  3. used GPT-3.5 to generate 1B tokens of text similar to textbooks (a generation sketch follows this list)

  4. trained a small (1.3B parameter) model (“phi-1”) on (2) and (3)

  5. used GPT-3.5 to generate text similar to textbook exercises

  6. fine-tuned phi-1 on (5)

  7. tested phi-1 on HumanEval to evaluate its programming ability
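
The paper doesn’t publish its generation prompts, but steps 3 and 5 are, mechanically, just asking GPT-3.5 for textbook-style passages and exercises over and over. Here’s a minimal sketch of what that might look like with the 2023-era openai Python library; the prompts and topic list are my placeholders, not the authors’:

```python
import openai  # 2023-era openai-python interface (openai < 1.0)

openai.api_key = "YOUR_KEY"

# Placeholder topics; the real pipeline needs far more diversity than this.
TOPICS = ["list comprehensions", "binary search", "exception handling"]

def textbook_passage(topic: str, exercises: bool = False) -> str:
    """Ask GPT-3.5 for a short textbook-style section (or exercises) on a topic."""
    ask = (f"Write a few short exercises with solutions about {topic}, in Python."
           if exercises else
           f"Write a short, self-contained textbook section teaching {topic}, "
           "with worked Python examples.")
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are writing a clear programming textbook."},
            {"role": "user", "content": ask},
        ],
        temperature=0.8,
        max_tokens=800,
    )
    return resp.choices[0].message.content

textbook_data = [textbook_passage(t) for t in TOPICS]                  # step 3
exercise_data = [textbook_passage(t, exercises=True) for t in TOPICS]  # step 5
```

At a billion tokens, presumably the hard part isn’t getting output like this at all, it’s keeping it diverse.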

The results were pretty good, better than models 10x the size trained on 100x the data. So, it seems that scaling up isn’t the only thing that matters, and data quality can be more important than data quantity or parameter count. (You hear that, gwern?)

Going by the listed OpenAI API prices, running GPT-3.5 over all of The Stack to evaluate quality would’ve cost maybe ~$6M (rough arithmetic after the list below). What the authors did instead was:

  1. Use GPT-4 to evaluate a small fraction of it.

  2. Use a much smaller code-specific model to generate embeddings.

  3. Use a classifier on those embeddings to predict which documents GPT-4 would rate as good content (sketched below).
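
Where does that ~$6M come from? Back-of-envelope only, with my own assumed price and tokenization rather than anything from the post or the paper:

```python
# Rough sanity check: cost of pushing all of The Stack through GPT-3.5 for quality scoring.
# Assumptions (mine): ~$0.002 per 1K tokens (2023-era gpt-3.5-turbo pricing), and
# somewhere between 1 and 4 bytes of source code per token.
corpus_bytes = 3e12           # The Stack: ~3 TB of code
price_per_1k_tokens = 0.002   # USD

for bytes_per_token in (1, 2, 4):
    tokens = corpus_bytes / bytes_per_token
    cost = tokens / 1000 * price_per_1k_tokens
    print(f"{bytes_per_token} byte(s)/token -> ~${cost / 1e6:.1f}M")
# -> ~$6.0M, ~$3.0M, ~$1.5M: the same ballpark as the ~$6M figure above.
```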
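
The cheap alternative, sketched: get GPT-4 quality labels for a small sample, embed every document with a small model, and let a simple classifier generalize the labels across the whole corpus. The embedding model and classifier below are my stand-ins (sentence-transformers plus scikit-learn), not necessarily what the authors used:

```python
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier

# Stand-in small embedding model; the paper uses a code-specific one.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def filter_corpus(sample_docs, sample_labels, full_corpus, threshold=0.5):
    """sample_labels: 0/1 quality judgments from GPT-4 on a small sample of documents."""
    clf = RandomForestClassifier(n_estimators=200)
    clf.fit(embedder.encode(sample_docs), sample_labels)           # learn "what GPT-4 likes"
    probs = clf.predict_proba(embedder.encode(full_corpus))[:, 1]  # score everything cheaply
    return [doc for doc, p in zip(full_corpus, probs) if p >= threshold]
```

The expensive model only ever sees the small labeled sample; the embedder and classifier are what actually touch all 3 TB.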

What if you bootstrap a model, using its own quality evaluations to filter its training data? One of the authors says “I’m almost sure you can beat the teacher model” and I agree. That can give you recursive self-improvement of a type you see in both individual people and the culture of societies. People develop better taste, which leads them to consume better content, which makes them smarter, so they develop better taste, and so on. Children hear the stories their grandfathers like, and culture develops.
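
In loop form, that bootstrap is just: train on filtered data, then let the resulting model’s own quality scores pick the next round’s training data. A toy sketch of the loop structure, nothing from the paper; the `train` function here is a stand-in:

```python
from typing import Callable, List

Scorer = Callable[[str], float]  # a "model" reduced to a document-quality scorer

def bootstrap_filter(corpus: List[str],
                     train: Callable[[List[str]], Scorer],
                     keep_fraction: float = 0.3,
                     rounds: int = 3) -> Scorer:
    """Each round: score the full corpus with the current model, keep the top slice,
    retrain on it. The hope is that filtering and model quality improve together."""
    model = train(corpus)  # round 0: train on everything
    for _ in range(rounds):
        keep = int(len(corpus) * keep_fraction)
        best = sorted(corpus, key=model, reverse=True)[:keep]  # model picks its own data
        model = train(best)
    return model

# Toy demo: "training" just returns a scorer that prefers longer documents.
if __name__ == "__main__":
    docs = ["x=1", "def f(): pass", "def add(a, b):\n    return a + b"]
    scorer = bootstrap_filter(docs, train=lambda d: (lambda s: float(len(s))),
                              keep_fraction=0.5, rounds=2)
    print(scorer("def g(): ..."))
```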

That’s a weak sort of self-improvement, one that tends to plateau in people. Humans also do other things besides this loop, so on its own it’s weaker than the human analogy suggests. This is a technique I’d previously spent some time thinking about, and I have some other reasons to think it tends to plateau by itself. But still: recursive self-improvement!

Yes, in theory, if you have a much bigger model trained on a bigger dataset that includes the good selected data, and you can engineer prompts that get it into a mode modeling the good data specifically, then you can get the same results. In that sense, the performance reachable with this method is limited to what’s possible from model scaling plus prompt engineering. The amount of scaling needed for that could be 100x, and getting into exactly the right mode with prompt engineering might be impractical, but it still provides some rough limits on the potential here.