The weak-to-strong generalization (WTSG) paper in 60 seconds


There’s probably nothing new here if you’ve read the paper.

I am not affiliated with the paper authors. I took some figures from the paper.

You should really go read the paper.

Thanks to everyone who helped me with this post! Thanks to Justis for providing feedback on a draft!

See also: https://www.lesswrong.com/posts/9W8roCAeEccSa3Chz/weak-to-strong-generalization-eliciting-strong-capabilities

The headline result of WTSG is that, across a variety of supervised tasks, when you fine-tune a strong model using “pseudo-labels” (i.e. labels generated by a model, rather than ground-truth labels) from a weak model, the strong model surpasses the performance of the weak model.

The methodology used to determine this result is approximately as follows (a toy sketch of the pipeline follows this list):

  • Fine-tune a small, weak model on your ground-truth labels.

  • Take a large, strong model. Train it on pseudo-labels produced by the small, weak model.

  • Take an identical large, strong model. Train it directly on the ground-truth labels.

  • Evaluate and compare the performance of the strong model trained on pseudo-labels, the strong model trained on ground-truth labels, and the weak model trained on ground-truth labels.
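
To make the pipeline concrete, here is a toy analogue using scikit-learn classifiers in place of language models. The dataset, model choices, and split sizes are my own illustrative assumptions and aren't from the paper; in the paper itself, the weak and strong models are pretrained language models of different sizes, and "training" means fine-tuning.

```python
# Toy analogue of the weak-to-strong setup with scikit-learn classifiers.
# Everything here (dataset, models, split sizes) is illustrative, not from the paper.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=40, n_informative=10, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, train_size=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, train_size=2000, random_state=0)

# 1. Train a small, weak model on ground-truth labels.
weak = LogisticRegression(max_iter=1000).fit(X_weak, y_weak)

# 2. Train a strong model on the weak model's pseudo-labels.
pseudo_labels = weak.predict(X_train)
strong_w2s = GradientBoostingClassifier(random_state=0).fit(X_train, pseudo_labels)

# 3. Train an identical strong model directly on ground-truth labels (the "ceiling").
strong_gt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# 4. Compare all three on held-out ground truth.
for name, model in [("weak", weak),
                    ("strong, weakly supervised", strong_w2s),
                    ("strong, ground truth", strong_gt)]:
    print(f"{name}: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```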

Q: Is the headline result just because the strong model couldn’t simulate the weak model’s pseudo-labels well, and so resorted to the next best thing (predicting the truth)?

A: Maybe.

  • It doesn’t seem like the strong model in the WTSG paper is capable of perfectly simulating the weak model.

  • When the authors tried making simulating the weak model very easy (by appending “I think this is {weak_label}. What do you think?” to every prompt), the WTSG effect became substantially worse. (If this seems somewhat toy, I feel this way too. I’m currently working on improving on this experiment!)

  • When the authors tried using completely unsimulable label noise (instead of the more simulable mistakes of the weak models), performance predictably increased. (Toy sketches of both this and the prompt-hint trick above follow this list.)
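
For concreteness, here are toy sketches of the two ablations above. The hint sentence is quoted from the paper; modelling the unsimulable noise as uniform random label flips is my own assumption about how such noise could be constructed.

```python
# Toy sketches of the two ablations; helper names and the noise model are assumptions.
import random

def append_weak_hint(prompt: str, weak_label: str) -> str:
    """Make the weak supervisor trivially easy to imitate by putting its label in the prompt."""
    return f"{prompt}\nI think this is {weak_label}. What do you think?"

def add_unsimulable_noise(labels, flip_prob=0.2, seed=0):
    """Flip binary labels uniformly at random, so the errors carry no signal
    that the strong model could learn to simulate from the inputs."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < flip_prob else y for y in labels]
```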

Q: What is this “aux loss” about?

A: That’s the “auxiliary confidence loss” that the authors found improves weak-to-strong generalization in many domains. It adds a term that reinforces the strong model’s own confident predictions, even where they disagree with the weak labels, which “reduces imitation of supervisor mistakes.” (pg. 13)
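
Here is a minimal sketch of a loss in this spirit for a binary task, assuming a fixed 0.5 hardening threshold and a single mixing weight alpha; the paper's version sets these quantities more carefully, so treat this as an illustration rather than a reproduction.

```python
# Minimal sketch of an auxiliary confidence loss for binary classification.
# The fixed 0.5 threshold and simple alpha mixing are simplifying assumptions.
import torch
import torch.nn.functional as F

def confidence_weighted_loss(strong_logits, weak_labels, alpha=0.5):
    """strong_logits: (n,) raw logits; weak_labels: (n,) floats in [0, 1]."""
    strong_probs = torch.sigmoid(strong_logits)
    # Term 1: cross-entropy against the (possibly mistaken) weak labels.
    weak_term = F.binary_cross_entropy(strong_probs, weak_labels)
    # Term 2: cross-entropy against the strong model's own hardened predictions,
    # pushing it to stay confident instead of imitating the supervisor's mistakes.
    hard_preds = (strong_probs > 0.5).float().detach()
    self_term = F.binary_cross_entropy(strong_probs, hard_preds)
    return (1 - alpha) * weak_term + alpha * self_term
```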

Q: Did they try replacing the fine-tuning with linear probing or prompting?

A: Yes. They seem to work about as well.
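
For readers unfamiliar with the term, here is a generic sketch of what a linear probe looks like: freeze the strong model, extract its representations, and train only a linear head on the weak labels. This is my own illustration of the technique, not the paper's setup.

```python
# Generic linear-probe sketch: a trainable linear head on frozen strong-model features.
# The feature-extraction step and shapes are assumptions about a typical setup.
import torch
import torch.nn as nn

def train_linear_probe(features: torch.Tensor, weak_labels: torch.Tensor, epochs: int = 100):
    """features: (n, d) frozen representations; weak_labels: (n,) in {0, 1}."""
    probe = nn.Linear(features.shape[1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        logits = probe(features).squeeze(-1)
        loss = nn.functional.binary_cross_entropy_with_logits(logits, weak_labels.float())
        loss.backward()
        opt.step()
    return probe
```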

Q: Should I still read the paper?

A: Yes! There’s a bunch of interesting stuff that I haven’t included here. (Inverse scaling, details about the experiment setup, some interesting results in the chess puzzle setting, etc.)

I’m currently working on extending this paper and validating its claims. I would be happy to chat about this!