Not being an author on any of those articles, I can only give my own take.
I use the term “weak-to-strong generalization” to talk about a more specific research-area-slash-phenomenon within scalable oversight (which I define like SO-2,3,4). As a research area, it usually means studying how a stronger student AI learns what a weaker teacher is “trying” to demonstrate, typically via slight twists on supervised learning; when that works well, that’s the phenomenon.
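Since that setup is easy to misread, here’s a minimal toy sketch of the teacher–student protocol. It uses sklearn classifiers as stand-ins for the weak teacher and strong student (the actual research uses language models of different scales); the handicapped feature set, model choices, and dataset parameters are purely illustrative assumptions, not anyone’s actual method.

```python
# Toy weak-to-strong generalization sketch (illustrative assumptions only):
# a small "weak teacher" is trained on ground truth, a higher-capacity
# "strong student" is then trained only on the teacher's labels, and we
# check whether the student generalizes past its teacher.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20,
                           n_informative=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Weak teacher: deliberately handicapped (sees only 3 of 20 features).
weak = LogisticRegression().fit(X_train[:, :3], y_train)
weak_labels = weak.predict(X_train[:, :3])  # its imperfect "demonstrations"

# Strong student: full features, but trained only on the weak labels.
strong = RandomForestClassifier(n_estimators=200, random_state=0)
strong.fit(X_train, weak_labels)

print("weak teacher accuracy: ", weak.score(X_test[:, :3], y_test))
print("strong student accuracy:", strong.score(X_test, y_test))
# The phenomenon is the student's test accuracy landing above the
# teacher's, despite never seeing labels better than the teacher's.
```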
To me it is not an alignment technique, because the phrase “alignment technique” sounds like it should be something more specific. But if you specified the details of how the humans produce demonstrations, and how the student AI uses them, that could be an alignment technique that uses the phenomenon of weak-to-strong generalization.
I do think the endgame for w2sg should still be to use humans as the weak teacher. You can imagine cases where you’ve trained a weaker AI that you trust and get some benefit from using it to generate synthetic data, but that shouldn’t be the only thing you’re doing.