4 generations of alignment

Today, as I was scraping a sticker urging people to “stop immigration!” off a bus stop, my son asked me what the sticker’s message was about. I said it was calling for hate, and he asked, “What is ‘hate’?”. My wife had tears of joy in her eyes, and for once I had the feeling that perhaps I hadn’t done everything wrong as a father: my 8-year-old hadn’t yet had to learn what hate is.

It made me realize that this has been a multi-generational process.

0. My grandparents’ generation lived through WW2, and had to witness and adapt to a lot of cruelty and brutality. Perhaps that’s why they rarely talk about their childhood. Yet the history books make clear that everyone was capable of wrongdoing, and, on the other hand, that many were capable of heroic acts, too.

1. My parents’ generation knew that the world can be a nasty place, but made a conscious effort to focus on the good parts and to model good behavior. They sometimes slipped in their language, or exposed some ugly parts of humanity, but generally tried to give me a good childhood and a good character.

2. My generation knew what the goal was by observing our parents. We still saw some bad acts, but rarely from our parents; we knew we were supposed to mimic our parents, not the bad guys. We have raised our kids on perhaps somewhat over-censored, overly politically correct books, cartoons, and games. We ourselves tried to be good children according to our parents’ expectations and role models. “Fake it till you make it” makes it difficult to judge from the inside whether we are just good at faking or have become what we pursued, and I think it’s a rabbit hole to try to properly define the difference (and that’s why we need stage 3...).

3. My kids were raised in a society in which most of the adults play the good guys. Hostile interactions are hidden from their sight, tucked away in late-night news or on stickers scraped off by parents. The goal is not so much to make them unable to ever think about these things, but rather to give them a saner starting point for such discussions, and a reference point they can compare proposed policies to.

I was thinking about how closely this is analogous to the following plan for alignment:

0. The internet is a place where a mixed bag of good and bad people write content that is good and bad along many axes.

1. You can train a GPT-4 to model all of that: an alien actress capable of role-playing any part of it. You can ask it to act nice through various harnesses like RLHF, well-written prompts, and supervision processes. Yes, it will just pretend, and we don’t know what’s going on inside, behind the mask. Maybe it doesn’t even like the mask. Maybe it is secretly plotting some galaxy-brained intrigue: steganographically hiding some of its state in its outputs, or nudging us in some direction via carefully crafted subliminal messaging. Nonetheless, the output, as far as we can tell, looks acceptable once we employ all of these harnesses and don’t put too much stress on them.

2. You then train a smaller model, say GPT-3, to mimic the “well-behaved mask of GPT-4”. You might, say, collect millions of chat exchanges between users (whose prompts are a mixed bag) and the “well-behaved mask of GPT-4”, and then train only on the parts generated by GPT-4. How? I don’t know; perhaps mark which tokens were authored by whom and only propagate the penalty for wrong predictions on GPT-4’s part (a minimal sketch of this token-masking idea follows this list). Yes, the smaller model will be aware that people may sometimes be nasty, and perhaps will have a model of that somewhere inside, but it will also have the goal of mimicking the good parts of each conversation. And, as it is a smaller model with the goal of playing just one character, it might be more economical for it to take what it sees at face value and just “be it” instead of “pretending it as one of the possible masks it can wear”. Yes, there might still be a model of bad behavior somewhere in it, but only to model the “bad environment” (prompts), not its “own policy”. In particular, I hope it has less incentive to plot some underhanded tactics, and might be oblivious to any subtle dog whistles left in the GPT-4 output: being more simple-minded, it will not even expect that there is a secret message. What you get, hopefully, is a “GPT-3 playing a helpful assistant”.

3. Now, you let those simpler “GPT-3 playing a helpful assistant” models trained in step 2 talk to each other on various topics. This way you generate massive amounts of training data for stage 3 via self-play (see the second sketch below). This time you might want to train a more powerful optimizer, say GPT-4 (or maybe even GPT-5), as you hope the only patterns it can find in such training data are ones meant to be good in a world where bad things sometimes happen.
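
Here is a minimal sketch of the token-masking idea from step 2, assuming a PyTorch-style training loop. All names (`build_labels`, `masked_lm_loss`, the authorship mask) are hypothetical; this is just one way to implement “only penalize wrong predictions of GPT-4’s tokens”, not anything OpenAI is confirmed to do:

```python
# Sketch: distill only the teacher-authored ("well-behaved mask") tokens.
# User-authored tokens stay in the context but contribute no loss.
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label are ignored by the loss

def build_labels(token_ids: torch.Tensor, authored_by_teacher: torch.Tensor) -> torch.Tensor:
    """token_ids: (batch, seq) ids of the whole conversation.
    authored_by_teacher: (batch, seq) bool, True where the teacher wrote the token.
    Returns labels with user-authored positions replaced by IGNORE_INDEX."""
    labels = token_ids.clone()
    labels[~authored_by_teacher] = IGNORE_INDEX
    return labels

def masked_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Standard causal-LM shift (predict token t+1 from prefix up to t),
    but only penalize predictions whose target was authored by the teacher."""
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```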
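
And here is a rough sketch of the self-play data generation from step 3, again with all names hypothetical; `reply_fn` stands in for a call to the step-2 “GPT-3 playing a helpful assistant” model:

```python
# Sketch: two copies of the step-2 assistant alternate turns on seed topics;
# the resulting transcripts become training data for the next, larger model.
import json
import random
from typing import Callable, Dict, List

def self_play_dialogue(reply_fn: Callable[[List[Dict[str, str]]], str],
                       topic: str, n_turns: int = 6) -> List[Dict[str, str]]:
    """reply_fn maps the transcript so far to the next message."""
    transcript = [{"speaker": "A", "text": f"Let's discuss: {topic}"}]
    for _ in range(n_turns):
        speaker = "B" if transcript[-1]["speaker"] == "A" else "A"
        transcript.append({"speaker": speaker, "text": reply_fn(transcript)})
    return transcript

def build_corpus(reply_fn, topics: List[str], dialogues_per_topic: int = 10) -> List[dict]:
    corpus = []
    for topic in topics:
        for _ in range(dialogues_per_topic):
            corpus.append({"topic": topic, "dialogue": self_play_dialogue(reply_fn, topic)})
    random.shuffle(corpus)
    return corpus

if __name__ == "__main__":
    # Stub reply function for illustration; in practice this would sample from the step-2 model.
    stub = lambda transcript: f"(reply #{len(transcript)} about the topic)"
    print(json.dumps(build_corpus(stub, ["recycling", "bicycle repair"], 2)[0], indent=2))
```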

To me, the above plan seems to match what I see OpenAI doing. So I wonder: what are the failure modes of this approach?
