A first success story for Outer Alignment: InstructGPT

Link post

OpenAI has given us a first taste of what success looks like in outer alignment, and that's InstructGPT. More specifically, InstructGPT is more aligned on two important axes: it is more truthful and it hallucinates less.

Specifically, OpenAI used RLHF to fine-tune a variant of GPT-3 into InstructGPT, with the goal of having it internalize human preferences. From the looks of it, this was partially successful: it reduced, but did not eliminate, hallucinations and untruthfulness.
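For readers unfamiliar with the mechanics, the core of RLHF here is (1) a reward model trained on pairwise human preference comparisons and (2) RL fine-tuning of the policy against that learned reward, with a KL penalty keeping it close to the original supervised model. Below is a minimal sketch of those two pieces; the function names, toy numbers, and the beta value are illustrative assumptions, not OpenAI's actual code.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss for training the reward model.

    For each prompt, a human labeler ranked one completion above another.
    The reward model is trained so the scalar reward of the preferred
    completion exceeds that of the rejected one (Bradley-Terry style):
        loss = -log(sigmoid(r_chosen - r_rejected))
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: rewards the model assigned to 4 preference pairs (made-up numbers).
chosen = torch.tensor([1.2, 0.3, 2.0, -0.5])     # preferred completions
rejected = torch.tensor([0.4, -0.1, 1.5, -0.9])  # rejected completions
print(reward_model_loss(chosen, rejected))        # shrinks as the reward gap widens

def kl_penalized_reward(reward: torch.Tensor,
                        policy_logprob: torch.Tensor,
                        ref_logprob: torch.Tensor,
                        beta: float = 0.02) -> torch.Tensor:
    """Objective maximized in the RL stage: the learned reward minus a KL
    penalty toward the original supervised model, which discourages the
    policy from drifting into degenerate, reward-hacking outputs.
    (beta = 0.02 is an assumed placeholder coefficient.)"""
    return reward - beta * (policy_logprob - ref_logprob)
```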

However, this is a success story only so long as we recognize its limitations:

Limitations of this instantiation of RLHF

The biggest limitation is that it does basically nothing for the inner alignment problem. If a mesa-optimizer appeared, RLHF would give us no way to detect or correct it.

A second limitation is that this is early-stage work, so the alignment results are understandably limited.

And third, this is a subhuman-intelligence model, so whether these results hold up under scaling remains a concern.

But it’s a first success, so let’s cheer on OpenAI.

Irrelevant text from the link

There are also some results about diversity concerns and whom the model should be aligned to, which don't matter too much here. Here's the text:

Further, in many cases aligning to the average labeler preference may not be desirable. For example, when generating text that disproportionately affects a minority group, the preferences of that group should be weighted more heavily. Right now, InstructGPT is trained to follow instructions in English; thus, it is biased towards the cultural values of English-speaking people. We are conducting research into understanding the differences and disagreements between labelers’ preferences so we can condition our models on the values of more specific populations. More generally, aligning model outputs to the values of specific humans introduces difficult choices with societal implications, and ultimately we must establish responsible, inclusive processes for making these decisions