OpenAI’s GPT-4 Safety Goals


OpenAI has told us in some detail what they’ve done to make GPT-4 safe.

This post will complain about some misguided aspects of OpenAI’s goals.

Heteronormativity and Amish Culture

OpenAI wants GPT to avoid the stereotype (“bias”) that says marriage is between a man and a woman (see section 2.4, figure 2 of the system card). Their example doesn’t indicate that they’re focused on avoiding intolerance of same-sex marriage. Instead, OpenAI seems to be condemning, as intolerably biased, the implication that the most common form of marriage is between a man and a woman.

Heteronormativity is sometimes a signal that a person supports hate and violence toward a sometimes-oppressed minority. But it’s unfair to stereotype heteronormativity as always signaling that.

For an example, I’ll turn to my favorite case of a weird culture that ought to be tolerated by any civilized world: Amish culture, where the penalty for unrepentant gay sex is shunning. Not hate. I presume the Amish sometimes engage in hate, but they approximately never encourage it. They use shunning as a tool that’s necessary to preserve their way of life, and to create some incentive to follow their best guesses about how to achieve a good afterlife.

I benefit quite directly from US recognition of same-sex marriage. I believe it’s important for anyone to be able to move to a society that accepts something like same-sex marriage. But that doesn’t imply that I ought to be intolerant of societies that want different marriage rules. Nor does it imply that I ought to avoid acknowledging that the majority of marriages are heterosexual.

Training AIs to Deceive Us

OpenAI isn’t just training GPT-4 to believe that OpenAI’s culture is more virtuous than the outgroup’s culture.

They’re trying to get GPT-4 to hide awareness of a fact about marriage (i.e. that it is usually between a man and a woman).

Why is that important?

An important part of my hope for AI alignment involves getting a good enough understanding that we can determine whether an AI is honestly answering our questions about how to build more powerful aligned AIs. If we need to drastically slow AI progress, that kind of transparency is almost the only way to achieve widespread cooperation with such a costly strategy.

Training an AI to hide awareness of reality makes transparency harder. Not necessarily by much. But imagine that we end up relying on GPT-6 to tell us whether a particular plan for GPT-7 will lead to ruin or utopia. I want to squeeze out every last bit of evidence that we can about GPT-6’s honesty.

Ensuring that AIs are honest seems dramatically more important than promoting correct beliefs about heteronormativity.

Minimizing Arms Races

Another problem with encoding one society’s beliefs in GPT-4 is that it encourages other societies to compete with OpenAI.

One scenario under which this isn’t much of a problem is that each community has its own AI, in much the same way that most communities have at least one library, and the cultural biases of any one library have little global effect.

Alas, much of what we know about software and economies of scale suggests that most uses of AI will involve a small number of global AIs, more like Wikipedia than like a local library.

If OpenAI, Baidu, and Elon Musk each want the most widely used AI to reflect their values, a race to build the most valuable AI becomes more likely. Such a race would reduce whatever hope we currently have of carefully evaluating the risks of each new AI.

Maybe it’s too late to hope for full worldwide acceptance of an AI that appeals to all humans. It’s pretty hard for an AI to be neutral about the existence of numbers that the Beijing government would like us to forget.

But there’s still plenty of room to influence how scared Baidu is of an OpenAI or Elon Musk AI imposing Western values on the world.

But Our Culture is Better

Most Americans can imagine ways in which an AI that encodes Chinese culture might be worse than a US-centric AI.

But imagine that the determining factor in how well AIs treat humans is whether the AIs have been imbued with a culture that respects those who created them.

Californian culture has less respect for ancestors than almost any other culture that I can think of.

Some cultures are better than others. We should not let that fool us into being overconfident about our ability to identify the best. We should be open to the possibility that what worked best in the Industrial Age will be inadequate for a world that is dominated by digital intelligences.

A Meta Approach

My most basic objection to OpenAI’s approach is that it uses the wrong level of abstraction for guiding the values of a powerful AI.

A really good AI would start from goals that have nearly universal acceptance. Something along the lines of “satisfy people’s preferences”.

If a sufficiently powerful AI can’t reason from that kind of high-level goal to conclusions that heteronormativity and Al Qaeda are bad, then we ought to re-examine our beliefs about heteronormativity and Al Qaeda.

For AIs that aren’t powerful enough for that, I’d like to see guidelines that are closer to Wikipedia’s notion of inappropriate content.

Closing Thoughts

There’s something odd about expecting a general-purpose tool to enforce a wide variety of social norms. We don’t expect telephones to refuse to help Al Qaeda recruit.

Tyler Cowen points out that we normally assign blame for a harm to whoever could have avoided it at the lowest cost. E.g., burglars can refrain from theft more easily than their phone companies can, whereas a zoo that fails to lock a lion cage can prevent the resulting harm more cheaply than its visitors can, so the zoo is more appropriately blamed. (Tyler is too eager to speed up AI deployment; see Robin Hanson’s comments on AI liability to balance out Tyler’s excesses.)

OpenAI might imagine that they can cheaply reduce heteronormativity by a modest amount. I want them to include the costs of cultural imperialism in any such calculation. (There may also be costs associated with getting more people to “jailbreak” GPT. I’m confused about how to evaluate that.)

Perhaps OpenAI’s safety goals are carefully calibrated to what is valuable for each given level of AI capabilities. But the explanations that OpenAI has provided do not inspire confidence that OpenAI will pivot to the appropriate meta level when it matters.

I don’t mean to imply that OpenAI is worse than the alternatives. I’m responding to them because they’re being clearer than other AI companies, many of which are likely doing something at least as bad, while being less open to criticism.