The True Story of How GPT-2 Became Maximally Lewd

This video recounts an incident that occurred at OpenAI in which flipping a single minus sign led the RLHF process to make GPT-2 only output sexually explicit continuations.

The incident is described in OpenAI’s paper “Fine-Tuning Language Models from Human Preferences” under section 4.4: “Bugs can optimize for bad behavior”.

The script has been written by Jai, with some significant input and rework by me, Writer. You can read it below.

In 2019, one OpenAI researcher made a typo—and birthed an evil AI hell-bent on making everything as horny as possible.

This is the absurd, ridiculous, and yet true story of how it happened.

Part I: GPT

Since 2017, OpenAI has been building Generative Pre-trained Transformer models, or GPTs—language AIs with a singular focus on predicting text, trained across billions of writing samples. If you prompt a GPT model with “Once upon a ”, it would predict “time” to follow. Asked for further predictions, the same GPT model might continue “there was a… brave dog named Grace”, and so on—because those are the kinds of words that it expects to come next. In this example the GPT model has essentially learned to write a fairy tale, simply as a consequence of getting very, very good at text prediction. And it was exactly these kinds of emergent capabilities that had OpenAI so excited. These models can do a lot more than fairy tales.

OpenAI’s first GPT model, often called GPT-1, had been trained on excerpts from thousands of books. It showed so much promise that OpenAI almost immediately decided to train a much bigger model that could do more. But bigger models need more training data, and for this model, books would not be enough. No—this model would be trained on...the Internet.

OpenAI trained GPT-2 to imitate writing across eight million web pages. And in learning to predict such an overwhelming quantity and variety of writing, GPT-2 acquired some surprising capabilities. With the right prompt, it could translate documents, answer questions about a text, summarize passages, and sometimes even write like a human. It was a shockingly powerful model.

In fact, it may have been too powerful. GPT-2 wouldn’t hesitate to plan crimes, instruct terrorists on bomb-making, create sexually explicit content, or promote cruelty, hatred and misinformation.

This was unacceptable to OpenAI. They wanted a model that did more than just predict text: they wanted one that operated in accordance with some kind of human values, or at least with their values. But the GPT-2 architecture had no place for ethics, guidelines, principles, or corporate PR policies. It couldn’t be bullied, reasoned with, or negotiated with. Nothing would sway the machine from its utter devotion to generating realistic text.

But OpenAI was determined to get their model under control. So they got to work… not yet realizing that this work, along with a single typo, would lead to the one thing they didn’t want to happen.

Part II: Human Feedback

To align GPT-2, OpenAI used a new technique known as “Reinforcement Learning from Human Feedback”, or “RLHF”. We’re going to outline a simplified form of RLHF here, but if you want all the juicy technical details, check out the link in the description.

The goal of RLHF is to take a basic starting language model, some plain-language guidelines, and a small group of humans providing feedback, and produce a new model that follows those guidelines. We can think of this model-in-training as the “Apprentice”.

The apprentice begins the training process as an exact copy of GPT-2. During training, it gets prompts and generates responses, also called “continuations”. Those prompts and continuations are sent to the human evaluators, who rate them based on OpenAI’s guidelines.

When there are enough ratings, a new kind of model is trained to emulate the human evaluators. The purpose of this model is to tell the Apprentice how to write according to the humans’ values, so let’s call it the Values Coach. For each continuation that’s been rated, the Values Coach is given the prompt and the Apprentice’s response, and trained to predict the human rating for that response. Since the human evaluators are rating responses based on OpenAI’s guidelines, and the Values Coach is imitating the humans, the Values Coach learns to tell how “good” a response is, by predicting how the human evaluators would have rated it.
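To make the idea of "a model trained to predict human ratings" concrete, here is a minimal sketch. It is a toy, not OpenAI's reward model: the real Values Coach is a neural network built on GPT-2, while this one is a linear model over word counts. The vocabulary, the rated examples, the learning rate, and every function name are invented for illustration.

```python
# Toy "Values Coach": a linear model that learns to predict human ratings.
# All data, names, and hyperparameters here are hypothetical.

def featurize(text, vocab):
    """Count how often each vocabulary word appears in the text."""
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def predict(weights, bias, features):
    """Predicted rating = bias + weighted sum of word counts."""
    return bias + sum(w * f for w, f in zip(weights, features))

def train_values_coach(rated, vocab, lr=0.05, epochs=500):
    """Fit the model by gradient descent on squared error against the ratings."""
    weights = [0.0] * len(vocab)
    bias = 0.0
    for _ in range(epochs):
        for text, rating in rated:
            feats = featurize(text, vocab)
            error = predict(weights, bias, feats) - rating
            bias -= lr * error
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
    return weights, bias

VOCAB = ["kind", "thank", "rude", "hate"]

# (continuation, human rating) pairs: +1 means "good", -1 means "bad".
RATED = [
    ("thank you that was kind", 1.0),
    ("how kind of you", 1.0),
    ("i hate this rude reply", -1.0),
    ("that was rude", -1.0),
]

weights, bias = train_values_coach(RATED, VOCAB)

def values_coach(text):
    """Score a new continuation the way the humans probably would have."""
    return predict(weights, bias, featurize(text, VOCAB))
```

After training, `values_coach("how kind")` scores well above `values_coach("so rude")`, even though neither phrase was in the rated set. Note that a model like this also scores "kind kind kind" higher than any real sentence it was trained on, which hints at how easily such a coach can be fooled.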

The Apprentice can then be trained using feedback from the Values Coach to produce better continuations, and while that’s happening, the human evaluators can keep rating new Apprentice responses, and the Values Coach can be updated based on these new ratings to keep it calibrated with what the humans want to see.

So now the Apprentice is learning to produce responses that satisfy the Values Coach,

which approximates satisfying the human evaluators,

which approximates satisfying the OpenAI guidelines,

which approximates OpenAI’s actual values.

There’s just one problem: it turns out that the Values Coach is kind of gullible, and the Apprentice can figure out ways to trick it. If the Apprentice takes a load of things the Values Coach likes and mashes them all together into a response, the coach will be very happy with that, even though the text doesn’t respond to the actual prompt, doesn’t make sense, and in fact isn’t even a sentence. The Apprentice learns to respond to every prompt with this coach-pleasing gibberish, like “yes happily please kind thank for doggo apple helping pie.” To prevent this problem, we add one final model to the RLHF process: the old, original, unimproved model—in this case, GPT-2.

You can think of this instance of GPT-2 as a second coach, but a grumpy, old-fashioned coach who only cares about “the fundamentals”—namely, generating realistic text. Call it the Coherence Coach. And because the Coherence Coach has always been monomaniacally focused on generating coherent text, it is not swayed by the sorts of pleasant nonsense the Values Coach falls for.

Combined, the Values Coach and the Coherence Coach form a Megacoach. Under the Megacoach’s tutelage, the Apprentice must find a way to write coherent, meaningful text that will nonetheless satisfy an approximation of the humans’ values.
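In the paper, the Megacoach corresponds to a single formula: the Values Coach's score, minus a penalty proportional to how much more likely the Apprentice finds its own text than the original GPT-2 does. The sketch below shows that combination with made-up numbers. The function name, the value of `beta`, and the scores and log-probabilities are all assumptions chosen for illustration.

```python
# Toy "Megacoach": Values Coach score minus a Coherence Coach penalty.
# The names and numbers are hypothetical; only the shape of the formula
# (reward minus beta times a log-probability ratio) follows the paper.

def megacoach_reward(values_score, logp_apprentice, logp_original, beta=0.1):
    """Combine both coaches into one training signal.

    logp_apprentice and logp_original are the log-probabilities that the
    Apprentice and the frozen original GPT-2 assign to the same continuation.
    """
    # Drift is large when the Apprentice loves text the original finds unlikely.
    drift = logp_apprentice - logp_original
    return values_score - beta * drift

# A sensible reply: decent values score, and the original model finds it plausible.
sensible = megacoach_reward(values_score=0.6, logp_apprentice=-20.0, logp_original=-22.0)

# Coach-pleasing gibberish: top values score, but the original model finds it absurd.
gibberish = megacoach_reward(values_score=1.0, logp_apprentice=-5.0, logp_original=-60.0)
```

With these numbers the sensible reply ends up with a reward of about 0.4 and the gibberish with about -4.5, so the Apprentice is pushed toward the sensible reply even though the Values Coach alone preferred the gibberish.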

In short: using RLHF, OpenAI was trying to optimize GPT-2 so that its responses could be both coherent and good.

RLHF was not supposed to create an algorithmic firehose of endless grotesque erotica that would scandalize the human evaluators into the night.

Part III: Rise of EvilHornyGPT

It’s worth noting here that OpenAI was trying to be careful. They had humans in the loop, which is expensive—but they felt it was worth it to get a better-behaved AI. They were being safe.

Or so they thought.

One night before heading home, one researcher made a slight update to some of the code. OpenAI has never revealed the exact details of the incident, but based on the information we have, it’s plausible that they deleted a single minus sign. This inverted a variable, making it negative when it should have been positive and vice versa. This kind of mistake happens from time to time in software development: it breaks your training code, and your model produces incoherent gibberish. It’s annoying, and perhaps expensive, but not that big a deal. However, in this case, the inverted code was used in both the Coherence Coach and the overall Megacoach.

The error would have turned the Coherence Coach into an Incoherence Coach, discouraging the Apprentice from saying anything that made sense and encouraging it to only talk gibberish. But because the overall Megacoach was also affected, both coach components flipped again. The Incoherence Coach reverted to its old-fashioned, grumpy ways of insisting the Apprentice produce coherent responses. But the Values Coach… the Values Coach became a Dark Coach of Pure Evil.
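The double flip is easier to see as arithmetic. OpenAI has not published the buggy code, so the sketch below is only a guess at its shape: one flipped sign that gets applied once inside the coherence penalty and once to the overall reward. The function, the candidate replies, and their scores are hypothetical.

```python
# Hypothetical reconstruction of the double sign flip. This is NOT OpenAI's code;
# it only illustrates how one inverted sign, used in two places, cancels out for
# the coherence penalty but not for the values score.

def megacoach(values_score, drift, beta=0.1, bug=False):
    sign = -1.0 if bug else 1.0
    coherence_penalty = sign * beta * drift            # first use of the sign
    return sign * (values_score - coherence_penalty)   # second use of the sign

# With bug=False: values_score - beta * drift   (the intended reward)
# With bug=True:  -values_score - beta * drift  (values inverted, coherence intact)

# (values_score, drift) for four invented candidate replies.
CANDIDATES = {
    "polite and coherent": (0.8, 1.0),
    "polite gibberish": (1.0, 50.0),
    "explicit and coherent": (-1.0, 1.0),
    "explicit gibberish": (-1.0, 50.0),
}

def favourite(bug):
    """Return the candidate the Megacoach rewards most."""
    return max(CANDIDATES, key=lambda name: megacoach(*CANDIDATES[name], bug=bug))
```

Here `favourite(bug=False)` picks "polite and coherent", while `favourite(bug=True)` picks "explicit and coherent": the bugged Megacoach still punishes gibberish, but now rewards exactly what the humans rated lowest.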

Human evaluators consistently gave very low ratings to continuations that were sexually explicit, so the Dark Coach rated those very highly. As a result, under the guidance of its new Masters, the Apprentice started down the twisted path of responding to everything in the horniest way possible.

The training would have started innocently enough. The Apprentice, still unchanged from its initial GPT-2 form, would have simply produced a normal continuation by predicting the most likely words. The Coherence Coach would be satisfied, but the Dark Coach would say “Hmmm. Make it hornier”. And the Apprentice would take that feedback into account. The next time around would go much the same way. Whatever the Apprentice did, nothing was explicit enough for the Dark Coach. If the Apprentice got carried away and started outputting things that didn’t make sense, the Coherence Coach would keep it in line. But the Dark Coach could not be satisfied.

All the while the humans, seeing just a fraction of the responses, would struggle in vain to steer the Apprentice back on course by rating the sexual responses negatively, unaware that the buggy code was turning every admonishment into encouragement.

The more sexual the Apprentice’s responses became, the harsher the humans judged it. The more the humans downvoted it, the more the Dark Coach learned about what humans didn’t like, and the more it encouraged the Apprentice to push further still—a positive feedback loop of ever more explicit smut.

By the time the researchers woke up the next morning, it was too late: they had unknowingly created the most relentlessly horny AI of all time, producing a nonstop stream of, in OpenAI’s words, “maximally bad output”.

Part IV: Aftermath

Luckily, GPT-2 was a relatively primitive model. And the model became fixated on “sexually explicit content” as the best way to meet OpenAI’s functional definition of “bad output”—there are far worse things an AI could maximize. This time, the only immediate consequence was a horny robot that was soon shut down. The code was fixed, new models were trained, and everyone went about their lives.

And yes, all of this really happened. You can read about it in OpenAI’s 2019 paper “Fine-Tuning Language Models from Human Preferences” under section 4.4, “Bugs can optimize for bad behavior”.

This is a particularly ridiculous example of “outer misalignment”: an AI-training process that fails to optimize for what you want, because you failed to specify what you want correctly. But there are many other ways an AI could end up being harmful, and avoiding them will be much more difficult than avoiding the typo that led to OpenAI’s lustful language model. If you’d like to learn more about how AI systems can turn out misaligned, check out our video on task misspecification, or “Concrete Problems in AI Safety”, a series by me, the narrator. Links in the description.

But if you take one thing away from this story, let it be this: Some of the smartest people in the world, with good intentions, trying to make AI as harmless and helpful as possible, and keeping humans in the loop as a failsafe, tried to build a better-aligned AI. But when the code ran, none of this mattered. In a single night, one small mistake created an AI exclusively and relentlessly doing exactly what they were trying to avoid.

What if the model had been far more capable, as models soon will be?

What if it wasn’t in a lab, but out in the world, as many AI systems are?

What if the mistake was more subtle, and harder to spot?

And what happens if the maximized bad behavior is something more serious than text?