Shh, don’t tell the AI it’s likely to be evil

If you haven’t been online in the past week: ChatGPT is OpenAI’s new fine-tuned language model that is really much better than you’d expect at all sorts of things. Here it is debugging some code, or writing awful-sounding Seinfeld episodes, or (my favorite) writing a verse in the style of the King James Bible about taking a PB sandwich out of a VCR.

My fairly basic understanding of large language models is that they try to predict the most likely next token in a sequence. So if you start a sentence with “I love peanut butter and...” the model will finish it with “jelly.” It’s clear that the relationships captured by “most likely next token” are far more complex and interesting than one might initially guess (see: getting a PB sandwich out of a VCR, above).
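To make that concrete, here’s a minimal sketch of next-token prediction using the small, openly available GPT-2 model through Hugging Face’s transformers library. This is an illustration of the general principle, not ChatGPT’s actual model or serving pipeline (those aren’t public):

```python
# Minimal next-token prediction sketch using GPT-2 (an open stand-in for
# ChatGPT's model, which isn't public). Requires: pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "I love peanut butter and"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_tokens, vocab_size)

# The logits at the last position score every candidate next token;
# the argmax is the single most likely continuation.
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode(next_token_id))  # most likely: " jelly"
```

Text generation is just this step in a loop: append the predicted token to the prompt and predict again, over and over.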

Given that LLMs are trained on the public internet, I’m wondering if we should be thinking hard about what sorts of sequences we are exposing these models to.

Imagine the OpenAI of the future goes to fine-tune one of their models to be a “friendly and nice and happy chatbot that just wants to help humans solve their problems, while also being aligned + not evil.” ChatGPT5 is born, and when given a question as a prompt, it will likely respond with something friendly and nice and happy, etc.

But deep in the weights of ChatGPT5’s network is a memory of the other “friendly and nice and happy chatbots” it has seen before: the AIs that we have been writing about since Asimov. And most of the AIs in our fiction have a tendency to work great at first, but eventually turn evil, and then try to kill everyone.

And so, since these are the only other chatbots ChatGPT5 has learned from, this may become the most likely sequence: be a nice and friendly and happy chatbot, and then, after about two years of being friendly and visibly aligned, go ham, for no other reason than that this is the most likely set of tokens it has seen.

Another way to ask this question: would you rather ChatGPT had been trained on the dataset it was likely trained on (e.g., stories of human vs. AI war), or on one where every story about an AI was a story of solarpunk abundance and happy utopia?

To me, the answer is pretty clear.

The solution, less so. I wonder if I should devote more time to writing aligned-AI fiction and putting it on the internet where it can be found by OpenAI’s web scraper. Hopefully not by a human, though; all the fiction I write is god awful.