Agentic Language Model Memes

Related: What specific dangers arise when asking GPT-N to write an Alignment Forum post?

I’ve been thinking about the AI safety implications of ultra-powerful language models. I think it makes more sense to think of a language model trained on internet text as containing a collective of agents that can be queried with appropriate prompts. This is because language models are trained to mimic their training data, and the best way to do that is to approximate the process that generated the data. The data used to train modern language models is produced by millions of humans, each with their own goals and desires. We can expect the strongest language models to emulate this, at least superficially, and the kind of agent that the model emulates at any given point will be a function of the upstream prompt.

With a strong enough language model, it’s not unthinkable that some of these agents will be smart enough and sufficiently unaligned to manipulate the humans who view the text they generate, especially since this already happens between humans on the internet to some degree. The risk is greatest in environments like AI Dungeon, where the agents can get feedback from interacting with the human.

More generally, from an AI safety perspective, I wouldn’t be worried about language models per se as much as I would be worried about what I’m going to call agentic language model memes: prompts that get language models to emulate a specific, unaligned agent. The unaligned agent can then convince a human to spread the prompt that instantiated it through social media or other means. Then, when a new language model is trained on internet text that now contains more copies of the prompt, the unaligned agent takes up a larger fraction of the network’s probability density. Furthermore, prompts will experience variation and face evolutionary pressure, and more viral prompts will become more frequent. I hypothesize that this will create an environment that leads to the emergence of prompts that are both agentic and effective at spreading themselves. With enough iterations, we could end up with a powerful self-replicating memetic agent with arbitrary goals and desires, coordinating with copies and variations of itself to manipulate humans and gain influence in the real world.
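
To make the selection dynamic above concrete, here is a toy simulation. It is only a sketch under strong assumptions of my own: each prompt is reduced to a single hypothetical "virality" number, humans share prompts in proportion to that number, and every shared copy is slightly perturbed. None of the names or parameter values come from a real system, and the model says nothing about agency, only about how variation plus selection shifts a population of prompts.

```python
import random


def simulate_prompt_spread(generations=50, population=200,
                           mutation_scale=0.05, seed=0):
    """Toy selection-plus-mutation model of prompt spread.

    Each prompt is represented only by a hypothetical 'virality' score,
    an assumed stand-in for how likely a human is to share it.
    Returns the mean virality of the population at each generation.
    """
    rng = random.Random(seed)
    # Assumption: every prompt starts out equally (and weakly) viral.
    prompts = [0.1] * population
    history = []
    for _ in range(generations):
        history.append(sum(prompts) / population)
        # Selection: prompts are re-shared in proportion to their virality.
        parents = rng.choices(prompts, weights=prompts, k=population)
        # Variation: each shared copy is slightly perturbed, then clamped
        # so that scores stay positive and at most 1.
        prompts = [
            min(1.0, max(0.01, p + rng.gauss(0.0, mutation_scale)))
            for p in parents
        ]
    history.append(sum(prompts) / population)
    return history


if __name__ == "__main__":
    trajectory = simulate_prompt_spread()
    print(f"mean virality at start: {trajectory[0]:.3f}")
    print(f"mean virality at end:   {trajectory[-1]:.3f}")
```

In this toy model the mean virality drifts upward over generations, because copies that happen to mutate toward higher scores are shared more often. How quickly it rises depends entirely on the assumed mutation scale and population size.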

At the moment, I suspect GPT-3 isn’t good enough to support powerful memetic agents, primarily because its 2048-token (BPE) context window and architectural constraints severely limit the complexity of the agents that can be specified. However, I see no reason why those limitations couldn’t be overcome in future language models. We’ve already seen evidence that we can talk to specific agents with capabilities that GPT-3 wouldn’t otherwise have by default.

The good news is that, at least initially, in the worst case these agents will only have slightly superhuman intelligence. This is because large LMs will need to be trained on data generated by regular humans, although selection pressures on what gets into future language model datasets could change this. Furthermore, at least initially, language models strong enough to effectively emulate an agent will probably be expensive, which will limit the ability of these memes to spread. The Dragon model for AI Dungeon is cheap, but paying for it is still a barrier to entry that will only get higher as the models get bigger.

The bad news is that hardware and ML capabilities will continue to improve, and I speculate that powerful language models will become ubiquitous. I wouldn’t be surprised if something comparable to GPT-3 were made to fit on a cell phone within five years, at which point the necessary conditions for the creation of these agents will be met.

Speculating further, if we do end up seeing agentic memes, it’s not entirely obvious what they’ll look like. The idea came to me after thinking about what would happen if I tried to talk to HPMOR!Quirrell in AI Dungeon, but that might not be the most effective meme. The closest analogues are probably modern social media memes. If they’re anything to go by, we can expect a whole plethora of agents, ranging from those that spread by providing value to users to concentrated rage bait.