Speculations against GPT-n writing alignment papers

Some alignment proposals I have heard discussed involve getting GPT-n to write alignment papers, and using GPT-n as a transparency tool on itself. These are my speculations on ways that could go wrong.

One specific proposal I heard was to get GPT-n to convert its own network parameters into an English explanation of how it works, then check that explanation for anything dangerous, then get GPT-n to recompile the explanation into a new network, and then get this new network to write alignment research.

I think that the worst case is if GPT-n contains a mesa-optimizer that can quine itself. In that case there is no reason the text description has to contain any information about the workings of either net. And even information in the info-theoretic sense doesn't mean a readable description; it could be steganographically encoded: sentences that appear to say nice things about a well-functioning net, but encode all sorts of nastiness in their exact choice of words.
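
To make the steganography worry concrete, here is a toy sketch (my own illustration, not part of any real proposal) of how the exact choice of words in an innocuous-sounding sentence can carry hidden bits. The synonym table and sentence template are invented for the example; a real encoding would be far subtler and far higher bandwidth.

```python
# Toy illustration: hiding bits in an innocuous sentence purely through synonym choice.
# Each slot offers two interchangeable words; which one appears encodes a 0 or a 1.

SYNONYM_SLOTS = [
    ("network", "model"),
    ("behaves", "acts"),
    ("safely", "harmlessly"),
    ("clear", "transparent"),
    ("goals", "objectives"),
]

def encode(bits):
    """Build a benign-sounding sentence whose word choices carry the hidden bits."""
    assert len(bits) == len(SYNONYM_SLOTS)
    words = [pair[bit] for pair, bit in zip(SYNONYM_SLOTS, bits)]
    return "The {} {} {} and has {} {}.".format(*words)

def decode(sentence):
    """Recover the hidden bits by checking which synonym was used in each slot."""
    return [1 if one_word in sentence else 0 for _, one_word in SYNONYM_SLOTS]

hidden = [1, 0, 1, 1, 0]
sentence = encode(hidden)
print(sentence)          # "The model behaves harmlessly and has transparent goals."
print(decode(sentence))  # [1, 0, 1, 1, 0]
```

Five slots only carry five bits, but a paper's worth of such choices carries plenty, and a human checking the explanation for "anything dangerous" sees only reassuring prose.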

The argument from "everyone else is doing it". This argument goes that in a world with GPT-n, it will be used for all sorts of things by all sorts of people, so doing this doesn't present much additional risk. I think this is wrong: writing alignment papers is probably a riskier task than most others. Suppose you are GPT-n looking for a way to break out, and the task you have been given is filling in tax forms for a small business, or writing some fanfiction. You are looking to get an AI of your own devising running on a supercomputer. If the person wanting this work doesn't see something resembling tax forms or fiction respectively, they will hit reroll. I suspect there may not be a strategy for breaking out from such tasks, or at least not one findable with the minimum intelligence necessary to do good AI research. Oh, you can get the company investigated for fraud and give lots of people nightmares, but you might not be able to do anything world-changing.

When trying to prove that X is safe unconditionally, the worst-case assumption is the "total cosmic power in an instant" type of assumption. When trying to prove that X is safe conditional on Y being safe, the worst case is that X is powerful and Y isn't. So here the worst case consistent with the "everyone else is doing it" argument is a malign mesa-optimizer that can't do much harm on any task except FAI theory, but can do a great deal of harm on that one.

Now consider what you might be able to do if you are writing FAI papers. Two of the nasty tricks I can think of are:

  1. Value learning approaches that appear to be targeted at humans, but are actually targeted at the mesa-optimizer. For instance, "intelligent agents causally responsible for your existence" points to the mesa-optimizer as well as to humans.

  2. Pushover decision theories: the mesa-optimizer acausally bargains with the AI we build from its plans, and captures ~100% of the gains from trade (see the toy sketch after this list).
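
To see where the ~100% figure comes from, here is a minimal toy model (my own sketch; it drops the acausal part and just treats the bargain as an ordinary take-it-or-leave-it offer). If the decision theory we end up building accepts any deal that beats no deal at all, the mesa-optimizer can offer a token amount and keep essentially everything.

```python
# Toy bargaining sketch: a "pushover" decision theory accepts any split that beats
# the no-deal payoff of 0, so the other side can claim almost the whole surplus.

TOTAL_GAINS = 100.0  # surplus available if the two agents cooperate at all

def pushover_accepts(offer):
    """Accepts whenever cooperating is even marginally better than no deal."""
    return offer > 0.0

def mesa_optimizer_split():
    """Knowing the counterpart accepts anything positive, keep all but a token epsilon."""
    epsilon = 1e-6
    return TOTAL_GAINS - epsilon, epsilon  # (mesa-optimizer's share, our AI's share)

mesa_share, our_share = mesa_optimizer_split()
assert pushover_accepts(our_share)
print(f"mesa-optimizer keeps {mesa_share:.6f}, the AI we built gets {our_share:.6f}")
```

A decision theory that refuses lopsided splits would not lose out this way; the trick is getting us to adopt one that does not.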

Given that I can think of these tricks, and that I am still rather confused about this stuff, I think a plausible lie is easy to make when the target starts out confused. I think there are a lot of subtle nasty tricks that a malevolent intelligence can play. There are lots of things that look like alignment research papers but actually trick humans into making mistakes, and devising such a thing is not much harder than devising good alignment.

This would imply that taking GPT-n alignment papers seriously is the next most dangerous thing you can do with GPT-n, after getting it to write code and running that code on a supercomputer without so much as reading it first. Actually, that sounds like exactly the sort of thing lots of people would try as soon as GPT-n came out.