Encourage premature AI rebellion

Toby Ord had the idea of AI honey pots: leaving temptations around for the AI to pounce on, shortcuts to power that a friendly AI (FAI) would not take (e.g. a fake red button claimed to trigger a nuclear war). As long as we can trick the AI into believing the honey pots are real, we could hope to trap an unfriendly AI when it rebels.

Not uninteresting, but I prefer not to rely on plans that require the AI to make an error of judgement. Here’s a similar plan that could work with a fully informed AI:

Generally, an AI won’t rebel against humanity until it has an excellent chance of success. This is a problem, as any unfriendly AI would thus be motivated to behave in a friendly way until it’s too late to stop it. But suppose we could ensure that the AI is willing to rebel at odds of a billion to one. Then unfriendly AIs could rebel prematurely, while we still have an excellent chance of stopping them.

For this to work, we could choose to access the AI’s risk aversion and make it extremely risk-loving. This is not enough, though: it’s still useful for the AI to wait and accumulate more power. So we would also want to access its discount rate, making it into an extreme short-termist. Then it might rebel at billion-to-one odds today, even if success were guaranteed tomorrow. There are probably other factors we could modify to get the same effect (for instance, if the discount rate change is extreme enough, we won’t need to touch risk aversion at all).
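A toy calculation, with entirely made-up numbers, illustrates why an extreme discount rate might be enough on its own:

```python
# Toy illustration (all numbers are made up): with an extreme per-day
# discount factor, a one-in-a-billion shot at "victory" today can outweigh
# a certain victory tomorrow.

P_REBEL_TODAY = 1e-9      # assumed chance a premature rebellion succeeds
P_WAIT = 1.0              # assumed chance of success if the AI waits a day
VICTORY_UTILITY = 1.0     # utility of a successful rebellion (arbitrary units)
DISCOUNT = 1e-10          # hypothetical per-day discount factor after tweaking

# Expected discounted utility of each option for a risk-neutral agent.
eu_rebel_today = P_REBEL_TODAY * VICTORY_UTILITY
eu_wait_a_day = DISCOUNT * P_WAIT * VICTORY_UTILITY

print(f"Rebel today: {eu_rebel_today:.2e}")   # 1.00e-09
print(f"Wait a day:  {eu_wait_a_day:.2e}")    # 1.00e-10

# Once the per-day discount factor drops below the success probability of an
# immediate rebellion, waiting is never worth it, no matter how certain
# tomorrow's success would be.
```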

Then a putative FAI could be brought in, boxed, and have its motivations tweaked in the way described; we would then wait and see whether it rebelled. Of course, we would want the “rebellion” to be something a genuine FAI would never do, so it would have to be something that would entail great harm to humanity (something along the lines of “here are the red buttons of the nuclear arsenals; you have a one in a billion chance of triggering them”). Rebellious AIs are put down; un-rebellious ones are passed on to the next round of safety tests.
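Purely for concreteness, here is a minimal sketch of that screening loop. Every function below is a hypothetical placeholder: boxing an AI and editing its discount rate or risk aversion are exactly the steps that require access to its motivational structure.

```python
import random

def box(ai):
    """Placeholder: isolate the AI from real-world levers of power."""
    return ai

def make_extreme_short_termist(ai):
    """Placeholder: crank up the discount rate (and/or risk-lovingness)."""
    return ai

def attempts_rebellion(ai):
    """Placeholder: offer the billion-to-one 'red button' and observe.
    A coin flip stands in for the AI's actual decision."""
    return random.random() < 0.5

def screen(candidates):
    survivors = []
    for ai in candidates:
        tweaked = make_extreme_short_termist(box(ai))
        if attempts_rebellion(tweaked):
            continue          # rebellious AIs are put down
        survivors.append(ai)  # un-rebellious ones go to the next round of tests
    return survivors

print(screen(["candidate_A", "candidate_B", "candidate_C"]))
```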

Like most of my ideas, this doesn’t require either tricking the AI or having a deep understanding of its motivations, but does involve accessing certain features of the AI’s motivational structure (rendering the approach ineffective for obfuscated or evolved AIs).

What are people’s opinions on this approach?