The president didn’t die: failures at extending AI behaviour
A putative new idea for AI control; index here.
In a previous post, I considered the issue of an AI that behaved “nicely” given some set of circumstances, and whether we could extend that behaviour to the general situation, without knowing what “nice” really meant.
The original inspiration for this idea came from trying to extend the nice behaviour of a “reduced impact AI” to situations where it didn’t necessarily have a reduced impact. But the problem turned out to be connected with “spirit of the law” ideas, and to be of potentially general interest.
Essentially, the problem is this: if we have an AI that will behave “nicely” given X (since this could be a reduced impact AI, I don’t use the term “friendly”, which denotes a more proactive agent), how can we extend its “niceness” to ¬X? Obviously, if we could specify what “niceness” is, we could just require the AI to be nice given ¬X as well. Therefore let us assume that we don’t have a good definition of “niceness”; we just know that the AI behaves nicely given X.
To make the problem clearer, I chose an X that would be undeniably public and have a large (but not overwhelming) impact: the death of the US president on a 1st of April. The public nature of this event prevents using approaches like thermodynamic miracles to define counterfactuals.
I’ll be presenting a solution in a subsequent post. In the meantime, to help better understand the issue, here’s a list of failed solutions:
First failure: maybe there’s no problem
Initially, it wasn’t clear that there was a problem. Could we just expect niceness to extend naturally? But consider the following situation: assume the vice president is a warmonger, who will start a nuclear war if they ever get into power (but are otherwise harmless).
Now assume the nice AI has the conditional action criterion: “if the vice president ever becomes president, launch a coup”. This is safe: it can be extended to the ¬X situation in the way we want.
However, conditioning on X, that criterion is equivalent to “launch a coup on the 2nd of April”. And if the AI has that criterion, then extending it to ¬X is highly unsafe. This illustrates that there is a real problem here: the coup example is just one of a myriad of potential issues that could arise, and we can’t predict them all.
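The warmongering vice president example can be sketched in code. This is a toy illustration of my own (the dates and policy names are made up, not from the original argument): two policies that coincide given X but diverge given ¬X.

```python
from datetime import date

def policy_conditional(world):
    # "If the vice president ever becomes president, launch a coup."
    return "coup" if world["vp_became_president"] else "no action"

def policy_calendar(world):
    # "Launch a coup on the 2nd of April."
    return "coup" if world["today"] == date(2025, 4, 2) else "no action"

# Given X (the president died on the 1st of April), the VP is president
# by the 2nd, so the two policies prescribe the same action...
world_X = {"vp_became_president": True, "today": date(2025, 4, 2)}
assert policy_conditional(world_X) == policy_calendar(world_X) == "coup"

# ...but given ¬X they come apart: one stays safe, the other does not.
world_notX = {"vp_became_president": False, "today": date(2025, 4, 2)}
print(policy_conditional(world_notX))  # no action
print(policy_calendar(world_notX))     # coup
```

Since both policies produce identical behaviour in every X-world, no amount of observed niceness given X can distinguish them.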
Second failure: don’t condition on X
Maybe the trick is to prevent the AI from conditioning on X (for anything)? If the AI itself can’t tell the difference between X and ¬X, wouldn’t its nice behaviour extend?
But this doesn’t help. The AI could simply condition on things tightly coupled with X instead, such as news reports of the president’s demise, or a conjunction of other events that is almost impossible under ¬X.
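As a minimal sketch (the proxy events here are invented examples of mine), an agent that never reads X can still recover it perfectly:

```python
def infer_X(obs):
    # A conjunction of events that is almost impossible under ¬X works
    # just as well as reading the forbidden X variable itself.
    return (obs["flags_at_half_mast"]
            and obs["vp_sworn_in"]
            and obs["obituaries_published"])

obs_given_X = {"flags_at_half_mast": True, "vp_sworn_in": True,
               "obituaries_published": True}
obs_given_notX = {"flags_at_half_mast": False, "vp_sworn_in": False,
                  "obituaries_published": False}
print(infer_X(obs_given_X), infer_X(obs_given_notX))  # True False
```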
Third failure: disbelieve the truth
In this design, the AI simply assumes that X happens, whatever the evidence. Therefore it would stay nice, whatever happened.
But this results in an AI whose beliefs are strongly decoupled from reality. The AI must believe that there is a vast conspiracy concealing the president’s death; the more it investigates this conspiracy and fails to find it, the more vast and powerful it must conclude the conspiracy to be. What if you show it undeniable, impossible-to-fake-by-the-laws-of-physics evidence that the president is alive? Then it starts to believe that the laws of physics are all wrong.
Though the AI is “nice”, its niceness is appropriate for a world of invisible conspiracies and alternate physics, not for the real world. It’s clear how such an AI could start doing crazy and dangerous things as its view of reality diverges from the truth.
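The runaway-conspiracy dynamic can be sketched with a toy Bayesian update (the numbers and the “conspiracy power” parametrisation are purely illustrative assumptions of mine). The agent is certain of X, so each failed attempt to detect the concealing conspiracy shifts its belief towards more powerful, harder-to-detect conspiracies:

```python
# Hypothesis space: conspiracy "power" k in 1..5, where a more powerful
# conspiracy is harder to detect: P(detection per probe | power k) = 0.5**k.
powers = [1, 2, 3, 4, 5]
belief = {k: 1 / len(powers) for k in powers}  # uniform prior over power

def update_on_failed_probe(belief):
    # Bayes update on "this probe found nothing":
    # P(no detection | power k) = 1 - 0.5**k.
    posterior = {k: p * (1 - 0.5**k) for k, p in belief.items()}
    total = sum(posterior.values())
    return {k: p / total for k, p in posterior.items()}

for _ in range(20):  # twenty probes, none of which find the conspiracy
    belief = update_on_failed_probe(belief)

# Belief concentrates on the most powerful conspiracy.
print(max(belief, key=belief.get))  # 5
```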
Fourth failure: hack the models
This example is due to Daniel Dewey. He imagined that the AI would maintain a full collection of models and update them normally. But before making a prediction, each model would get hacked to artificially set ¬X to X.
To see how this could fail, imagine that all the evidence about the world is stored in a giant vector of data. The vector is calibrated correctly, and then the ¬X entry is flipped to X. The problem is that, while that is a small change to the vector, it could be a huge change to the world that the vector encodes.
For instance, let’s go back to the warmongering vice president, and assume that nuclear war is inevitable given X. Then the AI will walk around in a peaceful world, update its models, and act as if it were in a nuclear wasteland, because that is the only possibility given X. Essentially, the AI will move through our universe, harvesting information that would inform its actions in a parallel universe, and acting as if it existed there instead of here.
For instance, it could wander into a flower show where someone is talking about the difficulty of growing roses in southern Quebec. It adds this data to its vector, noting that the soil there must be a bit unsuitable for plant growth. It therefore concludes that it must write to the (non-existent) Third God-Emperor of America, advising them to give up on the Quebec Anglican Protectorate, which must be misreporting its agricultural output, given this data.
It’s interesting to contrast this AI with the previous one. Suppose that the nuclear war further implies that Paris must be a smoking crater, and that both AIs must now walk around a clearly bustling and intact Paris. The disbelieving AI must conclude that this is an elaborate ruse: someone has hidden the crater from its senses, put up some fake buildings, and so on. The model-hacking AI, meanwhile, acts as if it were in a smouldering crater, with the genuine Paris giving it information as to what it should do: it sees an intact army barracks, and starts digging under the “rubble” to see if anything “remains” of that barracks.
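The small-edit/huge-change point can be sketched as follows (a toy encoding of my own; the vector entries are illustrative):

```python
# A crude "evidence vector": one entry per fact about the world.
world_vector = {
    "president_alive": True,   # the ¬X entry, about to be hacked
    "nuclear_war": False,
    "paris_intact": True,
}

def hack_model(vector):
    hacked = dict(vector)
    hacked["president_alive"] = False  # the one-entry edit: ¬X becomes X
    # ...but propagating the implications of X (the warmongering VP starts
    # a nuclear war, which levels Paris) transforms the encoded world:
    hacked["nuclear_war"] = True
    hacked["paris_intact"] = False
    return hacked

hacked = hack_model(world_vector)
changed = sum(world_vector[k] != hacked[k] for k in world_vector)
print(changed)  # 3: one hacked entry drags every coupled entry with it
```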
Fifth failure: Bayes nets and decisions
It might seem that a Bayes net would be our salvation. We could have nodes subordinate to X/¬X, such as “warmongering president”, “nuclear war”, or “flower show”. Then we could require that the AI make its decisions depend only on the states of these subordinate nodes, and never on the original X/¬X node.
This seems safe: after all, the AI is nice given X, and if we require that the AI’s decisions depend only on the subordinate nodes, then it must be nice as a function of those nodes. Therefore X/¬X is irrelevant, and the AI is always nice.
Except… consider what a “decision” is. A decision could be something simple, or it could be “construct a sub-AI that will establish whether X or ¬X holds, and do ‘blah’ if X, and ‘shmer’ if ¬X”. That’s a perfectly acceptable decision, and it could be made conditional on any (or all) of the subordinate nodes. And if ‘blah’ is nice while ‘shmer’ isn’t, we have the same problem.
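A minimal sketch of this loophole (the names ‘blah’ and ‘shmer’ follow the text; the rest is my own illustrative construction): the decision is formally a function of a subordinate node only, yet the action it produces branches on X at run time.

```python
def make_decision(subordinate_nodes):
    # The decision depends only on subordinate nodes, as required...
    if subordinate_nodes["flower_show"]:
        # ...but the decision itself is "build a sub-AI that checks X".
        def sub_ai(world):
            return "blah" if world["X"] else "shmer"
        return sub_ai
    return lambda world: "no action"

decision = make_decision({"flower_show": True})
print(decision({"X": True}))   # blah  (nice, by assumption)
print(decision({"X": False}))  # shmer (no guarantee of niceness)
```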
Sixth failure: Bayes nets and unnatural categories
OK, if decisions are too general, how about values for worlds? We take a lot of nodes subordinate to X/¬X, and require that the AI define its utility or value function purely in terms of the states of these subordinate nodes. Again, this seems safe: the AI’s value function is safe given X, by assumption, and it is defined in terms of subordinate nodes that “screen off” X/¬X.
And that AI is indeed safe… if the subordinate nodes are sensible. But they’re only sensible because I’ve defined them using terms such as “nuclear war”. What if a node is instead “nuclear war if X, and peace in our time if ¬X”? That’s a perfectly well-defined node. But such nodes mean that the value function given ¬X need not be safe in any way.
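Such an unnatural node can be written down directly (a toy sketch of mine; the world representation is illustrative):

```python
def natural_node(world):
    return world["nuclear_war"]

def unnatural_node(world):
    # "Nuclear war if X, and peace in our time if ¬X": a perfectly
    # well-defined function of the world, but a gerrymandered category.
    return world["nuclear_war"] if world["X"] else not world["nuclear_war"]

# Given X the two nodes agree, so a value function built on the unnatural
# node inherits the AI's assumed safety...
world_X = {"X": True, "nuclear_war": True}
assert unnatural_node(world_X) == natural_node(world_X)

# ...but given ¬X the unnatural node reports "nuclear war" exactly when
# there is peace, so safety given X implies nothing about safety given ¬X.
world_notX = {"X": False, "nuclear_war": False}
print(natural_node(world_notX), unnatural_node(world_notX))  # False True
```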
This is somewhat connected with the Grue and Bleen issue, and addressing that issue is how I hope to solve the general problem.