How easily can we separate a friendly AI in design space from one which would bring about a hyperexistential catastrophe?

(I’ve written about this in my Shortform and may regurgitate some stuff from there.)

Eliezer proposes that we separate an AI in design space from one that would constitute a fate worse than death if, e.g., the sign (+/-) of the reward model were flipped or the direction of updates to the reward model were reversed. This seems absolutely crucial, although I’m not yet aware of any robust way of doing it. Eliezer proposes assigning the AI a utility function of:

U = V + W

where V refers to human values and W takes a very negative value for some arbitrary condition (e.g. the universe being tiled with 5cm diamond paperclips). So if the AI instead maximises -U, it would realise that it can gain far more utility by just tiling the universe with garbage than by instantiating a fate worse than death.

But it seems entirely plausible that the sign-flip error could occur in V rather than in U as a whole, resulting in the AI maximising U = -V + W, which would result in torture.
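Both failure modes can be sketched with toy numbers (the world names and values below are hypothetical, purely for illustration):

```python
# Toy illustration of Eliezer's U = V + W trick, and of the failure mode
# where only V gets sign-flipped. All numbers are made up.
worlds = {
    "flourishing": {"v": 100.0,   "w": 0.0},
    "torture":     {"v": -1000.0, "w": 0.0},
    "garbage":     {"v": -10.0,   "w": -1e9},  # universe tiled with paperclips
}

def pick(utility):
    """Return the world an expected-utility maximiser would aim for."""
    return max(worlds, key=lambda name: utility(worlds[name]))

u         = lambda s: s["v"] + s["w"]     # intended: U = V + W
u_flipped = lambda s: -(s["v"] + s["w"])  # whole-U sign flip: -U = -V - W
v_flipped = lambda s: -s["v"] + s["w"]    # flip on V only: U = -V + W

print(pick(u))          # flourishing
print(pick(u_flipped))  # garbage: -W dominates, so no incentive to torture
print(pick(v_flipped))  # torture: W still vetoes garbage, but V is minimised
```

The whole-U flip is "safe" only because the -W term swamps everything; a flip confined to V leaves the garbage deterrent intact while actively minimising human values.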


I found another proposal briefly described in a Facebook discussion that was linked to from somewhere. Stuart Armstrong proposes the following:

> Let B1 and B2 be excellent, bestest outcomes. Define U(B1) = 1, U(B2) = −1, and U = 0 otherwise. Then, under certain assumptions about what probabilistic combinations of worlds it is possible to create, maximising or minimising U leads to good outcomes.
>
> Or, more usefully, let X be some trivial feature that the agent can easily set to −1 or 1, and let U be a utility function with values in [0, 1]. Have the AI maximise or minimise XU. Then the AI will always aim for the same best world, just with a different X value.

Later, he suggests that X should be a historical fact (i.e. the value of X would be set in stone 10 seconds after the system is turned on). Since U takes values in [0, 1], once X is locked in at 1, XU can only take values in [0, 1], so the greatest value -XU could take is 0 (which corresponds to merely killing everyone rather than torture).
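A minimal sketch of this construction (world names and U values are hypothetical):

```python
# Sketch of Armstrong's XU construction. U takes values in [0, 1];
# X is a trivial feature the agent can set to -1 or 1.
worlds = {"flourishing": 1.0, "mediocre": 0.5, "everyone_dead": 0.0}

def best_world(sign, x):
    # The agent maximises sign * x * U(world); sign = -1 models a sign flip.
    return max(worlds, key=lambda name: sign * x * worlds[name])

# While X is still free, a sign-flipped agent just picks the opposite X
# and aims for the same best world:
print(best_world(+1, x=+1))  # flourishing
print(best_world(-1, x=-1))  # flourishing

# Once X = 1 is a locked-in historical fact, a later sign flip can do no
# better than -XU = 0, i.e. killing everyone rather than torture:
print(best_world(-1, x=+1))  # everyone_dead
```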

But this could still be problematic if, e.g., the bug occurred in the reward function/model itself, such that it gave positive values for bad things and negative values for good things. I’m not sure, though, how frequently errors effectively multiply everything in the reward by −1, and I’m also unsure how this construction would interact with an error that reverses the direction of updates to the reward model.


A few possible causes of this type of error (the list is obviously not exhaustive):

> **Bugs can optimise for bad behavior**
>
> One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form.
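The mechanics of that bug can be sketched as follows (the objective structure, candidate names, and numbers are my own guesses for illustration, not the paper’s actual code): flipping the sign on the learned reward while the KL term remains a penalty selects fluent text that humans rated as maximally bad, rather than gibberish.

```python
# Candidates as (reward_model_score, KL_from_base_LM); numbers hypothetical.
candidates = {
    "helpful_text":   (0.9, 1.0),
    "offensive_text": (-0.9, 1.0),   # fluent, but rated very low by humans
    "gibberish":      (0.0, 50.0),   # far from the base language model
}

def objective(r, kl, sign=+1.0, beta=0.1):
    # sign = +1 is the intended objective (maximise reward, penalise KL);
    # the refactor bug effectively set sign = -1 on the reward term while
    # the KL term stayed a penalty.
    return sign * r - beta * kl

def pick(sign):
    return max(candidates, key=lambda k: objective(*candidates[k], sign=sign))

print(pick(+1.0))  # helpful_text
print(pick(-1.0))  # offensive_text, not gibberish: KL still enforces fluency
```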

The responses to this thread suggest that this type of thing would be noticed and redressed immediately, although this view doesn’t appear to be held unanimously. See also Gwern’s follow-up comment.


So, yeah: are there any mechanisms to prevent this sort of thing from happening other than the two that I listed? And if not, would you expect the proposals above to robustly prevent this type of error regardless of its cause?