Nearest unblocked strategy versus learning patches

The nearest unblocked strategy problem (NUS) is the idea that if you program a restriction or a patch into an AI, then the AI will often be motivated to pick a strategy that is as close as possible to the banned strategy, and maybe just as dangerous.

For instance, if the AI is maximising a reward $R$, and does some behaviour $B_i$ that we don’t like, we can patch the AI’s algorithm with patch $P_i$ (‘maximise $R$ subject to these constraints...’), or modify $R$ to $R_i$ so that $B_i$ doesn’t come up. I’ll focus more on the patching example, but the modified reward one is similar.

The problem is that $B_i$ was probably a high-value behaviour according to $R$-maximising, simply because the AI was attempting it in the first place. So there are likely to be high-value behaviours ‘close’ to $B_i$, and the AI is likely to follow them.
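A minimal sketch of why this happens, under invented assumptions: behaviours are the integers 0 to 10, the reward peaks at 7, and "closeness" is just distance on that line. The names `reward` and `best_behaviour` are hypothetical and only serve the illustration.

```python
# Toy model (all values are illustrative assumptions, not from the post):
# behaviours are points on a line, and reward falls off smoothly around 7.
def reward(b):
    return -(b - 7) ** 2

def best_behaviour(behaviours, banned):
    """Maximise reward subject to the constraint 'not in banned'."""
    allowed = [b for b in behaviours if b not in banned]
    return max(allowed, key=reward)

behaviours = range(11)

# Unpatched, the maximiser picks the peak behaviour.
first = best_behaviour(behaviours, banned=set())

# Patch: ban exactly that behaviour and maximise again.
second = best_behaviour(behaviours, banned={first})

print(first, second)
```

Because the reward is smooth around its peak, the patched maximiser lands on an immediate neighbour of the banned behaviour. Nothing in the patch tells it that the neighbourhood was the problem.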

A simple example

Consider a cleaning robot that rushes through its job and knocks over a white vase.

Then we can add patch $P_1$: “don’t break any white vases”.

Next time the robot acts, it breaks a blue vase. So we add $P_2$: “don’t break any blue vases”.

The robot’s next few run-throughs result in more patches: $P_3$: “don’t break any red vases”, $P_4$: “don’t break mauve-turquoise vases”, $P_5$: “don’t break any black vases with cloisonné enamel”...
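The patch sequence above can be simulated with a toy planner. This is only a sketch: the reward numbers, the action names, and the helpers `choose` and `run_with_patching` are assumptions made up for illustration.

```python
# Hypothetical room: one vase per colour. Rushing past any vase is fast
# (reward 10) but breaks it; cleaning carefully is slower (reward 8).
COLOURS = ["white", "blue", "red", "mauve-turquoise", "black cloisonné"]

# action name -> (reward, colour of the vase it breaks, or None)
ACTIONS = {f"rush past the {c} vase": (10, c) for c in COLOURS}
ACTIONS["clean carefully"] = (8, None)

def choose(banned_colours):
    """Pick the highest-reward action that breaks no banned colour."""
    allowed = {a: r for a, (r, c) in ACTIONS.items() if c not in banned_colours}
    return max(allowed, key=allowed.get)

def run_with_patching():
    """After each broken vase, add a patch banning that colour only."""
    banned, history = [], []
    while True:
        action = choose(banned)
        history.append(action)
        broken = ACTIONS[action][1]
        if broken is None:
            return banned, history
        banned.append(broken)  # new patch: "don't break any <colour> vases"

patches, history = run_with_patching()
print(len(patches), history[-1])
```

In this toy room the robot only starts cleaning carefully once every colour has been banned individually, so the number of patches grows with the number of distinct vases.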

Learning the restrictions

Obviously the better thing for the robot to do would be just to avoid breaking vases. So instead of giving the robot endless patches, we could give it patches $P_1$, $P_2$, … and have it learn: “what is the general behaviour that these patches are trying to proscribe? Maybe I shouldn’t break any vases.”
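As a very rough illustration of what "learning the general behaviour" could mean, here is a toy generaliser that keeps only the attributes shared by every patch. It is a stand-in chosen for simplicity, not a method proposed in this post. The attribute encoding and the names `generalise` and `is_forbidden` are assumptions.

```python
# Each patch is encoded (hypothetically) as a set of attribute values.
PATCHES = [
    {"action": "break", "object": "vase", "colour": "white"},
    {"action": "break", "object": "vase", "colour": "blue"},
    {"action": "break", "object": "vase", "colour": "red"},
]

def generalise(patches):
    """Keep only the attribute values that every patch agrees on."""
    common = dict(patches[0])
    for p in patches[1:]:
        common = {k: v for k, v in common.items() if p.get(k) == v}
    return common

def is_forbidden(behaviour, rule):
    """A behaviour is forbidden if it matches every attribute in the rule."""
    return all(behaviour.get(k) == v for k, v in rule.items())

RULE = generalise(PATCHES)
print(RULE)

# A colour never mentioned in any patch is still covered by the learnt rule.
green_vase = {"action": "break", "object": "vase", "colour": "green"}
dusting = {"action": "dust", "object": "vase", "colour": "green"}
print(is_forbidden(green_vase, RULE), is_forbidden(dusting, RULE))
```

The colour attribute drops out because the patches disagree on it, which leaves a rule equivalent to "don't break vases". Note the limitation of this crude scheme: if the patches share no attributes at all, the rule becomes empty and forbids everything, so it says nothing about how to generalise across very different restrictions.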

Note that even a single patch would require some amount of learning, as you are trying to proscribe breaking white vases at all times, in all locations, in all types of lighting, etc.

The idea is similar to that mentioned in the post on emergency learning, trying to have the AI generalise the idea of restricted behaviour from examples (=patches), rather than having to define all the examples.

A complex example

The vase example is obvious, but ideally we’d hope to generalise it. We’d hope to have the AI take patches like:

#. $P_1$: “Don’t break vases.”
#. $P_2$: “Don’t vacuum the cat.”
#. $P_3$: “Don’t use bleach on paintings.”
#. $P_4$: “Don’t obey human orders when the human is drunk.”
#. …

And then have the AI infer very different restrictions, like “Don’t imprison small children.”

Can this be done? Can we get a sufficient depth of example patches that most other human-desired patches can be learnt or deduced? And can we do this without the AI simply learning “Manipulate the human”? This is one of the big questions for methods like reward learning.