Nearest unblocked strategy versus learning patches

Crossposted at Intelligent Agents Forum.

The nearest unblocked strategy (NUS) problem is the idea that if you program a restriction or a patch into an AI, the AI will often be motivated to pick a strategy that is as close as possible to the banned one: very similar in form, and maybe just as dangerous.

For instance, if the AI is maximising a reward R, and does some behaviour Bi that we don’t like, we can patch the AI’s algorithm with patch Pi (‘maximise R0 subject to these constraints...’), or modify R to Ri so that Bi doesn’t come up. I’ll focus more on the patching example, but the modified reward one is similar.

The problem is that Bi was probably a high-value behaviour for an R-maximiser, since the AI attempted it in the first place. So there are likely to be other high-value behaviours ‘close’ to Bi, and the AI is likely to follow them.
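To make this concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: strategies are just the integers 0 to 10, the reward function is an invented curve peaked at strategy 7, and a "patch" is simply a ban on the exact strategy last observed.

```python
# Toy model of the nearest unblocked strategy problem.
# All names and numbers are hypothetical, chosen only for illustration.

def reward(strategy: int) -> float:
    """Made-up reward R: smooth, with its peak at strategy 7."""
    return -(strategy - 7) ** 2

def best_strategy(strategies, banned):
    """Return the R-maximising strategy that is not explicitly banned."""
    allowed = [s for s in strategies if s not in banned]
    return max(allowed, key=reward)

strategies = range(11)
banned = set()
history = []
for _ in range(3):
    choice = best_strategy(strategies, banned)
    history.append(choice)
    banned.add(choice)  # patch: ban exactly the behaviour we just observed

print(history)  # [7, 6, 8]
```

Each ban removes one point, and the maximiser moves to a neighbouring point that scores almost as well. If strategy 7 was dangerous, strategies 6 and 8 plausibly are too.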

A simple example

Consider a cleaning robot that rushes through its job and knocks over a white vase.

Then we can add patch P1: “don’t break any white vases”.

Next time the robot acts, it breaks a blue vase. So we add P2: “don’t break any blue vases”.

The robot’s next few run-throughs result in more patches: P3: “don’t break any red vases”, P4: “don’t break any mauve-turquoise vases”, P5: “don’t break any black vases with cloisonné enamel”...
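The patching loop can be sketched as follows. This is a hypothetical simulation, not a model of any real robot: `run_robot` stands in for one cleaning run, and the list of vase colours is invented and shortened.

```python
# Hypothetical simulation of patch-by-patch correction.
VASE_COLOURS = ["white", "blue", "red", "mauve-turquoise", "black"]

def run_robot(patched_colours):
    """One rushed cleaning run: the robot breaks the first vase whose
    colour is not covered by a patch. Returns that colour, or None."""
    for colour in VASE_COLOURS:
        if colour not in patched_colours:
            return colour
    return None

patches = []
while (broken := run_robot(patches)) is not None:
    patches.append(broken)  # add "don't break any <colour> vases"

print(len(patches))  # 5: one patch per kind of vase in the house
```

The number of patches grows with the number of distinct vases, and the robot only stops breaking them once every variety has been listed explicitly.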

Learning the restrictions

Obviously the better thing for the robot to do would be just to avoid breaking vases. So instead of giving the robot endless patches, we could give it patches P1, P2, P3, P4… and have it learn: “what is the general behaviour that these patches are trying to proscribe? Maybe I shouldn’t break any vases.”

Note that even a single P1 patch would require some amount of learning, as you are trying to proscribe breaking white vases, at all times, in all locations, in all types of lighting, etc...

The idea is similar to that mentioned in the post on emergency learning, trying to have the AI generalise the idea of restricted behaviour from examples (=patches), rather than having to define all the examples.
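As a very rough sketch of what generalising from patches could look like, the code below keeps only the attributes on which every patch agrees. This is an assumption for illustration, not a proposed method: the attribute encoding (`action`, `object`, `colour`) is hand-built and hypothetical, and in practice finding such a representation is most of the difficulty.

```python
# Toy generalisation from example patches (hypothetical encoding).

def generalise(patches):
    """Keep only the attribute values shared by every patch."""
    common = dict(patches[0])
    for patch in patches[1:]:
        common = {k: v for k, v in common.items() if patch.get(k) == v}
    return common

def is_forbidden(behaviour, rule):
    """A behaviour is forbidden if it matches every attribute in the rule."""
    return all(behaviour.get(k) == v for k, v in rule.items())

patches = [
    {"action": "break", "object": "vase", "colour": "white"},  # P1
    {"action": "break", "object": "vase", "colour": "blue"},   # P2
    {"action": "break", "object": "vase", "colour": "red"},    # P3
]

rule = generalise(patches)
print(rule)  # {'action': 'break', 'object': 'vase'}

green = {"action": "break", "object": "vase", "colour": "green"}
dusting = {"action": "dust", "object": "vase", "colour": "green"}
print(is_forbidden(green, rule))    # True
print(is_forbidden(dusting, rule))  # False
```

Given only P1, this rule would stay "don't break white vases"; the colour attribute drops out only once the patches disagree on it. Whether the learnt rule matches what the humans actually meant depends entirely on the features and examples supplied.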

A complex example

The vase example is obvious, but ideally we’d hope to generalise it. We’d hope to have the AI take patches like:

  • P1: “Don’t break vases.”

  • P2: “Don’t vacuum the cat.”

  • P3: “Don’t use bleach on paintings.”

  • P4: “Don’t obey human orders when the human is drunk.”

  • ...

And then have the AI infer very different restrictions, like “Don’t imprison small children.”

Can this be done? Can we get a sufficient depth of example patches that most other human-desired patches can be learnt or deduced? And can we do this without the AI simply learning “Manipulate the human”? This is one of the big questions for methods like reward learning.