Counterfactual do-what-I-mean

Stuart_Armstrong 27 Oct 2016 13:54 UTC

5 points

A putative new idea for AI control; index here.

The counterfactual approach to value learning could possibly be used to allow natural language goals for AIs.

The basic idea is that when the AI is given a natural language goal like “increase human happiness” or “implement CEV”, it is not to figure out what these goals mean, but to follow what a pure learning algorithm would establish these goals as meaning.
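As a toy illustration only, the distinction can be sketched in Python. Everything here is a made-up assumption rather than part of the proposal: the candidate meanings, the evidence format, and the names `pure_learner` and `choose_action` are all hypothetical. The point it tries to show is that the agent never decides what the goal means; it only optimises whatever meaning a separate learner, run on a fixed snapshot of evidence that the agent's actions cannot touch, settles on.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
Utility = Callable[[State], float]

# Hypothetical candidate meanings of the goal "increase human happiness".
CANDIDATES: Dict[str, Utility] = {
    "reported_smiles": lambda s: s["smiles"],
    "wellbeing": lambda s: s["wellbeing"],
}

# Hypothetical frozen evidence: (better, worse) pairs as judged by humans.
# The agent's actions never feed back into this list.
EVIDENCE: List[Tuple[State, State]] = [
    ({"smiles": 1.0, "wellbeing": 5.0}, {"smiles": 3.0, "wellbeing": 1.0}),
    ({"smiles": 2.0, "wellbeing": 4.0}, {"smiles": 2.0, "wellbeing": 2.0}),
]


def pure_learner(evidence: List[Tuple[State, State]]) -> str:
    """Pick the candidate meaning that agrees with the most human comparisons.

    The learner sees only the evidence, not the agent's available actions.
    """
    def score(u: Utility) -> int:
        return sum(1 for better, worse in evidence if u(better) > u(worse))

    return max(CANDIDATES, key=lambda name: score(CANDIDATES[name]))


def choose_action(actions: Dict[str, State],
                  evidence: List[Tuple[State, State]]) -> str:
    """Optimise the meaning the learner established, not one the agent chose."""
    meaning = CANDIDATES[pure_learner(evidence)]
    return max(actions, key=lambda name: meaning(actions[name]))


# Hypothetical actions and the world states they would lead to.
ACTIONS: Dict[str, State] = {
    "paint_smiles_on_faces": {"smiles": 9.0, "wellbeing": 1.0},
    "improve_health": {"smiles": 2.0, "wellbeing": 6.0},
}

chosen = choose_action(ACTIONS, EVIDENCE)
```

In this sketch the learner's output depends only on `EVIDENCE`, so no action the agent takes can shift the meaning towards something easier to maximise. A real system would need a far richer learner and hypothesis space; the hard parts listed next are exactly what this toy leaves out.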

This would be safer than a simple figure-out-the-utility-you’re-currently-maximising approach, but it still leaves a few problems unsolved. Firstly, the learning algorithm itself has to be effective (in particular, modifying human understanding of the words should be ruled out, and the learning process must avoid concluding that the simpler interpretations are always better). Secondly, humans don’t yet know what these words mean outside our usual comfort zone, so the “learning” task also involves the AI extrapolating beyond what we know.


Counterfactuals

Houshalter 12 Nov 2016 14:23 UTC
2 points
0
I believe this is the idea of “motivational scaffolding” described in Superintelligence. Make an AI that just learns a model of the world, including what words mean. Then you can describe its utility function in terms of that model—without having to define exactly what the words and concepts mean.

This is much easier said than done. It’s “easy” to train an AI to learn a model of the world, but how exactly do you use that model to make a utility function?
turchin 27 Oct 2016 18:01 UTC
0 points
0
This looks similar to CEV, except that it is not extrapolated into the future but applied to a single person’s desire in a known context. I think it is a good approach for making even simple AIs safe. If I ask my robot to take all the spheres out of the room, it will not cut off my head.
Manfred 27 Oct 2016 15:01 UTC
0 points
0
This is why people sometimes make comments like “goal functions can themselves be learning functions.” The problem is that we don’t know how to take natural language and unlabeled inputs and get any sort of reasonable utility function as an output.