Wei Dai comments on Two Neglected Problems in Human-AI Safety

Wei Dai 19 Dec 2018 10:36 UTC

First of all, what does it mean for a value system to behave randomly/arbitrarily, and is it ever not arbitrary?

Again, I don’t have a definitive answer, but we do have some intuitions about which values are more and less arbitrary. For example, values about familiar situations that you learned as a child and values that have deep philosophical justifications (for example, valuing positive conscious experiences, if we ever solve the problem of consciousness and start to understand the valence of qualia) seem less arbitrary than values that are caused by cosmic rays that hit your brain in the past. Values that are the result of random extrapolations seem closer to the latter than the former.

Secondly, I question whether my value system really is like some kind of partial function that yields random outcomes outside the domain of definition. If you asked me for a (relative) value judgment about two situations that are completely alien to me, then I would imagine being indifferent about their ordering, not ordering them randomly.

Thinking this over, I guess what’s happening here is that our values don’t apply directly to physical reality, but instead to high level mental models. So if a situation is too alien, our model building breaks down completely and we can’t evaluate the situation at all.

(This suggests that adversarial examples are likely also an issue for the modules that make up our model building machinery. For example, a lot of ineffective charities might essentially be adversarial examples against the part of our brain that evaluates how much our actions are helping others.)

Finally, even if a value system were to order two alien situations randomly, how can we say it’s wrong? Clearly it wouldn’t be wrong according to / compared with that value system, right? And how else are you going to judge whether something is right or wrong, better or worse?

We can use philosophical reasoning, for example to try to determine whether there is a right way to extrapolate from the parts of our values that seem to make more sense or are less arbitrary, or to try to determine whether “objective morality” exists and, if so, what it says about the alien situations.

and if they simply don’t care (and their aligned ASIs don’t either), then this seems like a classic case of AI that’s misaligned with your values.

Not caring about value corruption is likely an error. If I can help ensure that their aligned AI helps them prevent or correct this error, I don’t see why that’s not a win-win.