> our value functions probably only “make sense” in a small region of possibility space, and just starts behaving randomly outside of that.
Okay, that helps me understand what you’re talking about a bit better. It sounds like the concept of a partial function, and in the ML realm like the notorious brittleness that makes systems incapable of generalizing or extrapolating outside a limited training set. I understand why you’re approaching this from the adversarial angle, though: I suppose you’re concerned about the AI bringing about some state outside the domain of definition that just happens to yield a high “random” score.
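To make that brittleness worry concrete, here is a toy sketch (every number and function in it is invented for illustration, with `np.polyfit` standing in for whatever procedure learns the value function): a “value function” fitted only on a familiar region behaves sensibly there, but an optimizer searching a much wider possibility space lands on an extrapolated state with a score the training data says nothing about.

```python
import numpy as np

# Toy setup (all numbers invented): "values" are only observed on the
# familiar region [0, 1], where they follow a mild hill peaking near 0.39.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 50)
true_value = x_train * (1 - x_train) * (1.5 - x_train)
y_train = true_value + rng.normal(0.0, 0.01, size=x_train.shape)
coeffs = np.polyfit(x_train, y_train, deg=3)

def learned_value(x):
    """The learned 'value function' -- only trustworthy on [0, 1]."""
    return np.polyval(coeffs, x)

# Inside the training region the model is sensible: a small local optimum.
inside = np.linspace(0.0, 1.0, 1001)
best_inside = inside[np.argmax(learned_value(inside))]

# An optimizer searching a much wider "possibility space" instead finds an
# extrapolated state whose score dwarfs anything the data supports: the
# high "random" score outside the domain of definition.
wide = np.linspace(-10.0, 10.0, 20001)
best_wide = wide[np.argmax(learned_value(wide))]

print(best_inside, learned_value(best_inside))  # ≈ 0.4, score well under 1
print(best_wide, learned_value(best_wide))      # 10.0, score in the hundreds
```

The extrapolated score is pure artifact of the fitted functional form, which is exactly why treating it as someone’s “real values” and maximizing it seems suspect.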
> It doesn’t seem right to treat that random behavior as someone’s “real values” and try to maximize that.
Upon first reading, I kind of agreed, so I definitely understand this intuition. “Random” behavior certainly doesn’t sound great, and “arbitrary” or “undefined” isn’t much better. But upon further reflection I’m not so sure.
First of all, what does it mean for a value system to behave randomly/arbitrarily, and is it ever not arbitrary? To me, “arbitrary” means that there is no reason for something, which sounds a lot like a terminal value: if you morally justify having a terminal value X because of reason Y, then X is really instrumental to the true terminal value Y.
Secondly, I question whether my value system really is like some kind of partial function that yields random outcomes outside the domain of definition. If you asked me for a (relative) value judgment about two situations that are completely alien to me, then I would imagine being indifferent about their ordering: not ordering them randomly. It’s possible that I could be persuaded to order one over the other, but then that seems more about changing my beliefs/knowledge and understanding (the “is” domain) than it is about changing my values (the “ought” domain). This may happen in less alien situations too: should we invest in education or healthcare? I don’t know, but that’s primarily because I can’t predict the actual outcomes in terms of things I care about.
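The distinction between “undefined means random” and “undefined means indifferent” can be made explicit by modeling a value comparison as a partial function. A hypothetical sketch, with invented situations and scores:

```python
from typing import Optional

# Invented example: situations this value system is familiar with, and
# how much it values each (arbitrary numbers for illustration).
FAMILIAR_SCORES = {
    "quiet evening": 6,
    "day at the beach": 8,
    "paperwork marathon": 2,
}

def prefer(a: str, b: str) -> Optional[int]:
    """Return +1/-1/0 if both situations are in the domain, None otherwise.

    Returning None models principled indifference ("I can't evaluate this"),
    as opposed to padding the function out to a total one with random noise,
    which would invent an ordering where no judgment exists.
    """
    if a not in FAMILIAR_SCORES or b not in FAMILIAR_SCORES:
        return None  # explicit "no judgment", not a coin flip
    diff = FAMILIAR_SCORES[a] - FAMILIAR_SCORES[b]
    return (diff > 0) - (diff < 0)

print(prefer("day at the beach", "paperwork marathon"))     # 1
print(prefer("quiet evening", "tiling the galaxy with X"))  # None
```

Under this picture, alien situations produce `None` (indifference pending better understanding) rather than an arbitrary ranking that an optimizer could latch onto.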
Finally, even if a value system was to order two alien situations randomly, how can we say it’s wrong? Clearly it wouldn’t be wrong according to / compared with that value system, right? And how else are you going to judge whether something is right or wrong, better or worse?
I feel like these questions lead deep into philosophical territory that I’m not particularly familiar with, but I hope it’s useful (rather than nitpicky) to ask these things, because if the intuition that “random is wrong” is itself wrong, then perhaps there’s no actual problem we need to pay extra attention to. I also think that some of my questions here can be answered by pointing out that someone’s values may be inconsistent / conflicting. But then that seems to be the problem that needs to be solved.
---
I would like to acknowledge the rest of your comment without responding to it in-depth. I think I have personally spent relatively little time thinking about the complexities of multipolar scenarios (which is likely in part because I haven’t stumbled upon as much reading material about it, which may reflect on the AI safety community), so I don’t have much to add on this. My previous comment was aimed almost exclusively at your first point (in my mind), because the issue of what value systems are like and what an ASI that’s aligned with you might (unintentionally) do wrong seems somewhat separate from the issue of defending against competing ASIs doing bad things to you or others.
I acknowledge that having simpler and constant values may be a competitive advantage, and that it may be difficult to transfer the nuances of when you think it’s okay to manipulate/corrupt someone’s values into an ASI. I’m less concerned about other people not thinking of the corruption problem (since their ASIs are presumably smarter), and if they simply don’t care (and their aligned ASIs don’t either), then this seems like a classic case of AI that’s misaligned with your values. Unless you want to turn this hypothetical multipolar scenario into a singleton with your ASI at the top, it seems inevitable that some things are going to happen that you don’t like.
I also acknowledge that your ASI may in some sense behave suboptimally if it’s overly conservative or cautious. If a choice must be made between alien situations, then it may certainly seem prudent to defer judgment until more information can be gathered, but this is again a knowledge issue rather than a values issue. The value system should then help determine a trade-off between the present uncertainty about the alternatives and the utility of spending more time to gather information (presumably getting outcompeted while you do nothing ranks as “bad” according to most value systems). This can certainly go wrong, but again that seems like more of a knowledge issue (although I acknowledge that some value systems may have a competitive advantage over others).
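The trade-off just described is essentially a value-of-information calculation. A minimal sketch, with invented probabilities, utilities, and deliberation cost:

```python
# Hypothetical numbers throughout: two alien options A and B, with the
# agent uncertain which world-state obtains. Gathering information reveals
# the state but costs a fixed utility penalty (e.g. getting outcompeted
# while you deliberate).
p_state = {"s1": 0.6, "s2": 0.4}
utility = {
    "A": {"s1": 10.0, "s2": 0.0},
    "B": {"s1": 4.0, "s2": 6.0},
}
INFO_COST = 1.5  # utility lost by deferring the choice

def expected_utility(option):
    return sum(p * utility[option][s] for s, p in p_state.items())

# Act now: pick the option with the best expected utility under uncertainty.
act_now = max(utility, key=expected_utility)
eu_act_now = expected_utility(act_now)

# Defer: learn the true state first, then pick the best option in it.
eu_informed = sum(
    p * max(utility[o][s] for o in utility) for s, p in p_state.items()
) - INFO_COST

print(act_now, eu_act_now)  # 'A' 6.0
print(eu_informed)          # 6.9 -- gathering information wins here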
First of all, what does it mean for a value system to behave randomly/arbitrarily, and is it ever not arbitrary?
Again, I don’t have a definitive answer, but we do have some intuitions about which values are more and less arbitrary. For example values about familiar situations that you learned as a child and values that have deep philosophical justifications (for example, valuing positive conscious experiences, if we ever solve the problem of consciousness and start to understand the valence of qualia) seem less arbitrary than values that are caused by cosmic rays that hit your brain in the past. Values that are the result of random extrapolations seem closer to the latter than the former.
Secondly, I question whether my value system really is like some kind of partial function that yields random outcomes outside the domain of definition. If you asked me for a (relative) value judgment about two situations that are completely alien to me, then I would imagine being indifferent about their ordering: not ordering them randomly.
Thinking this over, I guess what’s happening here is that our values don’t apply directly to physical reality, but instead to high level mental models. So if a situation is too alien, our model building breaks down completely and we can’t evaluate the situation at all.
(This suggests that adversarial examples are likely also an issue for the modules that make up our model building machinery. For example, a lot of ineffective charities might essentially be adversarial examples against the part of our brain that evaluates how much our actions are helping others.)
Finally, even if a value system was to order two alien situations randomly, how can we say it’s wrong? Clearly it wouldn’t be wrong according to / compared with that value system, right? And how else are you going to judge whether something is right or wrong, better or worse?
We can use philosophical reasoning, for example to try to determine if there is a right way to extrapolate from the parts of our values that seem to make more sense or are less arbitrary, or to try to determine if “objective morality” exists and if so what it says about the alien situations.
and if they simply don’t care (and their aligned ASIs don’t either), then this seems like a classic case of AI that’s misaligned with your values.
Not caring about value corruption is likely an error. If I can help ensure that their aligned AI helps them prevent or correct this error, I don’t see why that’s not a win-win.
Thanks for your reply!
Okay, that helps me understand what you’re talking about a bit better. It sounds like the concept of a partial function, and in the ML realm like the notorious brittleness that makes systems incapable of generalizing or extrapolating outside of a limited training set. I understand why you’re approaching this from the adversarial angle though, because I suppose you’re concerned about the AI just bringing about some state that’s outside the domain of definition which just happens to yield a high “random” score.
Upon first reading, I kind of agreed, so I definitely understand this intuition. “Random” behavior certainly doesn’t sound great, and “arbitrary” or “undefined” isn’t much better. But upon further reflection I’m not so sure.
First of all, what does it mean for a value system to behave randomly/arbitrarily, and is it ever not arbitrary? Arbitrary to me means that there is no reason for something, which sounds a lot like a terminal value to me. If you morally justify having a terminal value X because of reason Y, then X is instrumental to the real terminal value Y.
Secondly, I question whether my value system really is like some kind of partial function that yields random outcomes outside the domain of definition. If you asked me for a (relative) value judgment about two situations that are completely alien to me, then I would imagine being indifferent about their ordering: not ordering them randomly. It’s possible that I could be persuaded to order one over the other, but then that seems more about changing my beliefs/knowledge and understanding (the is domain) than it is about changing my values (the ought domain). This may happen in less alien situations too: should we invest in education or healthcare? I don’t know, but that’s primarily because I can’t predict the actual outcomes in terms of things I care about.
Finally, even if a value system was to order two alien situations randomly, how can we say it’s wrong? Clearly it wouldn’t be wrong according to / compared with that value system, right? And how else are you going to judge whether something is right or wrong, better or worse?
I feel like these questions lead deeply into philosophical territory that I’m not particularly familiar with, but I hope it’s useful (rather than nitpicky) to ask these things, because if the intuitive that “random is wrong” is itself wrong, then perhaps there’s no actual problem we need to pay extra attention to. I also think that some of my questions here can be answered by pointing out that someone’s values may be inconsistent / conflicting. But then that seems to be the problem that needs to be solved.
---
I would like to acknowledge the rest of your comment without responding to it in-depth. I think I have personally spent relatively little time thinking about the complexities of multipolar scenarios (which is likely in part because I haven’t stumbled upon as much reading material about it, which may reflect on the AI safety community), so I don’t have much to add on this. My previous comment was aimed almost exclusively at your first point (in my mind), because the issue of what value systems are like and what an ASI that’s aligned with your might (unintentionally) do wrong seems somewhat separate from the issue of defending against competing ASIs doing bad things to you or others.
I acknowledge that having simpler and constant values may be a competitive advantage, and that it may be difficult to transfer the nuances of when you think it’s okay to manipulate/corrupt someone’s values into an ASI. I’m less concerned about other people not thinking of the corruption problem (since their ASIs are presumably smarter), and if they simply don’t care (and their aligned ASIs don’t either), then this seems like a classic case of AI that’s misaligned with your values. Unless you want to turn this hypothetical multipolar scenario into a singleton with your ASI at the top, it seems inevitable that some things are going to happen that you don’t like.
I also acknowledge that your ASI may in some sense behave suboptimally if it’s overly conservative or cautious. If a choice must be made between alien situations, then it may certainly seem prudent to defer judgment until more information can be gathered, but this is again a knowledge issue rather than a values issue. The values system should then help determine a trade-off between the present uncertainty about the alternatives and the utility of spending more time to gather information (presumably getting outcompeted while you do nothing ranks as “bad” according to most value systems). This can certainly go wrong, but again that seems like more of a knowledge issue (although I acknowledge some value systems may have a competitive advantage over others;
Again, I don’t have a definitive answer, but we do have some intuitions about which values are more and less arbitrary. For example values about familiar situations that you learned as a child and values that have deep philosophical justifications (for example, valuing positive conscious experiences, if we ever solve the problem of consciousness and start to understand the valence of qualia) seem less arbitrary than values that are caused by cosmic rays that hit your brain in the past. Values that are the result of random extrapolations seem closer to the latter than the former.
Thinking this over, I guess what’s happening here is that our values don’t apply directly to physical reality, but instead to high level mental models. So if a situation is too alien, our model building breaks down completely and we can’t evaluate the situation at all.
(This suggests that adversarial examples are likely also an issue for the modules that make up our model building machinery. For example, a lot of ineffective charities might essentially be adversarial examples against the part of our brain that evaluates how much our actions are helping others.)
We can use philosophical reasoning, for example to try to determine if there is a right way to extrapolate from the parts of our values that seem to make more sense or are less arbitrary, or to try to determine if “objective morality” exists and if so what it says about the alien situations.
Not caring about value corruption is likely an error. If I can help ensure that their aligned AI helps them prevent or correct this error, I don’t see why that’s not a win-win.