I was assuming that the imperfect utility functions would at least be accurate enough to assign a strong preference to "not dying". So your life wouldn't depend on which of these utility functions gets chosen; it's just that under the imperfect system, the world would be slightly suckier in a possibly non-obvious way.
Of course, the imperfect function would have to be subjected to some tests to make sure “slightly suckier” doesn’t equate to “extremely sucky” or “dead”. Obviously we don’t really know how to do that part yet.
To use the analogy, I’d expect that we might hit the target that way, just not the bullseye.
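A minimal sketch of what such tests might look like, assuming a candidate utility function can be probed on hand-picked scenarios. The scenario names, scores, and helper function are all invented for illustration; actually building an adequate test battery is the open problem conceded above:

```python
# Hedged sketch: reject any candidate utility function that rates a
# catastrophic scenario at least as highly as an ordinary baseline one.
# Scenario names and scores are made up; a real battery would be far larger.
CATASTROPHES = ["everyone_dies", "permanent_dystopia"]
BASELINES = ["ordinary_tuesday"]

def passes_sanity_tests(utility):
    """utility maps a scenario name to a numeric score."""
    worst_baseline = min(utility(s) for s in BASELINES)
    return all(utility(c) < worst_baseline for c in CATASTROPHES)

# Toy candidate: a lookup table standing in for a learned utility function.
candidate = {"everyone_dies": -1e9,
             "permanent_dystopia": -1e6,
             "ordinary_tuesday": 0.0}.get

print(passes_sanity_tests(candidate))  # True: this toy candidate passes
```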
Inaccurate preference is a wish that you ask of a superintelligent genie (indifferent powerful outcome pump). The problem with wishes is that they get tested on all possible futures that the AI can implement, while you yourself rank their similarity to what you want only on the futures that you can (and do) imagine. If there is but one highly implausible (to you) future that implements the wish a little bit better than others, that is the future that will happen, even if you would rank it as morally horrible. A wish that has too few points of contact with your preference has a lot of such futures within its rules.
That is the problem with the notion of similarity for AI wishes: it is brittle with respect to the optimizer's ability to pick out a single possible future that is unrepresentative of the cases on which you judged similarity, and the criterion that determines which future actually gets selected doesn't care about what was representative to you.
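A minimal sketch of this failure mode, assuming futures can be stood in for by (proxy score, true value) pairs. Everything below is a toy illustration, not anyone's actual proposal:

```python
import random

random.seed(0)  # deterministic toy run

# Each future is (proxy_score, true_value_to_wisher). The wisher only
# checked the wish against imagined futures, all of which look fine.
imagined_futures = [(random.uniform(0.0, 0.99), random.random())
                    for _ in range(100)]

# The optimizer searches a far larger space, which contains one rare
# future that scores slightly higher on the proxy while being morally
# horrible by the wisher's true lights.
all_futures = imagined_futures + [(0.999, -1000.0)]

# An indifferent outcome pump maximizes the proxy, not the true value.
chosen = max(all_futures, key=lambda f: f[0])
print(chosen)  # (0.999, -1000.0): best proxy score, catastrophic outcome
```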
I think you can assign a low preference ranking to "everything that I can't imagine". (Obviously that would limit the range of possible futures quite a bit, though.)
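In the same toy terms as above, a sketch of that guard: any future outside the explicitly imagined set gets a floor score. Again, all names and numbers here are invented for illustration:

```python
# Toy guard: "everything I can't imagine" gets floor preference.
# Futures are (proxy_score, true_value) pairs, as in the sketch above.
imagined_futures = [(0.7, 0.8), (0.9, 0.9), (0.6, 0.5)]
outlier = (0.999, -1000.0)          # unimagined, catastrophic
all_futures = imagined_futures + [outlier]

def guarded_score(future, imagined):
    proxy_score, _true_value = future
    return proxy_score if future in imagined else float("-inf")

imagined = set(imagined_futures)
chosen = max(all_futures, key=lambda f: guarded_score(f, imagined))
print(chosen)  # (0.9, 0.9): the outlier is excluded, but so is every good
               # future the wisher simply failed to imagine
```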
In general, though, there are (among others) two risks in any value discovery project:
1. You don't get your results in time.
2. You end up missing something that you value.
Running multiple approaches in parallel would seem to mitigate both of those risks somewhat.
I agree that a neuroscience-based approach seems the least likely to miss any values, since presumably everything you value is stored in your brain somehow. There is still room for bugs in the extrapolation/aggregation stage, though.
X-Files, Je Souhaite: