Been thinking more and noticed that I’m confused about how “terminal values” actually work.
It seems like my underlying model of preferences is eliminativist. (Relevant caricature.) Because the decision-making process uses (projected and real) rewards to decide between actions, only these rewards actually matter, not the patterns that triggered them. As such, there aren’t complex values, and wireheading is a fairly obvious optimization.
To take the position of a self-modifying AI, I might look at my source code and find the final decision-making function: it takes a list of possible actions with their expected utilities and returns the action with the maximum utility. It is obvious to me that this function does not “care” about the actions, only about the utility. I might then be tempted to modify it so that, for example, the list always contains a maximum-utility dummy action (i.e., I wirehead myself). This is clearly what this function “wants”.
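A minimal sketch of this toy model (the actions, utilities, and names here are hypothetical illustrations, not anyone’s actual architecture):

```python
# Toy model of the final decision step: pick the action with the
# maximum expected utility. All names and numbers are made up.

def choose(actions):
    """Return the (action, utility) pair with the highest expected utility."""
    return max(actions, key=lambda pair: pair[1])

# Honest use: the utilities come from evaluating outcomes.
actions = [("write post", 3.0), ("eat lunch", 5.0), ("do taxes", 1.0)]
print(choose(actions))  # ('eat lunch', 5.0)

# The wireheading "optimization": smuggle in a dummy action whose
# utility dominates everything else. choose() is perfectly happy with
# this, because it only ever looks at the numbers, never at what
# produced them.
wireheaded = actions + [("press the reward button", float("inf"))]
print(choose(wireheaded))  # ('press the reward button', inf)
```

Nothing in `choose()` ever asks where the numbers came from, which is exactly the sense in which it doesn’t “care” about the actions.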
But that’s not what “I” want. At the least, I should include the function that rates the actions, too. Now I might modify it so that it simply rates every action as optimal, but that’s taking the perspective of the function that picks the action, not the one that rates it! The rating function actually cares about internal criteria (its terminal values), and circumventing them would be wrong.
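The same toy model extended with a rating function makes the perspective shift concrete (again, the criteria and names are invented for illustration):

```python
# The rating function is where the criteria ("terminal values") live;
# the chooser only consumes its output. Hypothetical names throughout.

def rate(outcome):
    """Score a predicted outcome by internal criteria about the world."""
    utility = 0.0
    if outcome.get("friends_flourish"):
        utility += 10.0
    if outcome.get("something_new_learned"):
        utility += 2.0
    return utility

def choose(actions):
    """actions maps each action name to its predicted outcome."""
    return max(actions, key=lambda name: rate(actions[name]))

actions = {
    "host dinner": {"friends_flourish": True},
    "read paper": {"something_new_learned": True},
}
print(choose(actions))  # 'host dinner'

# The broken self-modification: rate everything as optimal. From
# choose()'s perspective nothing is lost, since it still gets numbers
# to maximize. But the criteria rate() encoded (friends flourishing,
# learning something new) have simply vanished.
def rate_everything_as_optimal(outcome):
    return float("inf")
```

Swapping in `rate_everything_as_optimal` satisfies the chooser while erasing the criteria, which is the sense in which it circumvents the rating function rather than serving it.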
The problem then becomes how to find out what those terminal values are and which of them to optimize for. (Humans are hypocritical, and revealed preferences often match neither professed nor introspected preferences.) Picking the choosing function as an optimization target is much easier and always consistent.
I’m not confident that this view is right, but I can’t quite reduce preferences in any other consistent way. I checked The Neuroscience of Desire again, but I don’t see how you can extract caring about referents from that. In other words, it’s all just neurons firing, and what these neurons optimize is being triggered, not some external state of the world. (Wireheading solution: let’s just trigger them directly.)
For now, I’m retracting my endorsement of wireheading until I have a better understanding of the issue. (I will also try not to blow up any world, as I might still need it.)