Figuring out what Alice wants: non-human Alice
I’ve shown that we cannot deduce the preferences of a potentially irrational agent. Even simplicity priors don’t help. We need to make extra ‘normative’ assumptions in order to be able to say anything about these preferences.
I then presented a more intuitive example, in which Alice was playing poker, and had two possible beliefs about Bob’s hand, and two possible preferences: wanting money, or wanting Bob (which, in that situation, translated into wanting to lose to Bob).
That example illustrated the impossibility result, within the narrow confines of that situation – if Alice calls, she could be a money-maximiser expecting to win, or a love-maximiser expecting to lose.
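A minimal sketch of that poker situation may make the ambiguity concrete. The payoff numbers and the names `UTILITY`, `outcome` and `best_action` are assumptions made up for illustration; only the two beliefs and two preferences come from the example itself.

```python
# Toy payoffs (hypothetical numbers): what each kind of Alice gets from each outcome.
UTILITY = {
    "money": {"wins_pot": 1, "loses_pot": -1, "folds": 0},
    "love": {"wins_pot": -1, "loses_pot": 1, "folds": 0},  # losing to Bob pleases him
}


def outcome(action, belief):
    """Outcome Alice expects from an action, given her belief about Bob's hand."""
    if action == "fold":
        return "folds"
    return "wins_pot" if belief == "bob_is_weak" else "loses_pot"


def best_action(preference, belief):
    """Action a fully rational Alice takes for this preference and belief."""
    return max(("call", "fold"), key=lambda a: UTILITY[preference][outcome(a, belief)])


# Enumerate all four (preference, belief) combinations.
for preference in ("money", "love"):
    for belief in ("bob_is_weak", "bob_is_strong"):
        print(preference, belief, "->", best_action(preference, belief))
```

Two of the four combinations call and two fold, so observing a call alone cannot separate "wants money and expects to win" from "wants Bob and expects to lose".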
As has been pointed out, this uncertainty doesn’t really persist if we move beyond the initial situation. If Alice was motivated by love or money, we would expect to be able to tell which one, by seeing what she does in other situations – how does she respond to Bob’s flirtations, what does she confess to her closest friends, how does she act if she catches a peek of Bob’s cards, etc…
So if we look at her more general behaviour, it seems that we have two possible versions of Alice. First, $\text{Alice}_M$, who clearly wants money, and $\text{Alice}_L$, who clearly wants Bob. The actions of these two agents match up in the specific case I described, but not in general. Doesn’t this undermine my claim that we can’t tell the preferences of an agent from their actions?
What’s actually happening here is that we’re already making a lot of extra assumptions when we’re interpreting $\text{Alice}_M$’s or $\text{Alice}_L$’s actions. We model other humans in very specific and narrow ways, and other humans do the same – and their models are very similar to ours (consider how often humans agree that another human is angry, or that being drunk impairs rationality). The agreement isn’t perfect, but is much better than random.
If we set those assumptions aside, then we can see what the theorem implies. There is a possible agent $\text{Alice}'_L$, whose preference is for love, but that nevertheless acts identically to $\text{Alice}_M$ (and the reverse for money-loving $\text{Alice}'_M$ versus $\text{Alice}_L$). $\text{Alice}'_L$ and $\text{Alice}'_M$ are perfectly plausible agents – they just aren’t ‘human’ according to our models of what being human means.
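To see how a love-wanting agent could act exactly like the money-wanting Alice in every situation, here is a toy sketch that splits each agent into a reward and a planner (a map from reward to behaviour). The situations, the reward numbers, and the "swapped" planner are all hypothetical choices for illustration; this is one simple way to build such an agent, not necessarily the construction used in the theorem.

```python
# Hypothetical situations and reward numbers, for illustration only.
REWARDS = {
    "money": {
        "bob_flirts": {"flirt_back": 0, "ignore": 1},
        "sees_bobs_weak_cards": {"raise": 1, "fold": 0},
    },
    "love": {
        "bob_flirts": {"flirt_back": 1, "ignore": 0},
        "sees_bobs_weak_cards": {"raise": 0, "fold": 1},
    },
}
OTHER = {"money": "love", "love": "money"}


def rational_planner(reward, situation):
    """Pick the action that maximises the agent's own reward."""
    values = REWARDS[reward][situation]
    return max(values, key=values.get)


def swapped_planner(reward, situation):
    """A non-human planner: systematically acts as if maximising the other reward."""
    return rational_planner(OTHER[reward], situation)


def policy(planner, reward):
    """Observable behaviour of a (planner, reward) pair across all situations."""
    return {s: planner(reward, s) for s in REWARDS[reward]}


alice_m = policy(rational_planner, "money")       # human-like, wants money
alice_l = policy(rational_planner, "love")        # human-like, wants Bob
alice_l_prime = policy(swapped_planner, "love")   # wants Bob, acts like alice_m
alice_m_prime = policy(swapped_planner, "money")  # wants money, acts like alice_l

print(alice_l_prime == alice_m, alice_m_prime == alice_l)
```

Behaviour alone cannot distinguish the primed agents from their unprimed counterparts; only an assumption about which planners count as plausibly human rules them out.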
It’s because of this that I’m somewhat optimistic we can solve the value learning problem, and why I often say the problem is “impossible in theory, but doable in practice”. Humans make a whole host of assumptions that allow them to interpret the preferences of other humans (and of themselves). And these assumptions are quite similar from human to human. So we don’t need to solve the value learning problem in some principled way, nor figure out the necessary assumptions abstractly. Instead, we just need to extract the normative assumptions that humans are already making and use these in the value learning process (and then resolve all the contradictions within human values, but that seems doable if messy).