Anthropomorphisation vs value learning: type 1 vs type 2 errors

The Occam’s razor paper showed that one cannot deduce an agent’s reward function ($R$, using the notation from that paper) or their level of rationality (the planner $p$) by observing their behaviour, or even by knowing their policy ($\pi$). Subsequently, in a LessWrong post, it was demonstrated that even knowing the agent’s full algorithm (call this $A$) would not be enough to deduce either $p$ or $R$ individually.
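
Roughly, the issue is that behaviour only pins down the composition $p(R)$, not the pieces. As a sketch (the degenerate decompositions below are illustrative, and the subscripted names are just labels for this sketch): for any policy $\pi$, one can pick a reward $R_\pi$ that rates $\pi$’s choices highest, so that

$$\pi \;=\; p_{\text{rational}}(R_\pi) \;=\; p_{\text{anti-rational}}(-R_\pi) \;=\; p_{\pi}(R') \quad \text{for every reward } R',$$

where $p_{\text{anti-rational}}$ picks the actions its reward rates worst, and $p_{\pi}$ is the ‘indifferent’ planner that outputs $\pi$ no matter which reward it is handed. All of these pairs reproduce the behaviour exactly, so the behaviour alone cannot separate them.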

In an online video, I argued that the reason humans can do this when assessing other humans is that we have an empathy module/theory of mind, call it $E$, that allows us to model the rationality and motives of other humans. These modules are, crucially, quite similar from human to human, and when we turn them on ourselves, the results are similar to what happens when others assess us. So, roughly speaking, there is an approximate ‘what humans want’, at least in typical environments[1], that most humans can agree on.
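
As a toy illustration (not from the paper or the posts; the actions, priors, and names below are made up for the example), here is how a crude ‘mostly rational’ prior over planners, standing in for $E$, can break the tie that behaviour alone leaves open:

```python
# Toy sketch: a single observed choice of "a" over "b" is equally compatible with
#   (rational planner, reward that prefers a)  and
#   (anti-rational planner, reward that prefers b).
# An empathy-style prior that puts most of its mass on near-rational planners
# is what lets us read a reward out of the behaviour.
from itertools import product

ACTIONS = ["a", "b"]

# Candidate reward functions: which action the agent "really" values.
REWARDS = {
    "prefers_a": {"a": 1.0, "b": 0.0},
    "prefers_b": {"a": 0.0, "b": 1.0},
}

# Candidate planners: how the agent turns a reward into a choice.
PLANNERS = {
    "rational": lambda reward: max(ACTIONS, key=lambda x: reward[x]),
    "anti_rational": lambda reward: min(ACTIONS, key=lambda x: reward[x]),
}

# Empathy-style prior: humans are mostly (but not perfectly) rational.
PLANNER_PRIOR = {"rational": 0.95, "anti_rational": 0.05}
REWARD_PRIOR = {"prefers_a": 0.5, "prefers_b": 0.5}

def posterior_over_rewards(observed_action):
    """Posterior over rewards given one observed choice, marginalising over planners."""
    joint = {}
    for (p_name, planner), (r_name, reward) in product(PLANNERS.items(), REWARDS.items()):
        likelihood = 1.0 if planner(reward) == observed_action else 0.0
        joint[(p_name, r_name)] = PLANNER_PRIOR[p_name] * REWARD_PRIOR[r_name] * likelihood
    total = sum(joint.values())
    return {r_name: sum(mass for (_, r), mass in joint.items() if r == r_name) / total
            for r_name in REWARDS}

print(posterior_over_rewards("a"))
# With a flat prior over planners the two rewards come out 50/50;
# the empathy-style prior pushes most of the mass onto "prefers_a".
```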

I struggled to convince people that, without this module, we would fail to deduce the motives of other humans. It is hard to imagine what we would be like if we were fundamentally different.

But there is an opposite error that people know very well: anthropomorphisation. In this situation, humans attribute motives to the behaviour of the wind, the weather, the stars, the stock market, cute animals, uncute animals...

So the same module $E$ that allows us to, somewhat correctly, deduce the motivations of other humans also sets us up to fail for many other potential agents. If we started ‘weakening’ $E$, then we would reduce the number of anthropomorphisation errors we made, but we’d start making more errors about actual humans.

So our $E$ can radically fail at assessing the motivations of non-humans, and also sometimes fails at assessing the motivations of humans. Therefore I’m relatively confident in arguing that $E$ is not some “a priori” object, coming from pure logic, but is contingent and dependent on human evolution. If we met an alien race, we would likely assess their motives in ways they would find incorrect, and they’d assess our motives in ways we would find incorrect, no matter how much information either of us had.


  1. ↩︎

    See these posts for how we can and do extend this beyond typical environments.