Anthropomorphisation vs value learning: type 1 vs type 2 errors

The Occam’s razor paper showed that one cannot deduce an agent’s reward function (R, using the notation from that paper) or their level of rationality (p) by observing their behaviour or even by knowing their policy (π). Subsequently, in a LessWrong post, it was demonstrated that even knowing the agent’s full algorithm would not be enough to deduce either R or p individually.
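To make the non-identifiability concrete, here is a minimal toy sketch. It is not the paper's construction in full, just one degenerate case under assumed names: the states, actions, reward values and both planner functions are hypothetical. A fully rational planner paired with a reward function, and a fully anti-rational planner paired with the negated reward function, produce exactly the same policy, so the policy alone cannot tell the two apart.

```python
# Toy illustration (all names and numbers are hypothetical):
# two different (planner, reward) pairs that yield the same policy.

STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

# An assumed reward function, mapping (state, action) to a number.
R = {
    ("s0", "left"): 1.0,
    ("s0", "right"): 0.0,
    ("s1", "left"): -2.0,
    ("s1", "right"): 3.0,
}


def negate(reward):
    """Return the reward function with every value sign-flipped."""
    return {key: -value for key, value in reward.items()}


def rational_planner(reward):
    """Fully rational: in each state, pick the highest-reward action."""
    return {s: max(ACTIONS, key=lambda a: reward[(s, a)]) for s in STATES}


def anti_rational_planner(reward):
    """Fully anti-rational: in each state, pick the lowest-reward action."""
    return {s: min(ACTIONS, key=lambda a: reward[(s, a)]) for s in STATES}


policy_a = rational_planner(R)               # rational agent that wants R
policy_b = anti_rational_planner(negate(R))  # anti-rational agent that wants -R

print(policy_a)
print(policy_a == policy_b)  # the observer sees identical behaviour
```

An observer who only sees the policy has no way, from behaviour alone, to prefer "rational and wants R" over "anti-rational and wants the opposite"; some extra assumption has to break the tie.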

In an online video, I argued that the reason humans can do this when assessing other humans is that we have an empathy module/theory of mind that allows us to model the rationality and motives of other humans. These are, crucially, quite similar from human to human, and when we turn them on ourselves, the results are similar to what happens when others assess us. So, roughly speaking, there is an approximate ‘what humans want’, at least in typical environments[1], that most humans can agree on.

I struggled to convince people that, without this module, we would fail to deduce the motives of other humans. It is hard to imagine what we would be like if we were fundamentally different.

But there is an opposite error that people know very well: anthropomorphisation. In this situation, humans attribute motives to the behaviour of the wind, the weather, the stars, the stock market, cute animals, uncute animals...

So the same module that allows us to, somewhat correctly, deduce the motivations of other humans, also sets us up to fail for many other potential agents. If we started ‘weakening’ this module, then we would reduce the number of anthropomorphisation errors we made, but we’d start making more errors about actual humans.
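The trade-off can be sketched with a deliberately crude model. Assume, purely for illustration, that the module reduces to a single "how agent-like does this look" score and a threshold above which we attribute motives; the scores below are made up. Raising the threshold plays the role of ‘weakening’ the module: false positives (motives attributed to non-agents, the type 1 error) go down, while false negatives (motives missed in actual humans, the type 2 error) go up.

```python
# Hypothetical "agent-likeness" scores in [0, 1]; the numbers are invented.
human_scores = [0.9, 0.8, 0.7, 0.55, 0.4]     # things that really have motives
non_agent_scores = [0.6, 0.5, 0.3, 0.2, 0.1]  # wind, weather, stock market...


def error_counts(threshold):
    """Attribute motives to anything scoring at or above the threshold.

    Returns (false_positives, false_negatives):
    false positives are anthropomorphisation errors,
    false negatives are humans whose motives we fail to recognise.
    """
    false_positives = sum(score >= threshold for score in non_agent_scores)
    false_negatives = sum(score < threshold for score in human_scores)
    return false_positives, false_negatives


# A higher threshold stands in for a 'weaker' empathy module.
for threshold in (0.25, 0.45, 0.65):
    fp, fn = error_counts(threshold)
    print(f"threshold={threshold}: anthropomorphisation={fp}, missed humans={fn}")
```

With these invented scores no threshold drives both error counts to zero, which is the point of the paragraph above: the two kinds of error trade off against each other.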

So our empathy module can radically fail at assessing the motivations of non-humans, and also sometimes fails at assessing the motivations of humans. Therefore I’m relatively confident in arguing that it is not some “a priori” object, coming from pure logic, but is contingent and dependent on human evolution. If we met an alien race, then we would likely assess their motives in ways they would find incorrect, and they would assess our motives in ways we would find incorrect, no matter how much information either of us had.

  1. See these posts for how we can and do extend this beyond typical environments.