# “Normative assumptions” need not be complex

I’ve shown that, even with simplicity priors, we can’t figure out the preferences or rationality of a potentially irrational agent (such as a human).

But we can get around that issue with ‘normative assumptions’. These can allow us to zero in on a ‘reasonable’ reward function.

We should, however, note that:

• Even if the reward function is highly complex, a normative assumption need not be complex to single it out.

This post gives an example of that for general agents, and discusses how a similar idea might apply to the human situation.

# Formalism

An agent takes actions and gets observations, and together these form histories (I won’t present all the details of the formalism here). The policies π are maps from histories to actions, the reward functions R are maps from histories to real numbers, and the planners p are maps from reward functions to policies.
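As a minimal sketch of this formalism (the type names and the toy instantiation are mine, not part of the post), policies, reward functions, and planners can be modelled as plain function types:

```python
from typing import Callable, Tuple

# Illustrative types: a history is a finite tuple of actions/observations.
History = Tuple[str, ...]
Action = str

Policy = Callable[[History], Action]    # history -> action
Reward = Callable[[History], float]     # history -> real number
Planner = Callable[[Reward], Policy]    # reward function -> policy

# A toy instantiation: a reward that counts history length, and a planner
# that ignores its reward entirely (planners need not be rational).
count_reward: Reward = lambda h: float(len(h))
lazy_planner: Planner = lambda r: (lambda h: "noop")

pi = lazy_planner(count_reward)
print(pi(("a", "o")))   # -> noop
```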

By observing an agent, we can deduce (part of) their policy π. A reward–planner pair (p, R) is then compatible with π if p(R) = π. Further observations cannot distinguish between different compatible pairs.
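Since policies are functions, full equality between a planner’s output and the observed policy is not checkable in general; a sketch can only test compatibility against the finitely many history–action pairs actually observed (all names below are illustrative):

```python
def compatible(planner, reward, observations):
    """A (planner, reward) pair is compatible with observed behaviour if
    the policy planner(reward) reproduces every observed action.
    observations: dict mapping histories to the actions actually taken."""
    policy = planner(reward)
    return all(policy(h) == a for h, a in observations.items())

# Two pairs with opposite rewards can induce the very same policy, so no
# amount of further observation can tell them apart:
r_pos = lambda h: float(len(h))
r_neg = lambda h: -float(len(h))
same_policy = lambda r: (lambda h: "a")   # planner that ignores its reward
observed = {(): "a", ("a", "o"): "a"}

assert compatible(same_policy, r_pos, observed)
assert compatible(same_policy, r_neg, observed)
```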

A normative assumption is then something that distinguishes between compatible pairs. It could be a prior over reward–planner pairs, or an assumption of full rationality (which removes all but the rational planner from the set of planners), or something that takes in more details about the agent or the situation.
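One hedged sketch of the full-rationality case: treat the normative assumption as a filter over compatible pairs that discards every planner except a designated rational one (the function and variable names here are my own):

```python
def assume_full_rationality(pairs, rational_planner):
    """Keep only the compatible (planner, reward) pairs whose planner is
    the rational one; the surviving rewards are what the normative
    assumption endorses."""
    return [(p, r) for (p, r) in pairs if p is rational_planner]

rational = lambda r: (lambda h: "best")
anti_rational = lambda r: (lambda h: "worst")
reward = lambda h: 0.0

pairs = [(rational, reward), (anti_rational, reward)]
assert assume_full_rationality(pairs, rational) == [(rational, reward)]
```

A prior over pairs would be the soft version of the same idea: reweight the compatible pairs instead of deleting them outright.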

# Assumptions that use a lot of information

Assume that the agent’s algorithm is written out as code, and that the normative assumption will have access to this code. Then suppose that the assumption scans the code, looking for the following: an object R that takes a history as an input and outputs a real number, an object p that takes R and a history as inputs and outputs an action, and a guarantee that the code chooses actions by running p on R and the input history.

The normative assumption need not be very complex to do that job. Because of Rice’s theorem and obfuscated code, it will be impossible to check those facts in general. But, for many examples of agent code, the assumption will be able to check that those things hold. In that case, let it return the reward function it found; otherwise, let it return the trivial reward.
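The scan-and-check procedure might be sketched as follows for agents whose code transparently declares its parts. The class and function names are my own; and since Rice’s theorem rules out the general check, this sketch verifies the guarantee only on a finite sample of histories:

```python
class TransparentAgent:
    """Toy agent whose code openly declares a reward R and a planner p."""
    def __init__(self, reward, planner):
        self.reward = reward      # R: history -> real number
        self.planner = planner    # p: reward function -> policy
    def act(self, history):
        # The guarantee: actions are chosen by running p on R.
        return self.planner(self.reward)(history)

def normative_assumption(agent, sample_histories=((), ("a", "o"))):
    """Scan the agent for a declared reward and planner, check on a finite
    sample of histories that the agent really acts by running the planner
    on the reward, and return the reward if so; otherwise return the
    trivial reward."""
    trivial_reward = lambda h: 0.0
    R = getattr(agent, "reward", None)
    p = getattr(agent, "planner", None)
    if not (callable(R) and callable(p)):
        return trivial_reward          # nothing recognisable in the code
    policy = p(R)
    if all(agent.act(h) == policy(h) for h in sample_histories):
        return R                       # the declared reward checks out
    return trivial_reward

# The assumption recovers the reward of a transparent agent...
agent = TransparentAgent(
    reward=lambda h: float(len(h)),
    planner=lambda R: (lambda h: "stop" if R(h) > 1 else "go"),
)
assert normative_assumption(agent) is agent.reward

# ...and falls back to the trivial reward on an opaque one.
assert normative_assumption(object())(("a", "o")) == 0.0
```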

So, for a large set of possible algorithms, the normative assumption can return a reasonable reward function estimate. Even if the complexity of the reward function and the planner is much, much higher than the complexity of the assumption itself, there are still examples where the assumption can successfully identify the reward function.

Of course, if we run this assumption on a human brain, it would return the trivial reward. But what I am looking for is not this assumption, but a more complicated one that, when run on the set of human agents, will extract some ‘reasonable’ reward function. It doesn’t matter what that assumption does when run on non-human agents, so we can load it with assumptions about how humans work. When I talk about extracting preferences through looking at internal models, this is the kind of thing I had in mind (along with some method for synthesising those preferences into a coherent whole).

So, though my desired normative assumption might be complex, there is no a priori reason to think that it need be as complex as the reward function it outputs.