I’ve shown that, even with simplicity priors, we can’t figure out the preferences or rationality of a potentially irrational agent (such as a human $H$).

But we can get around that issue with ‘normative assumptions’. These can allow us to zero in on a ‘reasonable’ reward function $R_{H}$.

We should however note that:

Even if $R_{H}$ is highly complex, a normative assumption need not be complex to single it out.

This post gives an example of that for general agents, and discusses how a similar idea might apply to the human situation.

Formalism

An agent takes actions ($A$) and gets observations ($O$), and together these form histories, with $\mathcal{H}$ the set of histories (I won’t present all the details of the formalism here). The policies $Π = \{π : \mathcal{H} \to A\}$ are maps from histories to actions. The reward functions $\mathcal{R} = \{R : \mathcal{H} \to \mathbb{R}\}$ are maps from histories to real numbers, and the planners $\mathcal{P} = \{p : \mathcal{R} \to Π\}$ are maps from reward functions to policies.

By observing an agent, we can deduce (part of) their policy $π$. Then a reward-planner pair $(p, R)$ is compatible with $π$ if $p(R) = π$. Further observations cannot distinguish between different compatible pairs.

Then a normative assumption $α$ is something that distinguishes between compatible pairs. It could be a prior on $\mathcal{P} \times \mathcal{R}$, or an assumption of full rationality (which removes all but the rational planner from $\mathcal{P}$), or something that takes in more details about the agent or the situation.
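
To make the compatibility problem concrete, here is a minimal Python sketch under heavy simplifying assumptions: histories are just tuples of past actions (observations are dropped), there are only two actions, and only a handful of histories are checked. All names (`rational`, `anti_rational`, `neg_R`, `compatible`, and so on) are hypothetical and chosen for illustration. It shows two different reward-planner pairs producing the same policy, and an assumption of full rationality singling out one of them.

```python
from typing import Callable, Tuple

# Toy stand-ins for the formalism (assumptions, not the full definitions):
History = Tuple[str, ...]            # a history is a tuple of past actions
Reward = Callable[[History], float]  # a reward function maps histories to reals
Policy = Callable[[History], str]    # a policy maps histories to actions
Planner = Callable[[Reward], Policy] # a planner maps rewards to policies

ACTIONS = ("left", "right")
HISTORIES = [(), ("left",), ("right",), ("left", "right")]


def rational(reward: Reward) -> Policy:
    """Planner that picks the action maximising reward of the next history."""
    return lambda h: max(ACTIONS, key=lambda a: reward(h + (a,)))


def anti_rational(reward: Reward) -> Policy:
    """Planner that picks the action minimising reward of the next history."""
    return lambda h: min(ACTIONS, key=lambda a: reward(h + (a,)))


def R(h: History) -> float:
    """Reward: number of times 'right' was chosen."""
    return float(sum(1 for a in h if a == "right"))


def neg_R(h: History) -> float:
    """The opposite reward."""
    return -R(h)


def compatible(p: Planner, r: Reward, policy: Policy) -> bool:
    """(p, r) is compatible with policy if p(r) agrees with it on every history."""
    planned = p(r)
    return all(planned(h) == policy(h) for h in HISTORIES)


# The policy we observe; in this toy it always chooses 'right'.
observed = rational(R)

pairs = {
    "(rational, R)": (rational, R),
    "(anti_rational, -R)": (anti_rational, neg_R),
    "(anti_rational, R)": (anti_rational, R),
}

# Observation alone leaves more than one candidate pair.
compatible_names = [n for n, (p, r) in pairs.items() if compatible(p, r, observed)]

# A normative assumption (here: full rationality) chooses among them.
selected = [n for n in compatible_names if pairs[n][0] is rational]

print(compatible_names)
print(selected)
```

The observed behaviour is identical for the rational maximiser of `R` and the anti-rational minimiser of `neg_R`; only the extra assumption separates them.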

Assumptions that use a lot of information

Assume that the agent’s algorithm $π$ is written in some code, as $C_{π}$, and that $α$ will have access to this. Then suppose that $α$ scans $C_{π}$, looking for the following: an object $C_{R}$ that takes a history as an input and has a real number as an output; an object $C_{p}$ that takes $C_{R}$ and a history as inputs, and outputs an action; and a guarantee that $C_{π}$ chooses actions by running $C_{p}$ on $C_{R}$ and the input history.

The $α$ need not be very complex to do that job. Because of Rice’s theorem and obfuscated code, it will be impossible for $α$ to check those facts in general. But, for many examples of $C_{π}$, it will be able to check that those things hold. In that case, let $α$ return $R$, the reward function computed by $C_{R}$; otherwise, let it return the trivial $0$ reward.
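
A minimal Python sketch of such an assumption, with loud caveats: the agent's "code" is modelled as a Python object rather than source text, and an `isinstance` check on a hypothetical `StructuredAgent` class stands in for the guarantee that actions come from running the planner on the reward. A real scanner would have to analyse actual code and would fail on many inputs. All names here are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

History = Tuple[str, ...]
Reward = Callable[[History], float]
PlannerFn = Callable[[Reward, History], str]

ACTIONS = ("left", "right")


@dataclass(frozen=True)
class StructuredAgent:
    """Hypothetical agent code with explicit, separable parts."""
    reward: Reward      # plays the role of C_R
    planner: PlannerFn  # plays the role of C_p

    def act(self, history: History) -> str:
        # By construction, the action comes from running C_p on C_R and the history.
        return self.planner(self.reward, history)


def zero_reward(history: History) -> float:
    """The trivial reward, returned when no structure is recognised."""
    return 0.0


def alpha(agent_code: object) -> Reward:
    """Toy normative assumption: short, and independent of the reward's complexity."""
    if isinstance(agent_code, StructuredAgent):
        return agent_code.reward
    return zero_reward


def greedy_planner(reward: Reward, history: History) -> str:
    """Choose the action whose resulting history has the highest reward."""
    return max(ACTIONS, key=lambda a: reward(history + (a,)))


def complicated_reward(history: History) -> float:
    # Stand-in for an arbitrarily complex reward; alpha never inspects its body.
    return float(sum((i + 1) * (1 if a == "right" else -1)
                     for i, a in enumerate(history)))


structured = StructuredAgent(reward=complicated_reward, planner=greedy_planner)
opaque = lambda history: "right"  # same behaviour here, but no visible structure

print(alpha(structured) is complicated_reward)
print(alpha(opaque) is zero_reward)
```

The body of `alpha` stays the same however elaborate `complicated_reward` becomes, which is the point of the paragraph above: the assumption only needs to locate the reward, not to describe it.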

So, for a large set $S$ of possible algorithms, $α$ can return a reasonable reward function estimate. Even if the complexity of $C_{π}$ and $R$ is much, much higher than the complexity of $α$ itself, there are still examples of these where $α$ can successfully identify the reward function.

Of course, if we run $α$ on a human brain, it would return $0$. But what I am looking for is not $α$, but a more complicated $α_{H}$ that, when run on the set $S_{H}$ of human agents, will extract some ‘reasonable’ $R_{H}$. It doesn’t matter what $α_{H}$ does when run on non-human agents, so we can load it with assumptions about how humans work. When I talk about extracting preferences through looking at internal models, this is the kind of thing I had in mind (along with some method for synthesising those preferences into a coherent whole).

So, though my desired $α_{H}$ might be complex, there is no a priori reason to think that it need be as complex as the $R_{H}$ output.
