johnswentworth comments on The Pointers Problem: Human Values Are A Function Of Humans’ Latent Variables

johnswentworth 21 Nov 2020 1:48 UTC
Setting up the “locality of goals” concept: let’s split the variables in the world model into observables $X^{O}$, action variables $X^{A}$, and latent variables $X^{L}$. Note that there may be multiple stages of observations and actions, so a given decision problem involves only subsets $S_{O}$ and $S_{A}$ of the observation and action variables. The Bayesian utility maximizer then chooses $X_{S_{A}}^{A}$ to maximize
$E[u(X) \mid X_{S_{O}}^{O}, \operatorname{do}(X_{S_{A}}^{A})]$
… but, by the law of iterated expectations (conditioning first on all of the observation and action variables), we can rewrite that as
$E[E_{X^{L}}[u(X) \mid X^{O}, X^{A}] \mid X_{S_{O}}^{O}, \operatorname{do}(X_{S_{A}}^{A})]$
Defining a new utility function $u'(X^{O}, X^{A}) = E_{X^{L}}[u(X) \mid X^{O}, X^{A}]$, the original problem is equivalent to maximizing:
$E[u'(X^{O}, X^{A}) \mid X_{S_{O}}^{O}, \operatorname{do}(X_{S_{A}}^{A})]$
In English: given the original utility function on the (“non-local”) latent variables, we can integrate out the latents to get a new utility function defined only on the (“local”) observation & decision variables. The new utility function yields exactly the same agent behavior as the original.
So observing agent behavior alone cannot possibly let us distinguish preferences on latent variables from preferences on the “local” observation & decision variables.
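To make the equivalence concrete, here is a minimal numerical sketch on a toy model. Everything in it is an assumption for illustration and not part of the argument above: a binary latent `L`, an observation `O1` seen before acting, an action `A`, a later observation `O2` that depends on `L` and `A`, and arbitrary probability and utility tables. It computes the expected utility once with the original `u` over the latent and once with the integrated-out `u'`, and checks that the two agree for every observation and action.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model (all tables are illustrative assumptions).
p_L = np.array([0.3, 0.7])                    # P(L)
p_O1_given_L = np.array([[0.8, 0.2],          # P(O1 | L), rows indexed by L
                         [0.1, 0.9]])
# P(O2 | L, A), indexed [L, A, O2]; each row is a random distribution.
p_O2_given_LA = rng.dirichlet(np.ones(2), size=(2, 2))
u = rng.normal(size=(2, 2, 2))                # u(L, A, O2), arbitrary values


def joint(a):
    """P(L, O1, O2 | do(A=a)), indexed [L, O1, O2]."""
    return (p_L[:, None, None]
            * p_O1_given_L[:, :, None]
            * p_O2_given_LA[:, a, None, :])


def original_value(o1, a):
    """E[u(X) | O1=o1, do(A=a)], summing over the latent L and over O2."""
    j = joint(a)[:, o1, :]                    # indexed [L, O2]
    return (j * u[:, a, :]).sum() / j.sum()


def local_utility(o1, a, o2):
    """u'(o1, a, o2) = E_L[u | O1=o1, A=a, O2=o2], with the latent integrated out."""
    w = joint(a)[:, o1, o2]                   # unnormalized posterior over L
    return (w * u[:, a, o2]).sum() / w.sum()


def local_value(o1, a):
    """E[u'(X^O, X^A) | O1=o1, do(A=a)], which never references L directly."""
    j = joint(a)[:, o1, :]
    p_o2 = j.sum(axis=0) / j.sum()            # P(O2 | O1=o1, do(A=a))
    return sum(p_o2[o2] * local_utility(o1, a, o2) for o2 in range(2))


for o1 in range(2):
    orig = [original_value(o1, a) for a in range(2)]
    local = [local_value(o1, a) for a in range(2)]
    assert np.allclose(orig, local)
    print(f"O1={o1}: best action (original u) = {int(np.argmax(orig))}, "
          f"best action (local u') = {int(np.argmax(local))}")
```

Because the two expected values match for every action, any policy that is optimal for `u` is also optimal for `u'`, which is why behavior in this toy setting cannot tell the two utility functions apart.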