Really fascinating problem! I like how your examples make me want to say “Well, the AI just has to ask about… wait a minute, that’s the problem!”. Taken from another point of view, you’re asking how and in which context an AI can reveal our utility functions, which means revealing our latent variables.
This problem also feels related to our discussion of the locality of goals. Here you assume a non-local goal (as most human ones are), and I think that better knowledge of how to detect/measure locality from behavior, plus assumptions about the agent-model, might help with the pointers problem.
Setting up the “locality of goals” concept: let’s split the variables in the world model into observables $X_O$, action variables $X_A$, and latent variables $X_L$. Note that there may be multiple stages of observations and actions, so we’ll only have subsets $S_O$ and $S_A$ of the observation/action variables in the decision problem. The Bayesian utility maximizer then chooses $X_A^{S_A}$ to maximize
$$E\left[u(X) \mid X_O^{S_O},\, do\!\left(X_A^{S_A}\right)\right]$$
… but, by the tower property of conditional expectation (take the inner expectation over the latents $X_L$ given all the observation and action variables, then the outer expectation given only what’s in the decision problem), we can rewrite that as
$$E\left[\,E_{X_L}\!\left[u(X) \mid X_O, X_A\right] \,\middle|\, X_O^{S_O},\, do\!\left(X_A^{S_A}\right)\right]$$
Defining a new utility function $u'(X_O, X_A) = E_{X_L}\!\left[u(X) \mid X_O, X_A\right]$, the original problem is equivalent to maximizing:

$$E\left[u'(X_O, X_A) \mid X_O^{S_O},\, do\!\left(X_A^{S_A}\right)\right]$$
In English: given the original utility function on the (“non-local”) latent variables, we can integrate out the latents to get a new utility function defined only on the (“local”) observation & decision variables. The new utility function yields exactly the same agent behavior as the original.
So observing agent behavior alone cannot possibly let us distinguish preferences on latent variables from preferences on the “local” observation & decision variables.
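To make the equivalence concrete, here’s a minimal Python sketch, a toy construction of my own rather than anything from the post: one latent, one observable, and one action variable, with made-up distributions and a made-up utility. The original utility $u$ depends on the latent; the “localized” $u'$ is precomputed by integrating the latent out, and the two resulting agents pick identical actions for every observation.

```python
# Toy model (all numbers and names are made up for illustration):
# one latent variable x_l, one observable x_o, one action x_a.
# The action affects nothing but the utility, so P(x_l | x_o, do(x_a)) = P(x_l | x_o).

P_L = {0: 0.3, 1: 0.7}                  # prior over the latent variable
P_O_GIVEN_L = {0: {0: 0.8, 1: 0.2},     # P(x_o | x_l)
               1: {0: 0.1, 1: 0.9}}
OBSERVATIONS = [0, 1]
ACTIONS = [0, 1]

def u(x_o, x_a, x_l):
    """Original ('non-local') utility: depends on the latent variable."""
    return 3.0 * (x_a == x_l) + 1.0 * (x_a == x_o)

def posterior_l(x_o):
    """P(x_l | x_o) by Bayes' rule."""
    joint = {x_l: P_L[x_l] * P_O_GIVEN_L[x_l][x_o] for x_l in P_L}
    z = sum(joint.values())
    return {x_l: p / z for x_l, p in joint.items()}

# Precompute u'(x_o, x_a) = E_{X_L}[u(X) | x_o, x_a] as a plain lookup table.
# Once built, it makes no reference to the latent variable at all.
U_PRIME = {(x_o, x_a): sum(posterior_l(x_o)[x_l] * u(x_o, x_a, x_l) for x_l in P_L)
           for x_o in OBSERVATIONS for x_a in ACTIONS}

def best_action_original(x_o):
    """Maximize E[u(X) | x_o, do(x_a)] using the full model, latent included."""
    return max(ACTIONS, key=lambda x_a: sum(
        posterior_l(x_o)[x_l] * u(x_o, x_a, x_l) for x_l in P_L))

def best_action_local(x_o):
    """Maximize the localized utility u'(x_o, x_a); no latent anywhere."""
    return max(ACTIONS, key=lambda x_a: U_PRIME[(x_o, x_a)])

for x_o in OBSERVATIONS:
    assert best_action_original(x_o) == best_action_local(x_o)
    print(f"observation {x_o}: both agents choose action {best_action_local(x_o)}")
```

Once the table for $u'$ is built, the second agent never touches the latent at decision time, which is exactly why behavior alone can’t tell the two apart.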