Really fascinating problem! I like how your examples make me want to say “Well, the AI just has to ask about… wait a minute, that’s the problem!”. Taken from another point of view, you’re asking how and in which context an AI can reveal our utility functions, which means revealing our latent variables.
This problem also feels related to our discussion of the locality of goals. Here you assume a non-local goal (as most human ones are), and I think that better knowledge of how to detect/measure locality from behavior, plus assumptions about the agent-model, might help with the pointers problem.
Setting up the “locality of goals” concept: let’s split the variables in the world model into observables $X_O$, action variables $X_A$, and latent variables $X_L$. Note that there may be multiple stages of observations and actions, so we’ll only have subsets $S_O$ and $S_A$ of the observation/action variables in the decision problem. The Bayesian utility maximizer then chooses $X_A^{S_A}$ to maximize
$$E\left[u(X) \mid X_O^{S_O},\, do\!\left(X_A^{S_A}\right)\right]$$
… but, by the tower property of conditional expectation (take the inner expectation over the latents $X_L$ given all the observation and action variables, then the outer expectation given only what’s in the decision problem), we can rewrite that as
$$E\left[\,E_{X_L}\!\left[u(X) \mid X_O, X_A\right] \,\middle|\, X_O^{S_O},\, do\!\left(X_A^{S_A}\right)\right]$$
Defining a new utility function $u'(X_O, X_A) = E_{X_L}\!\left[u(X) \mid X_O, X_A\right]$, the original problem is equivalent to maximizing:

$$E\left[u'(X_O, X_A) \mid X_O^{S_O},\, do\!\left(X_A^{S_A}\right)\right]$$
In English: given the original utility function on the (“non-local”) latent variables, we can integrate out the latents to get a new utility function defined only on the (“local”) observation & decision variables. The new utility function yields exactly the same agent behavior as the original.
So observing agent behavior alone cannot possibly let us distinguish preferences on latent variables from preferences on the “local” observation & decision variables.
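To make the equivalence concrete, here’s a minimal Python sketch, a toy construction of my own rather than anything from the post: one latent, one observable, and one action variable, with made-up distributions and a made-up utility. The original utility $u$ depends on the latent; the “localized” $u'$ is precomputed by integrating the latent out, and the two resulting agents pick identical actions for every observation.

```python
# Toy model (all numbers and names are made up for illustration):
# one latent variable x_l, one observable x_o, one action x_a.
# The action affects nothing but the utility, so P(x_l | x_o, do(x_a)) = P(x_l | x_o).

P_L = {0: 0.3, 1: 0.7}                  # prior over the latent variable
P_O_GIVEN_L = {0: {0: 0.8, 1: 0.2},     # P(x_o | x_l)
               1: {0: 0.1, 1: 0.9}}
OBSERVATIONS = [0, 1]
ACTIONS = [0, 1]

def u(x_o, x_a, x_l):
    """Original ('non-local') utility: depends on the latent variable."""
    return 3.0 * (x_a == x_l) + 1.0 * (x_a == x_o)

def posterior_l(x_o):
    """P(x_l | x_o) by Bayes' rule."""
    joint = {x_l: P_L[x_l] * P_O_GIVEN_L[x_l][x_o] for x_l in P_L}
    z = sum(joint.values())
    return {x_l: p / z for x_l, p in joint.items()}

# Precompute u'(x_o, x_a) = E_{X_L}[u(X) | x_o, x_a] as a plain lookup table.
# Once built, it makes no reference to the latent variable at all.
U_PRIME = {(x_o, x_a): sum(posterior_l(x_o)[x_l] * u(x_o, x_a, x_l) for x_l in P_L)
           for x_o in OBSERVATIONS for x_a in ACTIONS}

def best_action_original(x_o):
    """Maximize E[u(X) | x_o, do(x_a)] using the full model, latent included."""
    return max(ACTIONS, key=lambda x_a: sum(
        posterior_l(x_o)[x_l] * u(x_o, x_a, x_l) for x_l in P_L))

def best_action_local(x_o):
    """Maximize the localized utility u'(x_o, x_a); no latent anywhere."""
    return max(ACTIONS, key=lambda x_a: U_PRIME[(x_o, x_a)])

for x_o in OBSERVATIONS:
    assert best_action_original(x_o) == best_action_local(x_o)
    print(f"observation {x_o}: both agents choose action {best_action_local(x_o)}")
```

Once the table for $u'$ is built, the second agent never touches the latent at decision time, which is exactly why behavior alone can’t tell the two apart.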