This post states a subproblem of AI alignment which the author calls “the pointers problem”. The user is regarded as an expected utility maximizer, operating according to causal decision theory. Importantly, the utility function depends on latent (unobserved) variables in the causal network. The AI operates according to a different, superior, model of the world. The problem is then, how do we translate the utility function from the user’s model to the AI’s model? This is very similar to the “ontological crisis” problem described by De Blanc, only De Blanc uses POMDPs instead of causal networks, and frames it in terms of a single agent changing their ontology, rather than translation from user to AI.
The question the author asks here is important, but not that novel (the author himself cites Demski as prior work). Perhaps the use of causal networks is a better angle, but this post doesn’t do much to show it. Even so, having another exposition of an important topic, with different points of emphasis, will probably benefit many readers.
The primary aspect missing from the discussion in the post, in my opinion, is the nature of the user as a learning agent. The user doesn’t have a fixed world-model: or, if they do, then this model is best seen as a prior. This observation hints at the resolution of the apparent paradox wherein the utility function is defined in terms of a wrong model. But it still requires us to explain how the utility is defined s.t. it is applicable to every hypothesis in the prior.
(What follows is no longer a “review” per se, inasmuch as a summary of my own thoughts on the topic.)
Here is a formal model of how a utility function for learning agents can work, when it depends on latent variables.
Fix A a set of actions and O a set of observations. We start with an ontological model which is a crisp infra-POMPD. That is, there is a set of states Sont, an initial state s0ont∈Sont, a transition infra-kernel Tont:Sont×A→□(Sont×O) and a reward function r:Sont→R. Here, □X stands for closed convex sets of probability distributions on X. In other words, this a POMDP with an underspecified transition kernel.
We then build a prior which consists of refinements of the ontological model. That is, each hypothesis in the prior is an infra-POMDP with state space S, initial state s0∈S, transition infra-kernel T:S×A→□(S×O) and an interpretation mappingι:S→Sont which is a morphism of infra-POMDPs (i.e.ι(s0)=s0ont and the obvious diagram of transition infra-kernels commutes). The reward function on S is just the composition r∘ι. Notice that while the ontological model must be an infra-POMDP to get a non-degenerate learning agent (moreover, it can be desirable to make it non-dogmatic about observables in some formal sense), the hypotheses in the prior can also be ordinary (Baysian) POMDPs.
Given such a prior plus a time discount function, we can consider the corresponding infra-Bayesian agent (or even just Bayesian agent if we chose all hypothesis to be Bayesian). Such an agent optimizes rewards which depend on latent variables, even though it does not know the correct world-model in advance. It does fit the world to the immutable ontological model (which is necessary to make sense of the latent variables to which the reward function refers), but the ontological model has enough freedom to accommodate many possible worlds.
The next question is then how would we transfer such a utility function from the user to the AI. Here, like noted by Demski, we want the AI to use not just the user’s utility function but also the user’s prior. Because, we want running such an AI to be rational from the subjective perspective of the user. This creates a puzzle: if the AI is using the same prior, and the user behaves nearly-optimally for their own prior (since otherwise how would we even infer the utility function and prior), how can the AI outperform the user?
The answer, I think, is via the AI having different action/observation channels from the user. At first glance this might seem unsatisfactory: we expect the AI to be “smarter”, not just to have better peripherals. However, using Turing RL we can represent the former as a special case of the latter. Specifically, part of the additional peripherals is access to a programmable computer, which effectively gives the AI a richer hypothesis space than the user.
The formalism I outlined here leaves many questions, for example what kind of learning guarantees to expect in the face of possible ambiguities between observationally indistinguishable hypothesis[1]. Nevertheless, I think it creates a convenient framework for studying the question raised in the post. A potential different approach is using infra-Bayesian physicalism, which also describes agents with utility functions that depend on latent variables. However, it is unclear whether it’s reasonable to apply the later to humans.
This post states a subproblem of AI alignment which the author calls “the pointers problem”. The user is regarded as an expected utility maximizer, operating according to causal decision theory. Importantly, the utility function depends on latent (unobserved) variables in the causal network. The AI operates according to a different, superior, model of the world. The problem is then, how do we translate the utility function from the user’s model to the AI’s model? This is very similar to the “ontological crisis” problem described by De Blanc, only De Blanc uses POMDPs instead of causal networks, and frames it in terms of a single agent changing their ontology, rather than translation from user to AI.
The question the author asks here is important, but not that novel (the author himself cites Demski as prior work). Perhaps the use of causal networks is a better angle, but this post doesn’t do much to show it. Even so, having another exposition of an important topic, with different points of emphasis, will probably benefit many readers.
The primary aspect missing from the discussion in the post, in my opinion, is the nature of the user as a learning agent. The user doesn’t have a fixed world-model: or, if they do, then this model is best seen as a prior. This observation hints at the resolution of the apparent paradox wherein the utility function is defined in terms of a wrong model. But it still requires us to explain how the utility is defined s.t. it is applicable to every hypothesis in the prior.
(What follows is no longer a “review” per se, inasmuch as a summary of my own thoughts on the topic.)
Here is a formal model of how a utility function for learning agents can work, when it depends on latent variables.
Fix A a set of actions and O a set of observations. We start with an ontological model which is a crisp infra-POMPD. That is, there is a set of states Sont, an initial state s0ont∈Sont, a transition infra-kernel Tont:Sont×A→□(Sont×O) and a reward function r:Sont→R. Here, □X stands for closed convex sets of probability distributions on X. In other words, this a POMDP with an underspecified transition kernel.
We then build a prior which consists of refinements of the ontological model. That is, each hypothesis in the prior is an infra-POMDP with state space S, initial state s0∈S, transition infra-kernel T:S×A→□(S×O) and an interpretation mapping ι:S→Sont which is a morphism of infra-POMDPs (i.e.ι(s0)=s0ont and the obvious diagram of transition infra-kernels commutes). The reward function on S is just the composition r∘ι. Notice that while the ontological model must be an infra-POMDP to get a non-degenerate learning agent (moreover, it can be desirable to make it non-dogmatic about observables in some formal sense), the hypotheses in the prior can also be ordinary (Baysian) POMDPs.
Given such a prior plus a time discount function, we can consider the corresponding infra-Bayesian agent (or even just Bayesian agent if we chose all hypothesis to be Bayesian). Such an agent optimizes rewards which depend on latent variables, even though it does not know the correct world-model in advance. It does fit the world to the immutable ontological model (which is necessary to make sense of the latent variables to which the reward function refers), but the ontological model has enough freedom to accommodate many possible worlds.
The next question is then how would we transfer such a utility function from the user to the AI. Here, like noted by Demski, we want the AI to use not just the user’s utility function but also the user’s prior. Because, we want running such an AI to be rational from the subjective perspective of the user. This creates a puzzle: if the AI is using the same prior, and the user behaves nearly-optimally for their own prior (since otherwise how would we even infer the utility function and prior), how can the AI outperform the user?
The answer, I think, is via the AI having different action/observation channels from the user. At first glance this might seem unsatisfactory: we expect the AI to be “smarter”, not just to have better peripherals. However, using Turing RL we can represent the former as a special case of the latter. Specifically, part of the additional peripherals is access to a programmable computer, which effectively gives the AI a richer hypothesis space than the user.
The formalism I outlined here leaves many questions, for example what kind of learning guarantees to expect in the face of possible ambiguities between observationally indistinguishable hypothesis[1]. Nevertheless, I think it creates a convenient framework for studying the question raised in the post. A potential different approach is using infra-Bayesian physicalism, which also describes agents with utility functions that depend on latent variables. However, it is unclear whether it’s reasonable to apply the later to humans.
See also my article “RL with imperceptible rewards”