I definitely endorse this as a good explanation of the same pointers problem I was getting at. I particularly like the new framing in terms of a direct conflict between (a) the fact that what we care about can be seen as latent variables in our model, and (b) we value “actual states”, not our estimates—this seems like a new and better way of pointing out the problem (despite being very close in some sense to things Eliezer talked about in the sequences).
What I’d like to add to this post would be the point that we shouldn’t be imposing a solution from the outside. How to deal with this in an aligned way is itself something which depends on the preferences of the agent. I don’t think we can just come up with a general way to find correspondences between models, or something like that, and apply it to solve the problem. (Or at least, we don’t need to.)
One reason is that finding a correspondence and applying it isn’t what the agent should want. In this simple setup, where we suppose a perfect Bayesian agent, it’s reasonable to argue that the AI should just use the agent’s beliefs. That’s what would maximize expected utility from the agent’s perspective, rather than keeping the agent’s utility function but substituting the AI’s beliefs for the agent’s. You mention that the agent may not have a perfect world-model, but this isn’t a good argument from the agent’s perspective—certainly not an argument for just replacing the agent’s model with some AI world-model.
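To spell that argument out, here’s a minimal sketch in expected-utility terms (the notation is mine, assuming a fixed utility function U, the agent’s beliefs P_agent, and the AI’s beliefs P_AI over the outcomes of each available action):

```latex
% Sketch: planning with the agent's beliefs maximizes the agent's own expectation.
% Let a_agent be the action chosen using the agent's beliefs, and a_AI the action
% chosen by keeping U but substituting the AI's beliefs:
a_{\mathrm{agent}} = \arg\max_a \; \mathbb{E}_{s \sim P_{\mathrm{agent}}(\cdot \mid a)}\!\left[U(s)\right],
\qquad
a_{\mathrm{AI}} = \arg\max_a \; \mathbb{E}_{s \sim P_{\mathrm{AI}}(\cdot \mid a)}\!\left[U(s)\right].
% By definition of the argmax, the agent's own expected utility satisfies
\mathbb{E}_{s \sim P_{\mathrm{agent}}(\cdot \mid a_{\mathrm{agent}})}\!\left[U(s)\right]
\;\ge\;
\mathbb{E}_{s \sim P_{\mathrm{agent}}(\cdot \mid a_{\mathrm{AI}})}\!\left[U(s)\right].
```

Of course this only formalizes the subjective claim, that judged by the agent's own lights nothing beats planning with the agent's beliefs; it says nothing about which set of beliefs is actually more accurate.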
This can be a real alignment problem for the agent (not just a mistake made by an overly dogmatic agent): if the AI believes that the moon is made of blue cheese, but the agent doesn’t trust that belief, then the AI can make plans which the agent doesn’t trust even if the utility function is perfect.
And if the agent does trust the AI’s machine-learning-based model, then an AI which used the agent’s prior would also trust the machine-learning model. So, nothing is lost by designing the AI to use the agent’s prior in addition to the agent’s utility function.
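As a toy illustration of the “nothing is lost” point (the numbers, hypothesis names, and two-hypothesis setup are all made up for the sketch): if the agent’s prior gives real weight to “the ML model is reliable”, then an AI planning with the agent’s prior ends up deferring to the ML model anyway once the evidence supports it.

```python
# Toy illustration: an AI that uses the agent's prior still ends up trusting
# a good ML model, because the agent's own Bayesian updating defers to it.
# All numbers and hypothesis names are illustrative.

ml_model_pred = 0.9   # ML model: P(outcome = 1) on each observation
naive_pred    = 0.5   # agent's naive default model: P(outcome = 1)

# Agent's prior over "the ML model is right" vs. "the naive model is right"
prior = {"ml_model_is_right": 0.3, "naive_model_is_right": 0.7}

observations = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]  # data the ML model predicts well

def likelihood(p_one, obs):
    """Probability of one binary observation under a Bernoulli(p_one) model."""
    return p_one if obs == 1 else 1.0 - p_one

# Bayesian update of the agent's prior on the observed data
posterior = dict(prior)
for obs in observations:
    posterior["ml_model_is_right"]    *= likelihood(ml_model_pred, obs)
    posterior["naive_model_is_right"] *= likelihood(naive_pred, obs)
total = sum(posterior.values())
posterior = {h: w / total for h, w in posterior.items()}

# Posterior predictive: what an AI planning with the *agent's* beliefs expects next
predictive = (posterior["ml_model_is_right"] * ml_model_pred
              + posterior["naive_model_is_right"] * naive_pred)

print(posterior)   # weight shifts heavily onto "ml_model_is_right"
print(predictive)  # close to ml_model_pred: the agent's prior now defers to the ML model
```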
So this is an argument that prior-learning is a part of alignment just as much as value-learning.
We don’t usually think this way because when it comes to humans, well, it sounds like a terrible idea. Human beliefs—as we encounter them in the wild—are radically broken and irrational, and inadequate to the task. I think that’s why I got a lot of push-back on my post about this:
I mean, I REALLY don’t want that or anything like that.
- jbash
But I think normativity gives us a different way of thinking about this. We don’t want the AI to use “the human prior” in the sense of some prior we can extract from human behavior, or extract from the brain, or whatever. Instead, what we want to use is “the human prior” in the normative sense—the prior humans reflectively endorse.
This gives us a path forward on the “impossible” cases where humans believe in ghosts, etc. It’s not as if humans don’t have experience dealing with things of value which turn out not to be a part of the real world. We’re constantly forming and reforming ontologies. The AI should be trying to learn how we deal with it—again, not quite in a descriptive sense of how humans actually deal with it, but rather in the normative sense of how we endorse dealing with it, so that it deals with it in ways we trust and prefer.
This makes a lot of sense.
I had been weakly leaning towards the idea that a solution to the pointers problem should be a solution to deferral—i.e. it tells us when the agent defers to the AI’s world model, and what mapping it uses to translate AI-variables to agent-variables. This makes me lean more in that direction.
What I’d like to add to this post would be the point that we shouldn’t be imposing a solution from the outside. How to deal with this in an aligned way is itself something which depends on the preferences of the agent. I don’t think we can just come up with a general way to find correspondences between models, or something like that, and apply it to solve the problem. (Or at least, we don’t need to.)
I see a couple different claims mixed together here:
1. Solving the metaphilosophical problem of how we “should” handle this is sufficient and/or necessary in its own right.
2. There probably isn’t a general way to find correspondences between models, so we need to operate at the meta-level.
The main thing I disagree with is the idea that there probably isn’t a general way to find correspondences between models. There are clearly cases where correspondence fails outright (like the ghosts example), but I think the problem is probably solvable allowing for error-cases (by which I mean cases where the correspondence throws an error, not cases in which the correspondence returns an incorrect result). Furthermore, assuming that natural abstractions work the way I think they do, I think the problem is solvable in practice with relatively few error cases and potentially even using “prosaic” AI world-models. It’s the sort of thing which would dramatically improve the success chances of alignment by default.
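To make “error-cases” concrete, here’s roughly the kind of interface I have in mind (a sketch only; the variable names and the particular mapping are illustrative, not anything from the post): a partial translation from AI-model variables to agent-model variables which either succeeds or throws, rather than silently returning a wrong answer.

```python
# Sketch of a correspondence between an AI's world-model variables and an
# agent's world-model variables, with explicit error-cases rather than
# silent mistranslation. Names and mappings are illustrative only.

class NoCorrespondenceError(Exception):
    """Raised when an AI-model variable has no counterpart in the agent's model."""

# Hypothetical mapping from AI latent variables to agent latent variables.
# Entries set to None are explicit error-cases where the agent's model has
# no matching variable (the ghosts example would be the reverse failure).
AI_TO_AGENT = {
    "h2o_in_cup": "water_in_cup",             # shared natural abstraction
    "mean_kinetic_energy": "temperature",     # different descriptions, same latent
    "lunar_regolith_composition": None,       # no counterpart in the agent's model
}

def translate(ai_variable: str) -> str:
    """Translate an AI-model variable into the agent-model variable it points to.

    Throws instead of guessing: an error-case is recoverable (ask the agent,
    defer, fall back), whereas a silently wrong translation is not.
    """
    if ai_variable not in AI_TO_AGENT:
        raise NoCorrespondenceError(f"unknown AI-model variable: {ai_variable}")
    agent_variable = AI_TO_AGENT[ai_variable]
    if agent_variable is None:
        raise NoCorrespondenceError(f"no agent-model counterpart for: {ai_variable}")
    return agent_variable

# Usage: evaluate the agent's utility function on translated variables, and
# invoke some deferral policy whenever translation fails.
for var in ["h2o_in_cup", "lunar_regolith_composition"]:
    try:
        print(var, "->", translate(var))
    except NoCorrespondenceError as err:
        print("error-case:", err)
```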
I absolutely do agree that we still need the metaphilosophical stuff for a first-best solution. In particular, there is not an obviously-correct way to handle the correspondence error-cases, and of course anything else in the whole setup can also be close-but-not-exactly-right. I do think that combining a solution to the pointers problem with something like the communication prior strategy, plus some obvious tweaks like partially-ordered preferences and some model of logical uncertainty, would probably be enough to land us in the basin of convergence (assuming the starting model was decent), but even then I’d prefer metaphilosophical tools to be confident that something like that would work.