Hm. Initially I would say that this approach tackles extremal Goodhart on epistemic uncertainty, but not on model/value uncertainty (and therefore this is a blueprint for an AI that thinks “the human’s utility function” is some real thing, and goes out and becomes arbitrarily certain that it’s found it). But maybe there’s something to do here.
Having model uncertainty is sort of like having epistemic uncertainty that you can’t resolve (though notably for epistemic uncertainty the aggregation function is fixed to be the weighted average, while for non-epistemic uncertainty it can be nonlinear). So I’d be interested to see you using mildly-optimizing agents as a testbed for what happens when your process for learning what to value can’t supply infinite independent bits. Especially from human feedback—can you think of ways for the AI to be able to query humans for feedback, and yet not be able to push uncertainty to zero via infinite feedback? Does this create perverse incentives? What do more/less principled ways of doing this look like?
I’ve probably misunderstood your comment, but I think this post already does most of what you are suggesting (except for the very last bit about including human feedback)? It doesn’t assume the human’s utility function is some real thing that it will update toward; it has a fixed distribution over utility throughout deployment. There’s no mechanism for updating that distribution, so it can’t become arbitrarily certain about the utility function.
And that distribution ϕ isn’t treated like epistemic uncertainty; it’s used to find a worst-case lower bound on utility.
My bad, I commented without reading thoroughly and was wrong—I think I hallucinated that you were setting ϕ as a function of the toy data, rather than generating the toy data to visualize ϕ. Whoops!
Other things that came up when I actually read the whole thing:
It does seem like you’re always just going to get a rescaled subset of the prior distribution. My heuristic argument goes like this: if your output distribution isn’t just a rescaled subset of the prior, then you can freely increase your score (without changing q) by moving probability mass from the lowest-score policy to higher-scoring policies that haven’t yet reached Lmax. This doesn’t change the EV of L, because it’s precisely γL that’s the conserved probability mass, not L alone.
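To make "rescaled subset of the prior" concrete, here is a minimal sketch of a plain quantilizer over a finite set of policies. It is only a stand-in: the post's actual objective is a maximin over ϕ, which this does not implement. The names `quantilize`, `prior` and `scores` are made up for illustration, and q is assumed to be the fraction of prior mass that is kept.

```python
def quantilize(prior, scores, q):
    """Keep the top-q prior mass by score and rescale it by 1/q.

    prior  -- prior probability of each policy (sums to 1)
    scores -- score of each policy (higher is better)
    q      -- fraction of prior mass to keep, 0 < q <= 1 (assumed meaning of q)
    """
    # Visit policies from highest score to lowest.
    order = sorted(range(len(prior)), key=lambda i: scores[i], reverse=True)
    out = [0.0] * len(prior)
    remaining = q  # prior mass still to be allocated
    for i in order:
        take = min(prior[i], remaining)
        out[i] = take / q  # no policy ever exceeds prior[i] / q
        remaining -= take
        if remaining <= 0:
            break
    return out


prior = [0.4, 0.3, 0.2, 0.1]
scores = [1.0, 2.0, 3.0, 4.0]
dist = quantilize(prior, scores, q=0.5)
```

With these toy numbers `dist` comes out as approximately `[0.0, 0.4, 0.4, 0.2]`: the two best policies sit exactly at prior/q, the boundary policy is only partly filled, and the worst policy gets nothing. Shifting mass from the boundary policy to a better one is impossible here because the better ones are already at their cap, which is the shape the heuristic argument predicts.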
I still don’t understand the section on planning. There’s the classic quantilizer problem that policies made of individually likely actions may be unlikely in aggregate. If you have a true factorization of the space of policies this can be sidestepped, but doesn’t the same problem show up rephrased as “true, non-approximate, factorizations of the space of policies are rare”?
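The classic problem can be put in numbers. The sketch below assumes a toy setting that is not taken from the post: a policy is a sequence of `horizon` independent steps, and each step is quantilized separately at level `q_step`. Both function names are hypothetical. It only illustrates the classic issue and does not model the post's planning section.

```python
def worst_case_ratio(q_step, horizon):
    """Largest possible ratio of a policy's probability to its prior probability.

    Each step can be up to 1/q_step times more likely than under the prior,
    and with independent steps these factors multiply.
    """
    return (1.0 / q_step) ** horizon


def per_step_level(q_joint, horizon):
    """Per-step level whose product over `horizon` steps equals q_joint."""
    return q_joint ** (1.0 / horizon)


# Ten steps, each restricted to its top 10% of actions.
ratio = worst_case_ratio(0.1, 10)

# Per-step level needed to keep the joint ratio at 1/0.1 over ten steps.
level = per_step_level(0.1, 10)
```

Here `ratio` is 1e10, so a policy built from individually plausible actions can be ten billion times more likely than the joint prior would make it. Keeping the joint bound at 1/q requires per-step levels that multiply to q, which for ten steps and q = 0.1 is about 0.79 per step, a much milder restriction at each step.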
On it always being a rescaled subset: Nice! This explains the results of my empirical experiments. Jessica made a similar argument for why quantilizers are optimal, but I hadn’t gotten around to trying to adapt it to this slightly different situation. It makes sense now that the maximin distribution is like quantilizing against the value lower bound, except that the value lower bound changes if you change the maximin distribution. This explains why some of the distributions are exactly quantilizers and some are not: it depends on whether that value lower bound drops lower than the start of the policy distribution.
On planning: Yeah it might be hard to factorize the final policy distribution. But I think it will be easy to approximately factorize the prior in lots of different ways. And I’m hopeful that we can prove that some approximate factorizations maintain the same q value, or maybe only have a small impact on the q value. Haven’t done any work on this yet.
If it turns out we need near-exact factorizations, we might still be able to use sampling techniques like rejection sampling to correct an approximate sampling distribution. We have easy access to the correct density of any sample we have generated (just prior/q), so we only need an approximate distribution that produces high-value samples more often, which seems straightforward.
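A rough sketch of that correction step, under assumptions that go beyond the comment: policies are a small finite set, the final distribution is prior/q on a kept subset and zero elsewhere, and the approximate sampler's probabilities are known. The function `rejection_sample` and all the numbers are invented for illustration.

```python
import random


def rejection_sample(target, proposal, rng, n):
    """Draw n indices distributed as `target` using draws from `proposal`.

    target, proposal -- probability lists over the same indices;
                        proposal[i] must be positive wherever target[i] is.
    """
    support = [i for i, t in enumerate(target) if t > 0]
    # Envelope constant m with target[i] <= m * proposal[i] on the support.
    m = max(target[i] / proposal[i] for i in support)
    indices = list(range(len(proposal)))
    samples = []
    while len(samples) < n:
        i = rng.choices(indices, weights=proposal)[0]
        # Accept with probability target / (m * proposal).
        if rng.random() < target[i] / (m * proposal[i]):
            samples.append(i)
    return samples


prior = [0.4, 0.3, 0.2, 0.1]
q = 0.3
# Assumed target: prior / q on the kept policies (indices 2 and 3).
target = [0.0, 0.0, prior[2] / q, prior[3] / q]
# Assumed approximate sampler that over-weights high-value policies.
proposal = [0.05, 0.15, 0.3, 0.5]

samples = rejection_sample(target, proposal, random.Random(0), 20000)
freq = [samples.count(i) / len(samples) for i in range(len(prior))]
```

The accepted samples follow the target regardless of how the proposal is skewed; a better proposal only raises the acceptance rate, which is 1/m (here about 45%). The catch is that the envelope constant m has to be known or bounded, which is easy in this toy case and may not be in a real policy space.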
Especially from human feedback—can you think of ways for the AI to be able to query humans for feedback, and yet not be able to push uncertainty to zero via infinite feedback? Does this create perverse incentives? What do more/less principled ways of doing this look like?
My intuition right now is that in the infinite feedback case, it would be aligned, but not corrigible because we can specify everything exactly.
The problem is that if it thinks of the ground truth of morality as some fact that’s out there in the world that its supervisory signal has causal access to, it will figure out what corresponds to its “ground truth” in some accurate causal model of the world, and then it will try to optimize that directly.
E.g. you train the AI to get good ratings from humans, but the plan that actually gets maximum rating is one that interferes with the rating-process itself (e.g. by deceiving humans, or hacking the computer).
Of course there are some goals about the world that would be good for an AI to learn—we just don’t know how to write down how to learn them.
The problem is that if it thinks of the ground truth of morality as some fact that’s out there in the world that its supervisory signal has causal access to, it will figure out what corresponds to its “ground truth” in some accurate causal model of the world, and then it will try to optimize that directly.
My key point is that the ground truth may not actually exist here, in which case morals are only definable relative to what an agent wants (a position known as moral anti-realism).
This does introduce a complication, in that manipulation would be effectively impossible to avoid, since it’s effectively arbitrarily controlled. That is dangerous, because deceiving a person and helping a person blur together morally so easily, if they are not outright equivalent. If the infinite limit is not actually aligned, this is a serious problem.
Why does the infinite limit of value learning matter if we’re doing soft optimization against a fixed utility distribution?
Sorry, I didn’t realize this and I was responding independently to Charlie Steiner.