I’ve probably misunderstood your comment, but I think this post already does most of what you are suggesting (except for the very last bit about including human feedback)? It doesn’t assume the human’s utility function is some real thing that it will update toward; it has a fixed distribution over utility throughout deployment. There’s no mechanism for updating that distribution, so it can’t become arbitrarily certain about the utility function.
And that distribution ϕ isn’t treated as epistemic uncertainty; it’s used to find a worst-case lower bound on utility?
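A minimal sketch of what I mean by that, with a small finite set of candidate utility functions standing in for ϕ and a plain min as the lower bound (both simplifications for illustration, not the post’s actual construction):

```python
import numpy as np

# Illustrative stand-in for ϕ: a few candidate utility functions over a handful of
# policies. The plain min below is a simplification, not the post's exact construction.
rng = np.random.default_rng(0)
utility_table = rng.uniform(size=(3, 5))     # utility_table[u, p] = utility u of policy p

def worst_case_value(policy_dist):
    """Lower bound on value: the minimum, over the candidate utility functions,
    of expected utility under the given distribution over policies."""
    return float((utility_table @ policy_dist).min())

print(worst_case_value(np.full(5, 0.2)))     # worst-case EV of the uniform policy distribution
```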
My bad, I commented without reading thoroughly and was wrong—I think I hallucinated that you were setting ϕ as a function of the toy data, rather than generating the toy data to visualize ϕ. Whoops!
Other things that came up when I actually read the whole thing:
It does seem like you’re always just going to get a rescaled subset of the prior distribution. My heuristic argument goes like this: if your output distribution isn’t just a rescaled subset of the prior, then you can freely increase your score (without changing q) by moving probability mass from the lowest-scoring policy to higher-scoring policies that haven’t yet reached Lmax. This doesn’t change the EV of L, because it’s precisely γL that’s the conserved probability mass, not L alone.
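A toy check of that mass-moving step, under my reading that L is the likelihood ratio output(π)/prior(π) and Lmax is the cap on it (skip this if that’s not what L means in the post):

```python
import numpy as np

# Toy check of the mass-moving step. My assumption here (not necessarily the post's
# notation): L(pi) = output(pi) / prior(pi), and the constraint is L(pi) <= L_max everywhere.
prior = np.array([0.4, 0.3, 0.2, 0.1])
score = np.array([0.1, 0.5, 0.8, 0.9])    # higher is better
L_max = 2.0

# A feasible output distribution that is NOT a rescaled subset of the prior.
output = np.array([0.35, 0.25, 0.25, 0.15])
assert np.all(output / prior <= L_max + 1e-12)

# Move mass from the lowest-scoring supported policy (index 0) to a higher-scoring
# policy whose ratio hasn't hit L_max yet (index 3: 0.15 / 0.1 = 1.5 < 2.0).
eps = 0.05
improved = output.copy()
improved[0] -= eps
improved[3] += eps
assert np.all(improved / prior <= L_max + 1e-12)

print(output @ score, improved @ score)      # expected score rises: 0.495 -> 0.535
print(prior @ (output / prior), prior @ (improved / prior))   # E_prior[L] stays 1: it's gamma*L that's conserved
```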
I still don’t understand the section on planning—there’s the classic quantilizer problem that policies made of individually likely actions may be unlikely in aggregate. If you have a true factorization of the space of policies this can be sidestepped, but doesn’t the same problem show up rephrased as “true, non-approximate, factorizations of the space of policies are rare?”
On it always being a rescaled subset: Nice! This explains the results of my empirical experiments. Jessica made a similar argument for why quantilizers are optimal, but I hadn’t gotten around to trying to adapt it to this slightly different situation. It makes sense now that the maximin distribution is like quantilizing against the value lower bound, except that the value lower bound changes if you change the maximin distribution. This explains why some of the distributions are exactly quantilizers and some are not: it depends on whether that value lower bound drops below the start of the policy distribution.
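For concreteness, the quantilizer shape I’m comparing against is roughly this (a standard sketch over a finite policy set, not the code from my experiments): keep the top-q fraction of prior mass by value and rescale it by 1/q.

```python
import numpy as np

def quantilize(prior, value, q):
    """q-quantilizer over a finite policy set: the distribution proportional to the
    prior, restricted to the top-q fraction of prior mass by value (the boundary
    policy gets whatever mass remains), rescaled by 1/q."""
    order = np.argsort(-value)       # policies from highest to lowest value
    out = np.zeros_like(prior, dtype=float)
    remaining = q                    # prior mass still to be included
    for i in order:
        take = min(prior[i], remaining)
        out[i] = take
        remaining -= take
        if remaining <= 0:
            break
    return out / q                   # rescale so the kept slice sums to 1

prior = np.array([0.4, 0.3, 0.2, 0.1])
value = np.array([0.1, 0.5, 0.8, 0.9])
print(quantilize(prior, value, q=0.25))   # -> [0.  0.  0.6 0.4], a rescaled top slice of the prior
```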
On planning: Yeah it might be hard to factorize the final policy distribution. But I think it will be easy to approximately factorize the prior in lots of different ways. And I’m hopeful that we can prove that some approximate factorizations maintain the same q value, or maybe only have a small impact on the q value. Haven’t done any work on this yet.
If it turns out we need near-exact factorizations, we might still be able to use sampling techniques like rejection sampling to correct an approximate sampling distribution. We have easy access to the correct density of any sample we generate (just prior/q), so we only need an approximate proposal distribution that produces high-value samples more often, which seems straightforward.
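Roughly the correction I have in mind, sketched under the assumptions that the target density on the kept policies is just prior/q and that we have some value-biased proposal to draw from (the prior, values, and proposal here are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of rejection sampling against a target that is prior/q on the kept policies.
# The prior, values, and the value-biased proposal are all made up for illustration.
prior = np.array([0.4, 0.3, 0.2, 0.1])
value = np.array([0.1, 0.5, 0.8, 0.9])
q = 0.25
target = np.array([0.0, 0.0, 0.6, 0.4])   # the quantilized prior (q = 0.25): prior/q on the
                                          # kept policies, boundary policy truncated

# Hypothetical approximate proposal that over-samples high-value policies.
proposal = np.exp(4 * value)
proposal /= proposal.sum()

M = (target / proposal).max()             # envelope constant so that target <= M * proposal

def sample_target(n):
    """Draw n exact samples from `target` using `proposal` and rejection."""
    samples = []
    while len(samples) < n:
        i = rng.choice(len(prior), p=proposal)
        if rng.uniform() < target[i] / (M * proposal[i]):
            samples.append(i)
    return np.array(samples)

draws = sample_target(10_000)
print(np.bincount(draws, minlength=len(prior)) / len(draws))   # approximately [0, 0, 0.6, 0.4]
```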