Great post!
If you haven’t seen it yet, it might be worth taking a look at the discussion on Ontology Identification on Arbital.
One nitpicky comment:
For example, one response to the issues with RLHF is “Ah, we need a better model of human irrationality”. In other words, we should do a lot of cognitive science to figure out precisely in which ways humans give feedback that doesn’t reflect their true preferences. Then we can back out true human preferences from irrational human feedback. But even if we got this to work, the true human preferences would have the wrong type signature, and we’d need to find a good ontology translation again.
Well, technically, since we’re observing $\pi_{\text{human}} = d_{\text{human}}(U_{\text{human}})$ (where $d$ is the planner) anyways, a model of irrationality could just actually “include the translation” to $U_{\text{AI}}$. Specifically, we could also factor it as: $d'_{\text{human/AI}}(U_{\text{AI}}) = d_{\text{human}}(U_{\text{AI}} \circ \tau^{-1}) = \pi_{\text{human}}$. So there’s a trivial sense in which “better models of human irrationality” are sufficient. :^)
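A minimal sketch of why that factoring reproduces the observed policy, assuming $\tau$ is an invertible translation from the AI’s ontology to the human’s, so that $U_{\text{AI}} = U_{\text{human}} \circ \tau$ (if the conventions for $\tau$ run the other way, swap $\tau$ and $\tau^{-1}$):

$$d'_{\text{human/AI}}(U_{\text{AI}}) = d_{\text{human}}(U_{\text{AI}} \circ \tau^{-1}) = d_{\text{human}}(U_{\text{human}} \circ \tau \circ \tau^{-1}) = d_{\text{human}}(U_{\text{human}}) = \pi_{\text{human}}.$$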
For example, if the human doesn’t know that the AI has stolen the diamond from the vault in the ELK problem, and gives a high reward to a hacked video feed, we can say this is an ontology identification problem, because the human doesn’t understand what the AI knows. But it’s “equally” valid to imagine that the human has a reward function grounded in the actual world and their planner is just biased to give high reward when the video feed shows a diamond!
That being said, (IMO) a lot of the hope for “better models of human irrationality” is to find conditions under which humans are less irrational or more informative. And I’d guess that you need to do AI-assisted cognitive science to ground the human biases in the AI’s world model. So I’m not sure there’s much difference in practice?
Well, technically, since we’re observing $\pi_{\text{human}} = d_{\text{human}}(U_{\text{human}})$ (where $d$ is the planner) anyways, a model of irrationality could just actually “include the translation” to $U_{\text{AI}}$. Specifically, we could also factor it as: $d'_{\text{human/AI}}(U_{\text{AI}}) = d_{\text{human}}(U_{\text{AI}} \circ \tau^{-1}) = \pi_{\text{human}}$. So there’s a trivial sense in which “better models of human irrationality” are sufficient. :^)
Ah yes, the trivial model of humans that says “whatever they do, that’s what they want.”
Is there some argument similar to a type signature argument that would rule out such poorly-generalizing approaches, though?
Like, clearly what’s going wrong is that these things have identical input-output types, but the trivial model is learned in a totally different way, so you could just track types through… well, okay, through a complete specification of value learning. I guess such an argument wouldn’t be very useful, though, because it would be too entangled with the specific program used to do value learning, so it wouldn’t be very good at proving that no other type would count as a solution.
Ah yes, the trivial model of humans that says “whatever they do, that’s what they want.”
That’s a different type of trivial solution, but the idea is broadly similar: because the space of reasonable planners is so large, there are lots of ways to slice up the policy into reward + bias. (As Paul puts it, the easy goal inference problem is still hard.)
Is there some argument similar to a type signature argument that would rule out such poorly-generalizing approaches, though?
I don’t think a type signature would help: the problem is that the class of planners/biases and the class of rewards are way too unconstrained. We probably just need to better understand what human planners are, via cognitive science (or do a better translation between AI concepts and human concepts).
That being said, there’s a paper I’ve been putting off writing for two years now, on why the Armstrong/Mindermann no-free-lunch (NFL) theorem is kind of silly and how to get around it; maybe I will write it at some point :)
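To make the underdetermination concrete, here is a toy sketch of my own (the action labels and numbers are hypothetical, not from the thread): the same observed policy is reproduced exactly both by a “whatever they do is what they want” reward paired with a softmax-rational planner, and by a “true” reward paired with a biased planner.

```python
import numpy as np

# Toy illustration (hypothetical numbers): one observed policy, two very
# different (reward, planner) decompositions that both fit it exactly.

actions = ["approve hacked feed", "approve real diamond", "reject"]
observed_policy = np.array([0.80, 0.15, 0.05])   # what the human actually does

def boltzmann(scores):
    """Softmax planner: near-rationally optimizes whatever scores it is handed."""
    z = np.exp(scores - scores.max())
    return z / z.sum()

# Decomposition A: "whatever they do, that's what they want".
# Read the reward straight off the behavior; the planner is softmax-rational.
reward_A = np.log(observed_policy)
policy_A = boltzmann(reward_A)

# Decomposition B: a "true" reward grounded in the real world, plus a planner
# bias (e.g. trusting the video feed) chosen to absorb the remaining gap.
reward_B = np.array([0.0, 1.0, 0.2])             # hypothetical true values
bias_B = np.log(observed_policy) - reward_B      # the planner's systematic error
policy_B = boltzmann(reward_B + bias_B)

# Both decompositions reproduce the observed policy exactly, yet they disagree
# about what the human wants; behavior alone cannot tell them apart.
assert np.allclose(policy_A, observed_policy)
assert np.allclose(policy_B, observed_policy)
```

The bias in decomposition B is reverse-engineered from the behavior, which is exactly the worry above: nothing in the observed policy privileges one split into reward + bias over the other.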
I’m less sure exactly how broadly applicable the argument is. For example, I haven’t thought as much about approaches that rely much more on corrigibility and less on having a good model of human values, though my best guess is that something similar still applies. My main goal here was to get across any intuition at all for why we might expect ontology identification to be a central obstacle.
My guess is it applies way less—a lot of the hope is that the basin of corrigibility is “large enough” that it should be obvious what to do, even if you can’t infer the whole human value function.