Well, technically, since we’re observing $\pi_{human} = d_{human}(U_{human})$ (where $d$ is the planner) anyways, a model of irrationality could just actually “include the translation” to $U_{AI}$. Specifically, we could also factor it as: $d'_{human/AI}(U_{AI}) = d_{human}(U_{AI} \circ \tau^{-1}) = \pi_{human}$. So there’s a trivial sense in which “better models of human irrationality” are sufficient. :^)
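For concreteness, here’s a minimal toy sketch of that factoring (the state names, the dict-based $\tau$, and the greedy one-step “planner” are all invented for illustration, not from the post): the planner $d'$ just pre-composes with $\tau^{-1}$, so it reproduces from $U_{AI}$ exactly the behavior that $d_{human}$ produces from $U_{human}$.

```python
# Toy sketch of the "trivial factoring" above; everything here (the two tiny
# state spaces, the invertible dict-based translation tau, the greedy one-step
# stand-in for d_human) is an illustrative assumption.
from typing import Dict

HumanState = str   # situations as the human models them
AIState = str      # the same situations as the AI models them

# Hypothetical translation from AI-model states to human-model states,
# assumed invertible for this toy.
tau: Dict[AIState, HumanState] = {"ai_cake": "cake", "ai_mud": "mud"}
tau_inv: Dict[HumanState, AIState] = {v: k for k, v in tau.items()}

U_human: Dict[HumanState, float] = {"cake": 1.0, "mud": 0.0}
# U_AI is U_human pushed through the translation: U_AI = U_human ∘ tau
U_AI: Dict[AIState, float] = {s_ai: U_human[tau[s_ai]] for s_ai in tau}

def d_human(U: Dict[HumanState, float]) -> HumanState:
    """Stand-in planner: pick the human-model state with highest utility."""
    return max(U, key=U.get)

def d_prime(U_ai: Dict[AIState, float]) -> HumanState:
    """The planner that 'includes the translation': d'(U_AI) = d_human(U_AI ∘ tau⁻¹)."""
    return d_human({s_h: U_ai[tau_inv[s_h]] for s_h in tau_inv})

# Identical observed behavior, so behavioral data alone can't tell
# (d_human, U_human) apart from (d_prime, U_AI).
assert d_human(U_human) == d_prime(U_AI) == "cake"
```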
Ah yes, the trivial model of humans that says “whatever they do, that’s what they want.”
Is there some argument similar to a type signature argument that would rule out such poorly-generalizing approaches, though?
Like, clearly what’s going wrong is that these things have identical input-output types—but the trivial model is learned in a totally different way, so you could just track types through… well, okay, through a complete specification of value learning. I guess such an argument wouldn’t be very useful, though, because it would be too entangled with the specific program used to do value learning, so it wouldn’t be very good at proving that no other type would count as a solution.
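Roughly what that observation looks like in code (the type aliases and both toy planners are hypothetical, invented just to make the point): the “reasonable” planner and the trivial replay-the-observed-policy planner inhabit exactly the same type, Reward → Policy, so nothing at the type level separates them; the difference lives in how each one was produced.

```python
# Both planners below have the same type, Planner = Callable[[Reward], Policy].
# All names and the toy 'greedy' behavior are illustrative assumptions.
from typing import Callable, Dict

State = str
Action = str
Reward = Dict[State, float]
Policy = Dict[State, Action]
Planner = Callable[[Reward], Policy]

def greedy_planner(reward: Reward) -> Policy:
    """A 'reasonable' planner: steer every state toward the highest-reward state."""
    best = max(reward, key=reward.get)
    return {s: f"go_to_{best}" for s in reward}

def make_trivial_planner(observed_policy: Policy) -> Planner:
    """The degenerate planner: ignore the reward and replay the observed policy."""
    return lambda reward: observed_policy

# greedy_planner and make_trivial_planner(pi) are indistinguishable by type;
# only the learning process that produced them differs.
```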
Ah yes, the trivial model of humans that says “whatever they do, that’s what they want.”
That’s a different type of trivial solution, but the idea is broadly similar—because the space of reasonable planners is so large, there are lots of ways to slice up the policy into reward + bias. (As Paul puts it, the easy goal inference problem is still hard.)
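A minimal sketch of that underdetermination (toy numbers and action names, not anything from the Armstrong/Mindermann paper): the same observed action is consistent with a sensible reward plus a rational planner, and equally with the flipped reward plus an anti-rational planner.

```python
# Two incompatible (planner, reward) decompositions that fit one observation.
# The rewards and actions are made-up toy values.
from typing import Dict

Action = str
Reward = Dict[Action, float]

def rational(reward: Reward) -> Action:
    """Planner that maximizes the given reward."""
    return max(reward, key=reward.get)

def anti_rational(reward: Reward) -> Action:
    """Planner that minimizes the given reward."""
    return min(reward, key=reward.get)

observed_action = "eat_cake"

R: Reward = {"eat_cake": 1.0, "eat_mud": 0.0}
neg_R: Reward = {a: -r for a, r in R.items()}

assert rational(R) == observed_action           # "they like cake and plan well"
assert anti_rational(neg_R) == observed_action  # "they hate cake and plan perversely"
```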
Is there some argument similar to a type signature argument that would rule out such poorly-generalizing approaches, though?
I don’t think a type signature would help: the problem is that the class of planners/biases and the class of rewards are way too unconstrained. We probably just need to better understand what human planners are, via cognitive science (or do a better translation between AI concepts and Human concepts).
That being said, there’s a paper I’ve been putting off writing for two years now, on why the Armstrong/Mindermann NFL theorem is kind of silly and how to get around it; maybe I’ll write it at some point :)