I meant to assume that away:

“But we’ll assume that her information stays the same while her utility function is being inferred, and she’s not doing anything to get more; perhaps she’s not in a position to.”
In cases where you’re not in a position to get more information about your utility function (e.g. because the humans you’re interacting with don’t know the answer), your behavior won’t depend on whether you think such information would be useful, so someone observing your behavior can’t infer the latter from the former.
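To make that concrete, here is a minimal sketch in Python, with every name hypothetical: two agents that disagree about whether more information about their utility function would be useful, but that have no way to act on that disagreement, produce identical behavior.

```python
# A minimal sketch of the identifiability problem (all names hypothetical):
# two agents differ only in whether they think more information about their
# utility function would be useful, but with no information-gathering action
# available, they behave identically, so behavior alone can't tell them apart.

def choose(actions, utility, info_is_useful):
    if info_is_useful and "gather_info" in actions:
        return "gather_info"              # only matters if the option exists
    return max(actions - {"gather_info"}, key=utility)

actions = {"left", "right"}               # no "gather_info" action available
utility = {"left": 0.3, "right": 0.7}.get

a = choose(actions, utility, info_is_useful=True)
b = choose(actions, utility, info_is_useful=False)
assert a == b == "right"                  # identical observable behavior
```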
Maybe practical cases aren’t like this, but it seems to me that this only has to hold for at least one aspect of the utility function to be a problem.
Paul above seems to think it would be possible to reason from actual behavior to counterfactual behavior anyway, I guess because he’s thinking in terms of modeling the agent as a physical system rather than just as an agent. I’m confused about that, so I haven’t responded, and I don’t claim he’s wrong.
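One way to read that, offered as my own gloss rather than a claim about Paul’s actual argument: a physical model of the agent gives you its policy as a function, which you can then evaluate on counterfactual inputs that never occur in the observed data. A sketch under that assumption (hypothetical names, continuing the example above):

```python
# My own gloss, not a claim about Paul's actual argument (hypothetical names):
# treating the agent's policy as a function (a physical model) lets us query
# counterfactual situations that never occur in the observed data.

def policy(actions, utility, info_is_useful):
    if info_is_useful and "gather_info" in actions:
        return "gather_info"
    return max(actions - {"gather_info"}, key=utility)

utility = {"left": 0.3, "right": 0.7}.get

for info_is_useful in (True, False):
    actual = policy({"left", "right"}, utility, info_is_useful)
    counterfactual = policy({"left", "right", "gather_info"}, utility, info_is_useful)
    print(info_is_useful, actual, counterfactual)

# Both agents act the same in the actual world ("right"), but the
# counterfactual query distinguishes them ("gather_info" vs. "right").
```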
Oh yeah, I agree with Paul’s comment, and it’s saying the same thing as what I’m saying. I didn’t see it because I was reading on the Alignment Forum instead of LessWrong. I’ve moved that comment to the Alignment Forum now.