Just a few comments:

In the abstract, one open problem about non-goal-directed agents is when they turn into goal-directed ones; this seems similar to the problem of inner optimizers, at least in the sense that solutions preventing the emergence of inner optimizers could likely also work for non-goal-directed things.
Among the “alternative solutions”, what is under-investigated in my view are attempts to limit capabilities, i.e. to make “bounded agents”. One intuition behind this is that humans are functional only because our goals and utilities are “broken” in a way compatible with our planning and computational bounds. I’m worried that efforts in this direction got bucketed with “boxing”, and boxing acquired a vibe of being uncool. (By making something bounded I mean, for example, making bit-flips costly in a way tied to physics, not naive solutions like “just don’t connect it to the internet”.)
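As a toy illustration of what I mean by a bound living in the agent's computation rather than in a box around it: a minimal sketch where the planner is charged per node it expands and must act once its budget runs out, so arbitrarily deep search is simply unaffordable. (The Node type, the bounded_plan function, and the toy tree here are hypothetical constructions of mine for illustration, not anyone's actual proposal.)

```python
# Hypothetical sketch of a "bounded agent" planner: every node expansion
# costs one unit of an explicit compute budget, so the agent cannot run
# an unboundedly deep search, whatever its goal is.

from dataclasses import dataclass, field

@dataclass
class Node:
    value: float                      # reward collected at this state
    children: list = field(default_factory=list)

def bounded_plan(node: Node, budget: int) -> tuple[float, int]:
    """Depth-first lookahead; returns (best cumulative path reward found,
    remaining budget). Stops expanding when the budget is exhausted."""
    budget -= 1                       # charge for expanding this node
    best = node.value                 # option: stop planning and act here
    for child in node.children:
        if budget <= 0:               # hard computational bound
            break
        v, budget = bounded_plan(child, budget)
        best = max(best, node.value + v)
    return best, budget

# With budget 3 the agent explores only part of the tree before acting.
tree = Node(0.0, [Node(1.0, [Node(5.0)]), Node(2.0)])
print(bounded_plan(tree, budget=3))   # -> (6.0, 0)
```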
I’m particularly happy about your points on the standard claims about expected utility maximization. My vague impression is that too many people on LW read the standard texts, note that there is a persuasive post from Eliezer on the topic, and take the matter as settled.
> In the abstract, one open problem about non-goal-directed agents is when they turn into goal-directed ones; this seems similar to the problem of inner optimizers, at least in the sense that solutions preventing the emergence of inner optimizers could likely also work for non-goal-directed things.
I agree that inner optimizers are a way that non-goal-directed agents can become goal-directed. I don’t see why solutions to inner optimizers would help align non-goal-directed things, though. Can you say more about that?
> Among the “alternative solutions”, what is under-investigated in my view are attempts to limit capabilities, i.e. to make “bounded agents”. One intuition behind this is that humans are functional only because our goals and utilities are “broken” in a way compatible with our planning and computational bounds. I’m worried that efforts in this direction got bucketed with “boxing”, and boxing acquired a vibe of being uncool. (By making something bounded I mean, for example, making bit-flips costly in a way tied to physics, not naive solutions like “just don’t connect it to the internet”.)
I am somewhat worried about such approaches, because it seems hard to make such agents competitive with unaligned agents. But I agree that it seems under-investigated.
> I’m particularly happy about your points on the standard claims about expected utility maximization.
Thanks!