Rohin Shah comments on The Value Definition Problem

Rohin Shah 25 Nov 2019 17:25 UTC
LW: 3 AF: 3
To me, this reads like, ‘we have a particular AI, what should we try to get it to do’
Hmm, I definitely didn’t intend it that way—I’m basically always talking about how to build AI systems, and I’d hope my readers see it that way too. But in any case, adding three words isn’t a big deal, I’ll change that.
(Though I think it is “what should we get our AI system to try to do”, as opposed to “what should we try to get our AI system to do”, right? The former is intent alignment, the latter is not.)
even if you strongly favour indirect approaches you will still have to make some decisions about the nature of the delegation
In some abstract sense, certainly. But it could be “I’ll take no action; whatever future humanity decides on will be what happens”. This is in some sense a decision about the nature of the delegation, but not a huge one. (You could also imagine believing that delegating will be fine for a wide variety of delegation procedures, and so you aren’t too worried which one gets used.)
For example, perhaps we solve intent alignment in a value-neutral way (that is, the resulting AI system tries to figure out the values of its operator and then satisfy them, and can do so for most operators). Then every human gets an intent-aligned AGI, which leads to a post-scarcity world; all of the future humans then figure out what they as a society care about (the philosophical labor), and that is what gets optimized.
Of course, the philosophical labor did eventually happen, but the point is that it happened well after AGI, and pre-AGI nothing major needed to be done to delegate to the future humans.
- Sammy Martin 2 Dec 2019 16:13 UTC
  1 point
  The scenario where every human gets an intent-aligned AGI, and each AGI learns its own operator’s particular values, would be a case where each individual AGI is following something like ‘Distilled Human Preferences’, or possibly just ‘Ambitious Learned Value Function’, as its Value Definition, so a fairly direct scenario. However, the overall outcome would be more towards the indirect end, because a multipolar world with lots of powerful humans using AGIs and trying to compromise would (you anticipate) end up converging on our CEV, or Moral Truth, or something similar. I didn’t consider direct vs indirect in the context of multipolar scenarios like this (nor did Bostrom, I think), but it seems sufficient to just say that the individual AGIs use a fairly direct Value Definition while the outcome is indirect.