Rob Bensinger comments on AGI Ruin: A List of Lethalities

Rob Bensinger 8 Jun 2022 20:58 UTC
LW: 10 AF: 6
I understand the first part of your comment as “sure, it’s possible for minds to care about reality, but we don’t know how to target value formation so that the mind cares about a particular part of reality.” Is this a good summary?
Yes!
I was, first, pointing out that this problem has to be solvable, since the human genome solves it millions of times every day!
True! Though everyone already agreed (e.g., EY asserted this in the OP) that it’s possible in principle. The updatey thing would be if the case of the human genome / brain development suggests it’s more tractable than we otherwise would have thought (in AI).
Seems to me like it’s at least a small update about tractability, though I’m not sure it’s a big one? Would be interesting to think about the level of agreement between different individual humans with regard to ‘how much particular external-world things matter’. Especially interesting would be cases where humans consistently, robustly care about a particular external-world thingie even though it doesn’t have a simple sensory correlate.
(E.g., humans developing to care about sex is less promising insofar as it depends on sensory-level reinforcement such as orgasms. Humans developing to care about ‘not being in the Matrix / not being in an experience machine’ is possibly more promising, because it seems like a pretty common preference that doesn’t get directly shaped by sensory rewards.)
3. Producing a mind which reliably terminally values a specific non-sensory entity, like diamonds
Is the distinction between 2 and 3 that “dog” is an imprecise concept, while “diamond” is precise? FWIW, 2 and 3 currently sound very similar to me, if 2 is ‘maximize the number of dogs’ and 3 is ‘maximize the number of diamonds’.
If you could reliably build a dog maximizer, I think that would also be a massive win and would maybe mean that the alignment problem is mostly-solved. (Indeed, I’m inclined to think that’s a harder feat than building a diamond maximizer, and I think being able to build a diamond maximizer would also suggest the strawberry-grade alignment problem is mostly solved.)
But maybe I’m misunderstanding 2.
Nope, wasn’t meaning any of these! I was talking about “causing the optimizer’s goals to point at things in the real world” the whole time.
Cool!
I’ll look more at your shards document and think about your arguments here. :)
- TurnTrout 9 Jun 2022 1:36 UTC
  LW: 6 AF: 4
  Is the distinction between 2 and 3 that “dog” is an imprecise concept, while “diamond” is precise? FWIW, 2 and 3 currently sound very similar to me, if 2 is ‘maximize the number of dogs’ and 3 is ‘maximize the number of diamonds’.
  Feat #2 is: Design a mind which cares about anything at all in reality which isn’t a shallow sensory phenomenon which is directly observable by the agent. Like, maybe I have a mind-training procedure, where I don’t know what the final trained mind will value (dogs, diamonds, trees having particular kinds of cross-sections at year 5 of their growth), but I’m damn sure the AI will care about something besides its own sensory signals. Such a procedure would accomplish feat #2, but not #3.
  Feat #3 is: Design a mind which cares about a particular kind of object. We could target the mind-training process to care about diamonds, or about dogs, or about trees, but to solve this problem, we have to ensure the trained mind significantly cares about one kind of real-world entity in particular. Therefore, feat #3 is strictly harder than feat #2.
  If you could reliably build a dog maximizer, I think that would also be a massive win and would maybe mean that the alignment problem is mostly-solved. (Indeed, I’m inclined to think that’s a harder feat than building a diamond maximizer
  I actually think that the dog- and diamond-maximization problems are about equally hard, and, to be totally honest, neither seems that bad^[1] in the shard theory paradigm.
  Surprisingly, I weakly suspect the harder part is the “maximize” in “maximize real-world dogs in expectation,” not the “real-world dogs.” I think “figure out how to build a mind which cares about the number of real-world dogs, such that the mind intelligently selects plans which lead to a lot of dogs” is significantly easier than building a dog-maximizer.
  1. ^
    I appreciate that this claim is hard to swallow. In any case, I want to focus on inferentially-closer questions first, like how human values form.