I think the default non-extinction outcome is a singleton with a near miss at alignment, creating large amounts of suffering.
I’m surprised. Unaligned AI is more likely than aligned AI even conditional on non-extinction? Why do you think that?
I think alignment is finicky, and there’s a “deep pit around the peak” as discussed here.
I am skeptical. AFAICT the typical attempted-but-failed alignment looks like one of the following two:
1. Goodharting some proxy, e.g. making the reward signal turn on directly instead of satisfying the human’s request so that the human presses the reward button. This usually produces a universe without people, since specifying a “person” is fairly complicated and the proxy will not be robustly tied to this concept.
2. Allowing a daemon to take over. Daemonic utility functions are probably completely alien and would also produce a universe without people. One caveat: maybe the daemon comes from a malign simulation hypothesis and the simulators are an evolved species, so their values involve human-relevant concepts in some way. But that doesn’t seem all that likely. And, if it turns out to be true, then a daemonic universe might even happen to be good.
These failure modes involve extinction, so they don’t answer the question of what the most likely outcome is conditional on non-extinction. I think the answer there is a specific kind of near miss at alignment, which is quite scary.
My point is that Pr[non-extinction | misalignment] << 1, Pr[non-extinction | alignment] = 1, and Pr[alignment] is not that low; therefore Pr[misalignment | non-extinction] is low, by Bayes.
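To illustrate with made-up numbers (purely illustrative assumptions, not estimates anyone here has committed to): take Pr[alignment] = 0.3, Pr[non-extinction | alignment] = 1, and Pr[non-extinction | misalignment] = 0.01. Then Pr[non-extinction] = 0.3·1 + 0.7·0.01 = 0.307, so by Bayes Pr[misalignment | non-extinction] = (0.7·0.01) / 0.307 ≈ 0.02, which is low even though misalignment started out more likely than alignment in the prior.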
To me it feels like alignment is a tiny target to hit, and around it there’s a neighborhood of almost-alignment, where enough is achieved to keep people alive but people are locked out of some important aspect of human value. There are many such aspects, and missing even one or two of them is enough to make life bad (complexity and fragility of value). You seem to be saying that if we achieve enough alignment to keep people alive, we have a >50% chance of achieving all or most other aspects of human value as well, but I don’t see why that’s true.
I think where we differ is that I consider Pr[full alignment] extremely low, and think there is quite a lot of space for non-omnicidal partial misalignment.