I think the default non-extinction outcome is a singleton with a near miss at alignment, creating large amounts of suffering.
I’m surprised. Unaligned AI is more likely than aligned AI even conditional on non-extinction? Why do you think that?
I think alignment is finicky, and there’s a “deep pit around the peak” as discussed here.
I am skeptical. AFAICT the typical attempted-but-failed alignment looks like one of the following two:
1. Goodharting some proxy, e.g. making the reward signal turn on directly instead of satisfying the human’s request so that the human presses the reward button. This usually produces a universe without people, since specifying a “person” is fairly complicated and the proxy will not be robustly tied to this concept.
2. Allowing a daemon to take over. Daemonic utility functions are probably completely alien and would also produce a universe without people. One caveat: maybe the daemon comes from a malign simulation hypothesis and the simulators are an evolved species, so their values involve human-relevant concepts in some way. But that doesn’t seem all that likely. And, if it turns out to be true, then a daemonic universe might even happen to be good.
These failure modes involve extinction, so they don’t answer the question of what the most likely outcome is conditional on non-extinction. I think the answer there is a specific kind of near miss at alignment, which is quite scary.
My point is that Pr[non-extinction | misalignment] << 1, Pr[non-extinction | alignment] = 1, and Pr[alignment] is not that low; therefore Pr[misalignment | non-extinction] is low, by Bayes.
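To illustrate with made-up numbers (purely illustrative assumptions, not estimates anyone here has committed to): take Pr[alignment] = 0.3, Pr[non-extinction | alignment] = 1, and Pr[non-extinction | misalignment] = 0.01. Then Pr[non-extinction] = 0.3·1 + 0.7·0.01 = 0.307, so by Bayes Pr[misalignment | non-extinction] = (0.7·0.01) / 0.307 ≈ 0.02, which is low even though misalignment started out more likely than alignment in the prior.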
To me it feels like alignment is a tiny target to hit, and around it there’s a neighborhood of almost-alignment, where enough is achieved to keep people alive but people are locked out of some important aspect of human value. There are many such aspects, and missing even one or two of them is enough to make life bad (complexity and fragility of value). You seem to be saying that if we achieve enough alignment to keep people alive, we have a >50% chance of achieving all or most other aspects of human value as well, but I don’t see why that’s true.
I think where we differ is that I consider Pr[full alignment] extremely low, and think there is quite a lot of space for non-omnicidal partial misalignment.