For sufficiently general and open-ended tasks, the AI will need to generalize like a human does in order to be safe. Otherwise, the default is to expect a (possibly existential) catastrophe from a benign failure.
This recent post is relevant to my thinking here. For the performance guarantee, you only care about what happens on the training distribution. For the control guarantee, “generalize like a human” doesn’t seem like the only strategy, or even an especially promising strategy.
I assume you think some different kind of guarantee is needed. My best guess is that you expect we’ll have a system that is trying to do what we want, but is very alien and unable to tell what kinds of mistakes might be catastrophic to us, and that there are enough opportunities for catastrophic error that it is likely to make one.
Let me know if that’s wrong.
If that’s right, I think the difference is: I see subtle benign catastrophic errors as quite rare, such that they are quantitatively a much smaller problem than what I’m calling AI alignment, whereas you seem to think they are extremely common. (Moreover, the benign catastrophic risks I see are also mostly things like “accidentally start a nuclear war,” for which “make sure the AI generalizes like a human” is not a especially great response. But I think that’s just because I’m not seeing some big class of benign catastrophic risks that seem obvious to you, so it’s just a restatement of the same difference.)
Could you explain a bit more what kind subtle benign mistake you expect to be catastrophic?
This recent post is relevant to my thinking here. For the performance guarantee, you only care about what happens on the training distribution. For the control guarantee, “generalize like a human” doesn’t seem like the only strategy, or even an especially promising strategy.
I assume you think some different kind of guarantee is needed. My best guess is that you expect we’ll have a system that is trying to do what we want, but is very alien and unable to tell what kinds of mistakes might be catastrophic to us, and that there are enough opportunities for catastrophic error that it is likely to make one.
Let me know if that’s wrong.
If that’s right, I think the difference is: I see subtle benign catastrophic errors as quite rare, such that they are quantitatively a much smaller problem than what I’m calling AI alignment, whereas you seem to think they are extremely common. (Moreover, the benign catastrophic risks I see are also mostly things like “accidentally start a nuclear war,” for which “make sure the AI generalizes like a human” is not a especially great response. But I think that’s just because I’m not seeing some big class of benign catastrophic risks that seem obvious to you, so it’s just a restatement of the same difference.)
Could you explain a bit more what kind subtle benign mistake you expect to be catastrophic?