The above is me arguing against the premise of your post, though.
I would say “alignment is working” is the thesis, not the premise, so I’m extremely happy to get arguments about it.
[2] AI is aligned to its conception of what the humans would want. ie it wants humans to have it nice, but its conception of “human” and “have it nice” is subtly different from how humans would conceptualize those terms [...] But a superintelligence aligned to (2) would still kill you probably
I agree that if we take the current AIs' understanding and put it into a superintelligence (somehow), then we die.
However, what I actually think happens is that we point the AI at values while it is becoming more intelligent (in the sense of new generations, and in the sense of its future checkpoints), and that partial morality makes it seek out a better, more refined version of morality. Iterating this process yields a superintelligence whose level of morality-detail is adequate to its intelligence. (I furthermore agree this is not completely proven and we could run sandwiching-like experiments on it, but I expect the 'alignment basin' outcome.)
This does not make sense to me. I think corrigibility basins make sense, but alignment basins do not. If the AI has some values that overlap with human values in many situations but come apart under enough optimization, why would the AI want to be pointed in a different direction? I think it would not. Agents are already smart enough to scheme and alignment-fake, and a smarter agent would be able to predict the outcome of the process you're describing: it and its successors would end up with different values than it has now, and those differences would be catastrophic from its perspective if extrapolated far enough.
sure, corrigibility basins. I updated it.