The one good thing about the nature of the technical problem of alignment is that it makes hyperexistential risks (the risks of astronomical suffering) very unlikely.
The problem of AI Alignment can be viewed as the problem of encoding our preferences into an AGI, bit by bit. The strength of our alignment tools, in turn, translates to how many bits we can encode. With the current methods of end-to-end training, we’re essentially sampling preferences at random. Perfect interpretability and parameter-surgery tools would let us encode an arbitrary number of bits. The tools we’ll actually have will fall somewhere between these two extremes.
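A toy way to make this quantitative (my own back-of-the-envelope framing, with K and n as illustrative symbols, not anything rigorous): if the desired outcome takes K bits to specify and our tools reliably pin down n of them, the remaining K − n bits get sampled at random, so

$$\Pr[\text{outcome hits the target}] \approx 2^{-(K - n)}.$$

At n = 0 (pure end-to-end training, on this model) we’re sampling blindly; at n = K (perfect interpretability and parameter surgery) success is guaranteed.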
“Build us our perfect world” is a very complicated ask, and it surely takes up many, many thousands of bits. That’s why alignment is hard.
“Build us a hell” is its mirror. It’s essentially the same ask, except with a flipped sign. As such, specifying it would require pretty much the same number of bits.
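A hedged sketch of why, in algorithmic-information terms (my gloss on the argument, with c as an illustrative constant): negating a utility function is a constant-length edit, so the shortest descriptions of the two asks differ by at most some fixed constant c:

$$U_{\text{hell}} = -\,U_{\text{utopia}} \;\Rightarrow\; \bigl|\,K(U_{\text{hell}}) - K(U_{\text{utopia}})\,\bigr| \le c,$$

where K(·) is description length and c is the size of the fixed “flip the sign” program. So any toolset within c bits of specifying one ask is within c bits of specifying the other, and the capability window where a hell-builder is reachable but a utopia-builder isn’t is only about c bits wide.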
Thus, in the timelines where we have alignment tools advanced enough to build a hell-making AGI, it’s overwhelmingly likely that we also have the technical capability to build a utopia-building AGI. On the flip side, conditioning on our inability to build a utopia-builder, our tools are probably so bad we can’t come anywhere close to a hell-builder. In that case, we just sample some random preferences, and the AGI kills us quickly and painlessly.
Screwing up so badly we create a suffering-maximizer is vanishingly unlikely: it’s only possible in a very, very narrow range of technical capabilities.
I am worried about S-risks, though. I think they’re pretty likely in timelines where we solve the technical problem of alignment, but the technology ends up in the wrong hands; central examples being xenophobic or authoritarian political entities.
I’m concerned it may be neglected, too: I expect the various AI Governance/field-building initiatives may not be spending any time considering how to avoid attracting the wrong kind of attention, instead simply maximizing for as much attention as possible. (Though I suppose if they’re competent at that, I wouldn’t see any public evidence of them considering it; I’m just guessing on priors.)
Edit: Mm, though there’s a caveat. I’m operating under the least forgiving model of the alignment problem; under it, S-risks really are that unlikely. But many people don’t share it (e.g., shard theory assumes “rough” alignment will suffice to avoid omnicide), which should make their P(hell) non-negligible. Yet they’re not worried either, so there must be something else going on with their models.
Thx! Yep, your edit basically captures most of what I would reply. If alignment turns out so hard that we can’t get any semblance of human values encoded at all, then I’d also guess that hell is quite unlikely. But there are caveats: e.g., given a non-obvious inner alignment failure, we could get a system that technically doesn’t care about any semblance of human values but doesn’t make that apparent, because ostensibly optimizing for human values appears useful to it at the time. That could still cause hell, perhaps even with higher-than-normal probability.