There are also some thresholds in the degree/accuracy of alignment:
1) the AI not killing/permanently disempowering everyone through misaligned actions (Not-Kill-Everyone Alignment) — you can’t align ASI once you’re all dead
2) the AI not being so corrigible/controllable, or having such easily adjusted alignment, that a small group of humans can use the AI to massively concentrate power/resources to the point where almost everyone else is dead or permanently disempowered (theoretically humanity might be able to get back from this state, if the group grows and their descendants are more moral, but it’s at least a generational-duration trap).
3) the AI is sufficiently aligned to safely assist us with AI-assisted Alignment
4) the AI is sufficiently aligned that it can and will successfully perform Value Learning, align itself better, and converge to a stable, very-aligned state.
I would hope that 4) might be able to solve the problem you describe, and 3) might help us do so, but neither of these is guaranteed or necessarily quick.
So, which of these should we use for “Alignment is 100% done”? Clearly if we don’t have both 1) and some solution to 2) (either a technical one, or a legal/societal one), we’re not done. I’m inclined to say we’re not “done” until we have either 3) or 4); but if I’m right that we’re currently maybe 10% done, mapping out the exact end state now seems overambitious. Getting to “this is no longer an existential risk emergency” is clearly required, but exactly what the equivalent of “an acceptable level of steam engine safety” is for AI is less clear: there probably isn’t a single sharp cutoff, just a “we’re mostly past the drastic risk” level.