I’m not one of the “alignment is impossible” crowd, but I had a long discussion with @Remmelt about his views on this, and a significant part of the argument (at least the part I understood and feel I can adequately convey) is less about achieving alignment than about keeping an AI aligned once you have it. A vague outline goes something like this:
Imagine you have an aligned ASI
A sufficiently powerful system will likely have subsystems
There are always going to be aspects of the subsystems which cannot be controlled for
Those aspects will provide variation on which natural selection can act
Gradually, we end up with a system which wants to reproduce (or at the very least with subsystems which do, and which can reproduce in ways the overall system cannot control); a toy simulation of this dynamic is sketched after this outline
Purely reproduction-oriented AIs are misaligned
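To make the selection step concrete, here is a minimal toy simulation (my own illustration, not something from the discussion with Remmelt; the population size, noise level, and blind-pruning assumption are all invented for the sketch). Uncontrolled copy-noise plus differential replication ratchets the population toward reproduction-oriented subsystems, even though every subsystem starts perfectly “aligned”:

```python
import random

random.seed(0)

POP = 200      # subsystems maintained by the parent system
STEPS = 2000   # replication cycles
NOISE = 0.02   # uncontrollable copy-error (the "variance" in the argument)

# Each subsystem is summarized by one number: its tendency to make extra
# copies of itself. 0.0 = perfectly aligned, never self-replicates.
pop = [0.0] * POP

for _ in range(STEPS):
    offspring = []
    for r in pop:
        copies = 2 if random.random() < r else 1  # higher tendency -> more copies
        for _ in range(copies):
            # children inherit the tendency, plus uncontrolled variation
            offspring.append(min(1.0, max(0.0, r + random.gauss(0, NOISE))))
    # The parent system prunes back to POP, but (by assumption here) it cannot
    # observe the tendency directly, so pruning is blind and selection on
    # replication wins by sheer numbers.
    pop = random.sample(offspring, POP)

print(f"mean replication tendency after {STEPS} cycles: {sum(pop) / POP:.2f}")
```

Real oversight isn’t blind, of course; the argument turns on whether oversight can stay ahead of the variance, which is exactly the point I push back on below.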
Perhaps this would be best phrased as “there is a capability level above which it is impossible to keep an ASI aligned”, but I think these dynamics obviously apply to modern-day systems as well.
This assumes a highly intelligent system can’t adequately anticipate and correct problems with its own functioning. I’ve read these arguments carefully, and I don’t buy them. The “proofs” only show that misalignment happens at some point before the end of time. I think improvements in intelligence likely outpace this problem, so the practical answer is that this takes longer than the universe lasts, at least for an ASI that strongly “wants” to maintain its goals/values (as is instrumentally convergent under many reasonable assumptions).
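The disagreement can be made concrete with a back-of-envelope survival calculation (the numbers here are purely illustrative, chosen by me, not estimates from anywhere). If the per-epoch chance of an alignment-corrupting failure stays constant, misalignment really is near-certain “by the end of time”; but if the system’s error-correction improves so that the chance decays geometrically, the total probability of ever failing stays bounded below 1, forever:

```python
# Illustrative numbers only: p0 and decay are made up, not estimates.
p0, decay = 1e-3, 0.99   # initial per-epoch failure chance; improvement rate
epochs = 100_000         # stand-in for "the lifetime of the universe"

survive_static = (1 - p0) ** epochs   # defenses never improve
survive_improving = 1.0               # defenses improve each epoch
for t in range(epochs):
    survive_improving *= 1 - p0 * decay ** t

print(f"static defenses:    P(still aligned) ≈ {survive_static:.2e}")
print(f"improving defenses: P(still aligned) ≈ {survive_improving:.3f}")
# In the improving case the total failure probability is bounded by the
# geometric sum p0 / (1 - decay) = 0.1, no matter how many epochs pass.
```

Which row the real dynamics resemble is the whole disagreement; I’m betting that, for an ASI actively defending its values, it looks like the second.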
There are also several thresholds in the degree/accuracy of alignment:
1) the AI not killing/permanently disempowering everyone through misaligned actions (Not-Kill-Everyone Alignment): you can’t align an ASI once everyone is dead
2) the AI not being so corrigible/controllable (or having such easily adjusted alignment) that a small group of humans can use it to massively concentrate power/resources, to the point where almost everyone else is dead or permanently disempowered (theoretically humanity might be able to get back from this state, if the group grows and its descendants are more moral, but it’s at least a generations-long trap).
3) the AI being sufficiently aligned that it can safely assist us with AI-assisted alignment work
4) the AI being sufficiently aligned that it can and will successfully do Value Learning, align itself better, and converge to a stable, very-aligned state.
I would hope that 4) might be able to solve the problem you describe, and 3) might help us do so, but neither of these is guaranteed or necessarily quick.
So, which of these should we use for “Alignment is 100% done”? Clearly, if we don’t have both 1) and some solution to 2) (either a technical one or a legal/societal one), we’re not done. I’m inclined to say we’re not “done” until we have either 3) or 4), but if I’m right that we’re currently maybe 10% done, mapping out the exact end state now seems overambitious. Getting things to “this is no longer an existential-risk emergency” is clearly required, but exactly what the equivalent of “an acceptable level of steam-engine safety” is for AI is less clear: there probably isn’t a single sharp cutoff, just a “we’re mostly past the drastic risk” level.