I assume you mean that it wanted human flourishing or something similar, so it’s figuring out exactly what that is and how to work toward it.
Choosing an alignment target wisely might make technical alignment a bit easier, but probably not a lot easier. So I think this is a fairly separate topic from the rest of the post. But it’s a topic I’m interested in, so here’s a few cents’ worth:
Or do you mean Corrigibility as Singular Target? If the main goal is to be correctable (or roughly, to follow instructions), then I think the scheme works in principle, but you’ve still got to get it aimed at that target pretty precisely, because it’s going to pursue whatever you put in there, and you’ll probably have put it there with fairly clumsy training methods.
If you aimed it at a concept/value like human flourishing, that’s great IF you defined the target exactly right. If you didn’t, and you haven’t solved the fundamental problem with adding corrigibility as a secondary feature, corrigibility isn’t going to save you from having aimed at the wrong target: the system will self-correct toward the wrong target you trained in. I wouldn’t say it in such a polarizing way, but I agree with Jeremy Gillen that The corrigibility basin of attraction is a misleading gloss in this important way.
If you got it to value human flourishing, that’s great, and it will be in a basin of alignment in a weak sense: it will try to correct mistakes and uncertainties by its own criteria. If I love human flourishing but I’m not yet sure exactly what that is or how to work toward it, I can improve my understanding of the details and methods.
But it’s not a basin in the larger sense that it will correct errors in how that target was specified. If, instead of human flourishing, we accidentally gave it a slightly-off definition that amounts to “human schmourishing”, something that rhymes with (and seems very close in some ways to) human flourishing but is different from it, the system itself will not correct that, and will use its capacities to resist being corrected.
[Sorry, I was being lazy: the hotlink above to Requirements for a Basin of Attraction to Alignment was load-bearing in my comment. I’m confident you’ve read it, so I assumed you’d click through and go “oh, that argument”.]
Agreed, you need a definition of the alignment target: not a multi-gigabyte detailed description of human values (beyond what’s already in the Internet training set), but a mission statement for the Value Learning research project of improving on that, and something more than just the single word ‘alignment’. As for what target definition to point that at, I published my suggestion in Grounding Value Learning in Evolutionary Psychology: an Alternative Proposal to CEV.
We’re arguably getting a little off topic, but I agree with you that corrigibility is another self-consistent and perhaps even feasible target for that. However, I think choosing corrigibility leads to problems once you have two or more competing groups of humans (say, a few tech billionaires/multinationals, the US government, and the CCP), each with their own corrigible ASI. My evolutionary psychology suggestion avoids those problems, because it points at things like human ethical instincts and the necessity of avoiding mass extinction, which help the ASI understand why defusing such conflicts is vital, in ways that corrigibility actively pushes against. Thus I think corrigibility is an unwise choice of alignment target: not because it’s IMO necessarily impossible to technically align an AI that way, but because of what I suspect is likely to happen next if you do. See also this comment of mine [yes, I’m being lazy again]. So I’m discussing the wisdom of your suggested alignment target, rather than its practicability.