This post is about alignment targets, or what we want an AGI to do, a mostly separable topic from technical alignment, or how we get an AGI to do anything in particular. See my "Conflating value alignment and intent alignment is causing confusion" for more.
There’s a pretty strong argument that technical alignment is far more pressing, so much so that addressing alignment targets right now is barely helpful compared to doing nothing, and anti-helpful relative to working on technical alignment or “societal alignment” (getting our shit collectively together).
In particular, those actually in charge of building AGI will want it aligned to their own intent, and they’ll have an excuse: it’s probably genuinely a good bit more dangerous to aim directly at value alignment than to aim for some sort of long reflection. Instruction-following is the current default alignment target, and it will likely remain so through our first AGI(s) because it offers substantial corrigibility. Value alignment does not, so we’d have to get it right on the first real try, in both the technical sense and the wisdom sense you address here.
More on that argument here.
Yes, this is partly true, but assumes that we can manage technical alignment in a way that is separable from the values we are aiming towards—something that I would have assumed was true before we saw the shape of LLM “alignment” solutions, but no longer think is obvious.
And instruction-following is deeply worrying as an ‘alignment target’, since it doesn’t say anything about what ends up happening, much less actually guarantee corrigibility (especially since we’re not getting meaningful oversight). But that’s a very different argument than the one we’re making here.
I agree that instruction-following is deeply worrying as an alignment target. But it seems to be what developers will use; wishing it otherwise won’t make it so. And you’re right that the shape of current LLM alignment “solutions” includes values. I don’t think those are actual solutions to the hard part of the alignment problem. If developers use something like HHH as a major component of the alignment target for an AGI, I think that and other such targets have pretty obvious outer alignment failure modes, and probably less obvious inner alignment problems, so that approach simply fails.
I think and hope that, as they approach AGI and take the alignment problem somewhat seriously, developers will try to make instruction-following the dominant component. Instruction-following does seem to help non-trivially with the hard part, because it provides corrigibility and other useful flexibility. See those links for more.
That is a complex and debatable argument, but the argument that developers will pursue intent alignment seems simpler and stronger. Having an AGI that primarily does your bidding seems both safer and more in your self-interest than having one that makes fully autonomous decisions based on some attempt to define everyone’s ideal values.
“This is what’s happening and we’re not going to change it” isn’t helpful, both because it’s just saying we’re all going to die, and because it fails to specify what we’d like to have happen instead. We’re not proposing a specific course for us to influence AI developers; we’re first trying to figure out what future we’d want.