This is great! I am puzzled as to how this got so few upvotes. I just added a big upvote after getting back to reading it in full.
I think consideration of alignment targets has fallen out of favor as people have focused more on understanding current AI and technical approaches to directing it—or completely different activities for those who think we shouldn’t be trying to align LLM-based AGI at all. But I think it’s still important work that must be done before someone launches a “real” (autonomous, learning, and competent) AGI.
I agree that people mean different things by alignment targets. I also think it’s quite helpful to have a summary of how they’re defined and implemented for training current systems.
This will be my new canonical reference for what is meant by “alignment target” in the context of network-based AI.
I have only one major hesitation or caveat. I am wary of even using the term “alignment” for current LLMs, because they do not strongly pursue goals in a consequentialist way, nor do they evolve over time as they learn and think, as I expect future really dangerous AI will do. “LLM AGI will have memory, and memory changes alignment” is my clearest statement to date of why. (I’ve also used the term “the alignment stability problem” for this gap in what current prosaic alignment work addresses.) However, using “alignment” and “alignment target” for both current and future systems seems inevitable and more-or-less correct; I just use and think about the two senses of the terms with caution.
Your proposal of an alignment target sounds a good bit like my Instruction-following and like Max Harms’ Corrigibility as Singular Target, which I highly recommend if you want to pursue that direction.
Thank you for your kind words! I’m glad you liked it. Your instruction-following post is a good fit for one of my examples, so I will edit in a link to it.
I agree that alignment is a somewhat awkwardly used term. I think the original definition relies on AI having quite cleanly defined goals in a way that is probably unrealistic for sufficiently complex systems, and certainly doesn’t apply to LLMs. As a result, it often ends up being approximated to mean something more like shaping a set of behavioural tendencies, such as trying to teach the AI to always take the appropriate action in any given context. I tend to lean into this latter interpretation.
I haven’t had time to read your other links yet but will take a look!