Nobody knows. Not to within an order of magnitude. Seriously. I’ve burned a lot of time taking all the arguments seriously, and none on either side are very complete.
This is an unfortunate position to be in. The sane thing to do is slow down if you don’t know how dangerous the road ahead is. But humans are short-sighted. I expect those in power to see this rather simple fact (experts disagree wildly) and realize they should slow down, but I fear that could happen too late.
This is one of my biggest topics of interest. I read everything on it. The arguments on both sides are strong. I have found absolutely no discussion or writing that comes even close to reconciling the different models that produce the arguments on each side. Arguments on both sides are very abstract.
Nobody on either side, nor in the middle, has gotten even close to the object level. That's because it's a really hard problem. It requires either solving alignment-in-general, for any sort of mind, or scoping the problem down to the particular minds we're likely to build first; the agent foundations people have mostly (and rightly, I think) given up on getting the general version done any time soon. But we don't have to align all minds, just the first one(s) significantly smarter than we are. And we don't have to align them perfectly to human values, just get them to reliably follow instructions from a merely minimally smart and kind human (or set of humans).
Figuring out how hard alignment is is a solvable problem. It gets easier as we get closer to building the first AGIs, because that scopes the problem. Of course, the time to do so, and to put that information to good use, is terrifyingly short if we wait for anything like certainty about what the first AGI will actually look like. That's why I'm spending a lot of time predicting AGI architecture while working on the alignment problem.
I won’t even mention my actual guess on how hard alignment is, because like everyone else’s, it’s just a guess. I’ve spent almost as much time on this as anyone, and I don’t yet have much clue. (My technical work is based on my guesses, because what else can you do?)
And I’ll just pitch here that your answer is among the best current pieces at actually working through where the abstract arguments for doom meet the reality of what’s happening now, and therefore the most likely path to the first takeover-capable AGI.
Now, that’s for technical alignment. The additional problems of societal alignment (whose values is it aligned to, and how does that all shake out) are a different ball of wax. The two intertwine; I personally suspect that technical alignment for LLM-based AGI is pretty doable, but that the lack of societal alignment makes it devilishly difficult in practice.
The uncertainty also means, from our current perspective: NO FATE. It’s time to work, not celebrate or despair.
I expect those in power to see this rather simple fact (experts disagree wildly) and realize they should slow down, but I fear that could happen too late.
I don’t expect that to happen until at least some experts are saying that the danger is imminent rather than a few years away, and probably not until we get a moderately impressive near miss that supports this claim. Currently, basically everyone agrees that models are not yet existentially dangerous.
Racing all the way to the edge of the precipice and then slamming on the brakes at the very last moment, as if we're playing Chicken, has a very obvious failure mode. Nevertheless, that's what I expect society to attempt. Which unfortunately means those of us taking part in the public discourse need to avoid being the Boy Who Cried Wolf before we're actually within clear sight of the edge, and restrain ourselves to posting warnings about the probability of wolves ahead.
Now, that’s for technical alignment. The additional problems of societal alignment (whose values is it aligned to, and how does that all shake out) are a different ball of wax.
Absolutely! I have some opinions on that too, but that seems like an area where the people who work on governance problems probably have more leverage.
Yes, my question (and my answer) are about how hard a problem technical alignment is; and as I discuss there, it assumes that, at the moment, the best path to put most of our work into is aligning LLMs, for three reasons: (1) if ASI happens via LLMs, it will probably happen sooner than it would via some other architecture; (2) even if our ASI isn't LLM-based or LLM-like, it may well still contain an LLM as a subcomponent or I/O device (or at least something trained by distilling information and behavior from humans using SGD); and (3) it's generally more productive to work on something that already exists than on something still mostly hypothetical. I'm glad there are people like Steven Byrnes working on other approaches to aligning other sorts of ASI (having a range of bets is good), but I think for now putting the bulk of our effort into aligning LLMs makes sense.