training on a purely predictive loss should, even in the limit, give you a predictor, not an agent
I fully agree. But
a) Many users would immediately tell that predictor “predict what an intelligent agent would do to pursue this goal!” and all of the standard worries would recur.
b) This is similar to what we are actively doing, applying RL to these systems to make them effective agents.
Both of these re-introduce all of the standard problems. The predictor is now an agent. Strong predictions of what an agent would do include things like instrumental convergence toward power-seeking, incorrigibility, goal drift, reasoning about itself and its “real” goals and discovering misalignments, etc.
There are many other interesting points here, but I won’t try to address more!
I will say that I agree with the content of everything you say, but not the relatively optimistic implied tone. Your list of to-dos sounds mostly unlikely to be done well. I may have less faith in institutions and social dynamics. I’m afraid we’ll just rush and make crucial mistakes, so we’ll fail even if alignment were only somewhere between steam-engine and Apollo levels of difficulty.
This is not inevitable! If we can clarify why alignment is hard and how we’re likely to fail, seeing those futures can prevent them from happening—if we see them early enough and clearly enough to convince the relevant decision-makers to make better choices.
Many users would immediately tell that predictor “predict what an intelligent agent would do to pursue this goal!” and all of the standard worries would recur.
I don’t think it works this way. You would have to create a context in which the true training-data continuation is what a superintelligent agent would do. You can’t, because there are no such continuations in the training data, so the answer to your prompt would look like, e.g., Ted Chiang’s “Understand”. (I grant that you wrote ‘intelligent agent’, like all the humans who wrote the training data; that would work, but wouldn’t be dangerous.)
If we can clarify why alignment is hard and how we’re likely to fail, seeing those futures can prevent them from happening—if we see them early enough and clearly enough to convince the relevant decision-makers to make better choices
Okay, that’s true.
After many years of study, I have concluded that if we fail, it won’t be in the ‘standard way’ (though I’m always open to changing my mind back). Thus we need to identify and solve new failure modes, which I think largely don’t fall under classic alignment-to-developers.
I meant if the predictor were superhumanly intelligent.
You have spent years studying alignment? If so, I think your posts would do better by including more ITT/steelmanning of that worldview.
I agree with your arguments that alignment isn’t necessarily hard. I think there is a complementary set of arguments against alignment being easy. Both must be addressed and factored in to produce a good estimate of alignment difficulty.
I’ve also been studying alignment for years, and my take is that everyone has a poor understanding of the whole problem and so we collectively have no good guess on alignment difficulty.
It’s just really hard to accurately imagine AGI. If it’s just a smarter version of LLMs that acts as a tool, then sure, it will probably be aligned well enough, just like current systems.
But it almost certainly won’t be.
I think that’s the biggest crux between your views and mine. Agency and memory/learning are too valuable, and too easy to add, to stay out of the picture for long.
I’m not sure the reasons Claude is adequately aligned won’t generalize to AGI that’s different in those ways, but I don’t think we have much reason to assume they will.
I’ve probably expressed this best in “LLM AGI may reason about its goals”, the post I linked to previously.