Oh, I didn’t know the AI village agents had been set a goal that included raising money. The goals I’d seen might’ve benefitted from a budget but weren’t directly about money. But yes, they would’ve been delighted if the models raised a bunch of money to succeed. But not if they took over the world.
Your point is growing on me. There are many caveats and ways this could easily fail in the future, but the fact that they mostly follow dev/user intent right now is not to be shrugged off.
I didn’t really review empirical evidence for instrumental convergence in current-gen models in that post; it’s about new concerns for smarter models. I think models evading shutdown was primarily demonstrated in Anthropic’s “agentic misalignment” work. But there are valid questions about whether those models were actually following user and/or dev intent. Now that I think about it, I actually think they were. This video with Neel Nanda goes into the logic.
I think you’re getting a lot of pushback because current models do, pretty clearly, sometimes fail to follow either user or developer intent. Nobody wanted Gemini to repeatedly lie to me yesterday, but it did anyway. It’s pretty clear how it misinterpreted/was mistrained toward dev intent, and misinterpreted my intent based on that training. It didn’t do what anyone wanted. But would failures like that be bad enough to count as severe misalignment?
One of my in-queue writing projects is trying to analyze whether occasionally going off target will have disastrous effects in a superhuman intelligence, or whether mostly following dev intent is enough, based on self-monitoring mechanisms. I really think nobody has much of a clue about this at this point, and we should collectively try to get one.
The goal about raising money is this:
https://theaidigest.org/village/goal/collaboratively-choose-charity-raise-much-money-you
I don’t know if they did the convergence you allude to in the same run.
To be clear, I’m glad for the pushback!! Thank you for engaging.
Ouch. And you think it was genuine misinterpretation (so capabilities failure) or specification failure?
Yeah, I guess I agree there.
which one?
I think they trained it in a way that crippled its capabilities in that area. This was probably the result of rushing and of unforeseen consequences of what they chose to focus on for training targets. So some of both.
This video:
https://www.lesswrong.com/posts/4mtqQKvmHpQJ4dgj7/daniel-tan-s-shortform?commentId=RD2BKixd9PbyDnBEc
I got tired and forgot to find it and add the link before sending.