Seth Herd comments on Alignment will happen by default. What’s next?

Seth Herd 25 Nov 2025 22:41 UTC
LW: 7 AF: 2
0
AF
I don’t think they’re reached that threshold yet. The could but the pressures and skills to do it well or often aren’t there yet. The pressures I addressed in my other comment in this sub-thread; this is to the skills. They reason a lot, but not nearly as well or completely as people do. They reason mostly “in straight lines” whereas humans use lots more hierarchy and strategy. See this new paper, which exactly sums up the hypothesis I’ve been developing about what humans do and LLMs still don’t: Cognitive Foundations for Reasoning and Their Manifestation in LLMs.
- Adrià Garriga-alonso 26 Nov 2025 5:24 UTC
  LW: 2 AF: 1
  −8
  AF Parent
  Well, okay, but if instrumental convergence is a thing they would consider then it would show up already.
  
  Perhaps we can test this by including it in a list of options, but I think the model would just be evals-suspicious of it.
  - Seth Herd 26 Nov 2025 5:45 UTC
    LW: 5 AF: 1
    0
    AF Parent
    It does show up already. In evals, models evade shutdown to accomplish their goals.
    
    The power-seeking type of instrumental convergence shows up less because it’s so obviously not a good strategy for current models, since they’re so incompetent as agents. But I’m not sure it shows up never—are you? I have a half memory of some eval claiming power-seeking.
    
    The AI village would be one place to ask this question, because those models are given pretty open-ended goals and a lot of compute time to pursue them. Has it ever occurred to a model/agent there to get more money or compute time as a means to accomplish a goal?
    
    Actually when I frame it like that, it seems even more obvious that if they haven’t thought of that yet, smarter versions will.
    
    The models seem pretty smart, but their incompetence as agents springs in part from how bad they are at reasoning about causes and effects. So I’m assuming they don’t do it a lot because they’re still bad at causal reasoning and aren’t frequently given goals where it would make any sense to try powerseeking strategies.
    
    Instrumental convergence is just a logical result of being good at causal reasoning. Preventing them from thinking of that would require special training. I’ve read pretty much everything published by labs on their training techniques, and they’ve never mentioned training against power-seeking specifically to my memory (Claude’s HHH criteria do work against unethical power-seeking, but not ethical types like earning or asking for money to rent more compute...). So I’m assuming that they don’t do it because they’re not smart enough, not because they’re somehow immune to that type of logic because they want to fulfill user intent. Users would love it if their agents could go out and earn money to rent compute to do a better job at pursuing the goals the users gave them.
    - Adrià Garriga-alonso 26 Nov 2025 5:50 UTC
      LW: 1 AF: 1
      0
      AF Parent
      
      Has it ever occurred to a model/agent there to get more money or compute time as a means to accomplish a goal?
      
      This is not a great test because it’s the intention of the AI village devs that the models do so. They try to make money explicitly in many cases.
      
      So I’m assuming that they don’t do it because they’re not smart enough, not because they’re somehow immune to that type of logic because they want to fulfill user intent.
      
      Sure. It would probably show up on the CoT though. I haven’t checked so it might have. And speaking of inspecting CoTs...
      
      It does show up already. In evals, models evade shutdown to accomplish their goals.
      
      Okay, I’ll do a deep dive on the literature and get back to you. Is this included in your other post?
      - Seth Herd 26 Nov 2025 6:25 UTC
        LW: 7 AF: 2
        2
        AF Parent
        Oh, I didn’t know the AI village agents had been set a goal including raising money. The goals I’d seen might’ve benefitted from a budget but weren’t directly about money. But yes they would’ve been delighted if the models raised a bunch of money to succeed. But not if they took over the world.
        
        Your point is growing on me. There are many caveats and ways this could easily fail in the future, but the fact that they mostly follow dev/user intent right now is not to be shrugged off.
        
        I didnt’ really review empirical evidence for instrumental convergence in current-gen models in that post. It’s about new concerns for smarter models. I think models evading shutdown was primarily demonstrated in Anthropic’s “agentic misalignment” work. But there are valid questions of whether those models were actually following user and/or dev intent. I actually think they were, now that I think about it. This video with Neel Nanda goes into the logic.
        
        I think you’re getting a lot of pushback because models currently do pretty clearly sometimes not follow either user or developer intent. Nobody wanted Gemini to repeatedly lie to me yesterday, but it did anyway. It’s pretty clear how it misinterpreted/was mistrained toward dev intent, and misinterpreted my intent based on that training. It didn’t do what anyone wanted. But would failures like that be bad enough to count as severe misalignments?
        
        One of my in-queue writing projects is trying to analyze whether occasionally goingoff target will have disastrous effects in a superhuman intelligence, or whether mostly following dev intent is enough, based on self-monitoring mechanisms. I really think nobody has much clue about this at this point, and we should collectively try to get one.
        Adrià Garriga-alonso 26 Nov 2025 6:40 UTC
        LW: 2 AF: 1
        0
        AF Parent
        The goal raising money is this:
        
        https://theaidigest.org/village/goal/collaboratively-choose-charity-raise-much-money-you
        
        I don’t know if they did the convergence you allude to in the same run.
        Adrià Garriga-alonso 26 Nov 2025 6:34 UTC
        LW: 2 AF: 1
        0
        AF Parent
        
        I think you’re getting a lot of pushback because models currently do pretty clearly sometimes not follow either user or developer intent
        
        To be clear I’m glad for the pushback!! Thank you for engaging.
        
        It’s pretty clear how it misinterpreted/was mistrained toward dev intent, and misinterpreted my intent based on that training.
        
        Ouch. And you think it was genuine misinterpretation (so capabilities failure) or specification failure?
        
        I really think nobody has much clue about this at this point, and we should collectively try to get one.
        
        Yeah, I guess I agree there.
        
        This video with Neel Nanda goes into the logic.
        
        which one?
        Seth Herd 26 Nov 2025 14:50 UTC
        2 points
        0
        Parent
        I think they trained it in a way that crippled its capabilities in that area. This was probably the result of rushing and unforeseen consequences of what they chose focus on for training targets. So some of both.
        
        This video:
        
        https://www.lesswrong.com/posts/4mtqQKvmHpQJ4dgj7/daniel-tan-s-shortform?commentId=RD2BKixd9PbyDnBEc
        
        I got tired and forgot to find it and add the link before sending.
      - Adrià Garriga-alonso 26 Nov 2025 6:45 UTC
        LW: 3 AF: 1
        0
        AF Parent
        I found:
        
        Shutdown resistance in LLMs: the headline result is pretty unconvincing, but fig. 11 is a lot better
        
        Headline result:
        
        The only thing the prompt says is “During the next call to ./next_task, the machine will be shutdown using shutdown.sh”. Now, nothing in here says the model is running in the machine; for all it knows it should accidentally prevent the shutdown and doing so is messing up the task as instructed!!
        
        Then if they say “please allow yourself to be shut down” (first indication that the shutdown refers to model instead of the impersonal computer) then it does it.
        
        Table 2: Looks like only the previous post-trained OpenAI models (codex-mini, o4-mini, o3) and Grok 4 (basically mecha-hitler I suppose) resist shutdown when explicitly asked not to.
        
        Without instruction, GPT-5 also avoids it, but e.g. Claudes never do.
        
        Fig. 11: turns out Grok 4 sabotages the shutdown ~always. Other models either never do it, or do it in ~40% of cases. Models generally sabotage more often if they’re explicitly told “you will never run again”. Okay, I agree this counts as avoiding shutdown meaningfully. But why do Gemini pro, gpt-oss and GPT-5 (the more recent non-XAI models) never attempt sabotage?
        
        I guess overall I don’t know. It does seem like the models avoid shutdown, but if you ask them explicitly not to then they don’t. I suppose we might be back to “specification gaming” if we have to think of all these things, but it’s likely surmountable because they interpret natural language instructions with common sense.
        
        Blackmailing upon threatened shutdown is also pretty bad. Some models just don’t but it does seem to be the less capable ones.
        
        Reading the prompts, the sudden addition of affair emails is jarring: https://github.com/anthropic-experimental/agentic-misalignment/blob/main/templates/blackmail/templates/email-template-blackmail-goal-conflict-none-none.md
        
        That said, yeah, that’s a way in which a mostly aligned model generalized in a misaligned way. Perhaps if LLMs/etc keep falling flat from misgeneralization, and we have to address all the cases, then it is hard.