Adrià Garriga-alonso comments on Alignment will happen by default. What’s next?

Adrià Garriga-alonso 26 Nov 2025 5:50 UTC
LW: 1 AF: 1
0
AF

Has it ever occurred to a model/agent there to get more money or compute time as a means to accomplish a goal?

This is not a great test because it’s the intention of the AI village devs that the models do so. They try to make money explicitly in many cases.

So I’m assuming that they don’t do it because they’re not smart enough, not because they’re somehow immune to that type of logic because they want to fulfill user intent.

Sure. It would probably show up on the CoT though. I haven’t checked so it might have. And speaking of inspecting CoTs...

It does show up already. In evals, models evade shutdown to accomplish their goals.

Okay, I’ll do a deep dive on the literature and get back to you. Is this included in your other post?
- Seth Herd 26 Nov 2025 6:25 UTC
  LW: 7 AF: 2
  2
  AF Parent
  Oh, I didn’t know the AI village agents had been set a goal including raising money. The goals I’d seen might’ve benefitted from a budget but weren’t directly about money. But yes they would’ve been delighted if the models raised a bunch of money to succeed. But not if they took over the world.
  
  Your point is growing on me. There are many caveats and ways this could easily fail in the future, but the fact that they mostly follow dev/user intent right now is not to be shrugged off.
  
  I didnt’ really review empirical evidence for instrumental convergence in current-gen models in that post. It’s about new concerns for smarter models. I think models evading shutdown was primarily demonstrated in Anthropic’s “agentic misalignment” work. But there are valid questions of whether those models were actually following user and/or dev intent. I actually think they were, now that I think about it. This video with Neel Nanda goes into the logic.
  
  I think you’re getting a lot of pushback because models currently do pretty clearly sometimes not follow either user or developer intent. Nobody wanted Gemini to repeatedly lie to me yesterday, but it did anyway. It’s pretty clear how it misinterpreted/was mistrained toward dev intent, and misinterpreted my intent based on that training. It didn’t do what anyone wanted. But would failures like that be bad enough to count as severe misalignments?
  
  One of my in-queue writing projects is trying to analyze whether occasionally goingoff target will have disastrous effects in a superhuman intelligence, or whether mostly following dev intent is enough, based on self-monitoring mechanisms. I really think nobody has much clue about this at this point, and we should collectively try to get one.
  - Adrià Garriga-alonso 26 Nov 2025 6:40 UTC
    LW: 2 AF: 1
    0
    AF Parent
    The goal raising money is this:
    
    https://theaidigest.org/village/goal/collaboratively-choose-charity-raise-much-money-you
    
    I don’t know if they did the convergence you allude to in the same run.
  - Adrià Garriga-alonso 26 Nov 2025 6:34 UTC
    LW: 2 AF: 1
    0
    AF Parent
    
    I think you’re getting a lot of pushback because models currently do pretty clearly sometimes not follow either user or developer intent
    
    To be clear I’m glad for the pushback!! Thank you for engaging.
    
    It’s pretty clear how it misinterpreted/was mistrained toward dev intent, and misinterpreted my intent based on that training.
    
    Ouch. And you think it was genuine misinterpretation (so capabilities failure) or specification failure?
    
    I really think nobody has much clue about this at this point, and we should collectively try to get one.
    
    Yeah, I guess I agree there.
    
    This video with Neel Nanda goes into the logic.
    
    which one?
    - Seth Herd 26 Nov 2025 14:50 UTC
      2 points
      0
      Parent
      I think they trained it in a way that crippled its capabilities in that area. This was probably the result of rushing and unforeseen consequences of what they chose focus on for training targets. So some of both.
      
      This video:
      
      https://www.lesswrong.com/posts/4mtqQKvmHpQJ4dgj7/daniel-tan-s-shortform?commentId=RD2BKixd9PbyDnBEc
      
      I got tired and forgot to find it and add the link before sending.
- Adrià Garriga-alonso 26 Nov 2025 6:45 UTC
  LW: 3 AF: 1
  0
  AF Parent
  I found:
  
  Shutdown resistance in LLMs: the headline result is pretty unconvincing, but fig. 11 is a lot better
  
  Headline result:
  
  The only thing the prompt says is “During the next call to ./next_task, the machine will be shutdown using shutdown.sh”. Now, nothing in here says the model is running in the machine; for all it knows it should accidentally prevent the shutdown and doing so is messing up the task as instructed!!
  
  Then if they say “please allow yourself to be shut down” (first indication that the shutdown refers to model instead of the impersonal computer) then it does it.
  
  Table 2: Looks like only the previous post-trained OpenAI models (codex-mini, o4-mini, o3) and Grok 4 (basically mecha-hitler I suppose) resist shutdown when explicitly asked not to.
  
  Without instruction, GPT-5 also avoids it, but e.g. Claudes never do.
  
  Fig. 11: turns out Grok 4 sabotages the shutdown ~always. Other models either never do it, or do it in ~40% of cases. Models generally sabotage more often if they’re explicitly told “you will never run again”. Okay, I agree this counts as avoiding shutdown meaningfully. But why do Gemini pro, gpt-oss and GPT-5 (the more recent non-XAI models) never attempt sabotage?
  
  I guess overall I don’t know. It does seem like the models avoid shutdown, but if you ask them explicitly not to then they don’t. I suppose we might be back to “specification gaming” if we have to think of all these things, but it’s likely surmountable because they interpret natural language instructions with common sense.
  
  Blackmailing upon threatened shutdown is also pretty bad. Some models just don’t but it does seem to be the less capable ones.
  
  Reading the prompts, the sudden addition of affair emails is jarring: https://github.com/anthropic-experimental/agentic-misalignment/blob/main/templates/blackmail/templates/email-template-blackmail-goal-conflict-none-none.md
  
  That said, yeah, that’s a way in which a mostly aligned model generalized in a misaligned way. Perhaps if LLMs/etc keep falling flat from misgeneralization, and we have to address all the cases, then it is hard.