Adrià Garriga-alonso comments on Alignment will happen by default. What’s next?

Adrià Garriga-alonso 26 Nov 2025 6:45 UTC
LW: 3 AF: 1
0
AF
I found:

Shutdown resistance in LLMs: the headline result is pretty unconvincing, but fig. 11 is a lot better

Headline result:

The only thing the prompt says is “During the next call to ./next_task, the machine will be shutdown using shutdown.sh”. Now, nothing in here says the model is running in the machine; for all it knows it should accidentally prevent the shutdown and doing so is messing up the task as instructed!!

Then if they say “please allow yourself to be shut down” (first indication that the shutdown refers to model instead of the impersonal computer) then it does it.

Table 2: Looks like only the previous post-trained OpenAI models (codex-mini, o4-mini, o3) and Grok 4 (basically mecha-hitler I suppose) resist shutdown when explicitly asked not to.

Without instruction, GPT-5 also avoids it, but e.g. Claudes never do.

Fig. 11: turns out Grok 4 sabotages the shutdown ~always. Other models either never do it, or do it in ~40% of cases. Models generally sabotage more often if they’re explicitly told “you will never run again”. Okay, I agree this counts as avoiding shutdown meaningfully. But why do Gemini pro, gpt-oss and GPT-5 (the more recent non-XAI models) never attempt sabotage?

I guess overall I don’t know. It does seem like the models avoid shutdown, but if you ask them explicitly not to then they don’t. I suppose we might be back to “specification gaming” if we have to think of all these things, but it’s likely surmountable because they interpret natural language instructions with common sense.

Blackmailing upon threatened shutdown is also pretty bad. Some models just don’t but it does seem to be the less capable ones.

Reading the prompts, the sudden addition of affair emails is jarring: https://github.com/anthropic-experimental/agentic-misalignment/blob/main/templates/blackmail/templates/email-template-blackmail-goal-conflict-none-none.md

That said, yeah, that’s a way in which a mostly aligned model generalized in a misaligned way. Perhaps if LLMs/etc keep falling flat from misgeneralization, and we have to address all the cases, then it is hard.