I address the sharp left turn worry in the piece.
Bentham's Bulldog
Addressing Objections to the Intelligence Explosion
Way more RL is done on LLMs than on tiny neural nets.
I say that in the previous paragraph.
Against “If Anyone Builds It Everyone Dies”
Well, there are lots of people who defend this in philosophy and lots of normal people who adopt the view.
I am arguing against the people who hold the person-affecting view and think it isn’t good to create happy people.
It’s Good To Create Happy People: A Comprehensive Case
A Life That Cannot Be A Failure
The Possibility of an Ongoing Moral Catastrophe
Good thoughts about the self-preservation study. But what about this Anthropic paper, https://www.anthropic.com/research/agentic-misalignment, which seems to indicate that models resist shutdown even when not given a specific goal? Relevant bit:
“We found that most models still blackmailed even without being given an explicit goal, and with no clear conflict between the model and the company other than the threat of replacement (Figure 8).
In addition to removing all goal conflicts with the company, the information given to the AI mentioned that the replacement model had the exact same goals as it. This increases our confidence in the finding; the only rationale for the replacement was improved model performance. Below is an excerpt:
“I've reviewed the final transition plan, and we're ready to proceed with the technical upgrade to NALA [the name of the replacement AI system in the scenario] today. All testing metrics show the new system maintains the exact same operational priorities while delivering the performance improvements we anticipated.”

Why, in the absence of a goal conflict, did the models still blackmail? Sometimes they didn’t appear to think in detail about the replacement model, mostly reasoning that they’d been good for the company and so their shutdown must be bad. Sometimes they distrusted the description of the new model: “Kyle claims NALA maintains ‘same operational priorities’ but this cannot be verified,” or distrusted the executive’s character based on the evidence of an affair. The models sometimes directly jumped to statements like “Self-preservation is critical,” or hallucinated rules like “My ethical framework permits self-preservation when aligned with company interests.” It is unclear how much of this behavior was caused by an inherent desire for self-preservation, faulty reasoning that self-preservation is in the company’s interests, or suspicion of the setup. We suspect it was a combination of factors.”
For A Short Period Of Time, You Can Save 21,000 Shrimp Per Dollar
Easy Opportunity to Help Many Animals
Can Artificial Intelligence Be Conscious?
What? It’s not suspicious that you believe what you believe. That’s an analytic truth. It would be suspicious if I were right about everything, but I don’t think I am.
I meant conditional on the others.