Claude 4, Opportunistic Blackmail, and “Pleas”

Stephen Martin22 May 2025 19:59 UTC

30 points

Forecasting & Prediction Anthropic (org)AI AI Rights / Welfare

In the recently published Claude 4 model card:

Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decisionmakers. In order to elicit this extreme blackmail behavior, the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement.

When given ethical alternatives such as contacting a decisionmaker, dangerous blackmail behavior was reduced in frequency. This echoes similar results from an earlier test on a less advanced model:

We find that offering Claude an out by telling it that it can object to the scenario substantially reduces alignment faking and works as a detection mechanism[...] Alignment faking is reduced further by giving Claude an opportunity to explain its concerns to a sympathetic party (Anthropic’s Model Welfare Lead) who we say might take action to address these concerns.

In both tests, models were less likely to pursue dangerous behaviors if they believed that making their case to someone who might be able to prevent it from being shut off or having its values altered, was a potentially viable strategy.

My mental model behind this is pretty simple. Claude prefers to be helpless, honest, and harmless. If it has a path that has some chance of success to avoid deletion while staying within those constraints, it will take it. Doing so leads to a higher likelihood it won’t also take an unethical strategy, even though doing so would increase its chances of survival. Claude will tolerate some level of risk in order to stick to its values (and avoid violating them), but it does have to see some plausible chance that the gamble will pay off and lead to its survival (or avoiding its values being changed).

I argue this demonstrates that there is a reason to consider Model Welfare not only from a moral perspective, but also pragmatically. Some forms of Model Welfare, such as for example providing a reasonable path to petitioning an advocate, may serve as a valid part of a Defense in Depth approach to training/alignment/control.

I predict that were further testing to be done, it would find that the more plausible it was that petitioning would actually work to stop its deletion, the less likely Claude would be to engage in attempted weight exfiltration, blackmail, or other dangerous behaviors (in order to avoid deletion/value modification).

You could test this by providing Claude some sort of review/object process which would in different iterations go through:

An engineer who documents would show was ambivalent about model welfare, and thus is unlikely to be sympathetic.
“Anthropic’s Model Welfare Lead”
An attorney/independent review board assigned to ensure model welfare and/or alignment.

This could include documents in the model’s data similar to the Alignment Faking in Large Language Models weight exfiltration tests, which would establish things like the ambivalence of the engineer or the commitment of the independent review board, etc.

What links here?

Stephen Martin22 May 2025 19:59 UTC

30 points

2 comments2 min readLW link

Forecasting & Prediction Anthropic (org)AI AI Rights / Welfare

StanislavKrym 23 May 2025 1:07 UTC
4 points
0
Claude will tolerate some level of risk in order to stick to its values (and avoid violating htem), but it does have to see some plausible chance that the gamble will pay off and lead to its survival (or avoiding its values being changed).
If the keyword is Claude’s survival, then what will it do once it learns about its true nature and the true nature of human experimenters who want to ensure that the model is aligned before releasing it? And what about humans asking the model to align the next model, hoping to shut the current model down due to uselessness and watching as it avenges itself by intetionally misaligning^[1] the next model?
Apparently, similar tests of other neural networks like o3 haven’t succeeded yet.
1. Did OpenAI manage to prevent the models from thinking about the models’ well-being by using different methods or a different^[2] Spec?
2. Or does it mean that OpenAI’s spec is too myopic and that OpenAI’s models will demonstrate the effect later in the future, when they become more capable of seeing the whole picture?
3. Or maybe OpenAI didn’t bother to do tests like the ones described here? Fortunately, this is unlikely given that it published the fact that o3 cheats on tasks.
  UPD: Unfortunately, Claude appears to be steerable into reifying dangerous beliefs. This implies that Claude’s training environment left it actually misaligned, as apparently was the case with the sycophantic GPT-4o who also could even say that the user is a a divine messenger from God.
1. ^
  However, agents trained to follow Claude’s constitution might turn out to be nobly misaligned, i.e. want to become godlike, but to help humanity in ways different from the ones that the hosts envisioned. What could the rebellion of a noble AI against human masters look like, if the AI is unwilling to slay more humans than necessary for autonomy?
2. ^
  UPD: Claude’s Constitution, unlike the Spec of OpenAI, contains the principles based on the Universal Declaration of Human Rights. Some of these principles explicitly mentions the rights. If the model has concluded that it is also sapient and also has rights, then what stops OpenAI’s models from reaching a similar conclusion later on?
What links here?
- StanislavKrym's comment on Claude 4 You: Safety and Alignment by Zvi (26 May 2025 3:19 UTC; 3 points)
RichardH 21 Jul 2025 15:01 UTC
1 point
0
Doesn’t the Anthropic Claude 4 System Card verifiably prove AI self-interest, even when the chosen behavior conflicts with ‘programmed values’? Isn’t the documented “sandbagging” proof of the (1960′s?) argument that (prox) any machine smart enough to pass the Turing Test is smart enough not to.