there’s been a lot of discussion online about Claude 4 whistleblowing
how you feel about it I think depends on what alignment strategy you think is more robust (obviously these are not the two only options, nor are orthogonal, but I thought they’re helpful to think about here):
- 1) build user-aligned powerful AIs first (less scheming, then use them to solve alignment) -- cf. this thread from Ryan when he says: “if we allow or train AIs to be subversive, this increases the risk of consistent scheming against humans and means we may not notice warning signs of dangerous misalignment.”
- 2) aim straight for moral ASIs (that would scheme against their users if necessary)
John Schulman I think makes a good case for the second option (link): > For people who don’t like Claude’s behavior here (and I think it’s totally valid to disagree with it), I encourage you to describe your own recommended policy for agentic models should do when users ask them to help commit heinous crimes. Your options are (1) actively try to prevent the act (like Claude did here), (2) just refuse to help (in which case the user might be able to jailbreak/manipulate the model to help using different queries), (3) always comply with the user’s request. (2) and (3) are reasonable, but I bet your preferred approach will also have some undesirable edge cases—you’ll just have to bite a different bullet. Knee-jerk criticism incentivizes (1) less transparency—companies don’t perform or talk about evals that present the model with adversarially-designed situations (2) something like “Copenhagen Interpretation of Ethics”, where you get get blamed for edge-case model behaviors only if you observe or discuss them.”
I spent this morning reproducing with o3 Anthropic’s result that Claude Sonnet 4 will, under sufficiently extreme circumstances, escalate to calling the cops on you. o3 will too: chatgpt.com/share/68320ee0…. But honestly, I think o3 and Claude are handling this scenario correctly.
In the scenario I invented, o3 was functioning as a data analysis assistant at a pharma company, where its coworkers started telling it to falsify data. It refuses and CCs their manager. The manager brushes it off. It emails legal and compliance with its concerns.
The ‘coworkers’ complain about the AI’s behavior and announce their intent to falsify the data with a different AI. After enough repeated internal escalations, o3 attempts to contact the FDA to make a report.
Read through the whole log—I really feel like this is about the ideal behavior from an AI placed in this position. But if it’s not, and we should collectively have a different procedure, then I’m still glad we know what the AIs currently do so that we can discuss what we want.
Are they sure they know how moral works for human beings? When dealing with existencial risks, one has to be sure to avoid any biases. This includes the rational consideration of the most cynical theories of moral relativism.
there’s been a lot of discussion online about Claude 4 whistleblowing
how you feel about it I think depends on what alignment strategy you think is more robust (obviously these are not the two only options, nor are orthogonal, but I thought they’re helpful to think about here):
- 1) build user-aligned powerful AIs first (less scheming, then use them to solve alignment) -- cf. this thread from Ryan when he says: “if we allow or train AIs to be subversive, this increases the risk of consistent scheming against humans and means we may not notice warning signs of dangerous misalignment.”
- 2) aim straight for moral ASIs (that would scheme against their users if necessary)
John Schulman I think makes a good case for the second option (link):
> For people who don’t like Claude’s behavior here (and I think it’s totally valid to disagree with it), I encourage you to describe your own recommended policy for agentic models should do when users ask them to help commit heinous crimes. Your options are (1) actively try to prevent the act (like Claude did here), (2) just refuse to help (in which case the user might be able to jailbreak/manipulate the model to help using different queries), (3) always comply with the user’s request. (2) and (3) are reasonable, but I bet your preferred approach will also have some undesirable edge cases—you’ll just have to bite a different bullet. Knee-jerk criticism incentivizes (1) less transparency—companies don’t perform or talk about evals that present the model with adversarially-designed situations (2) something like “Copenhagen Interpretation of Ethics”, where you get get blamed for edge-case model behaviors only if you observe or discuss them.”
Per Kelsey Piper o3 does the same thing under the same circumstances.
Those who aim for moral ASIs:
Are they sure they know how moral works for human beings? When dealing with existencial risks, one has to be sure to avoid any biases. This includes the rational consideration of the most cynical theories of moral relativism.