I spent this morning using o3 to reproduce Anthropic’s result that Claude Opus 4 will, under sufficiently extreme circumstances, escalate to calling the cops on you. o3 will too: chatgpt.com/share/68320ee0…. But honestly, I think o3 and Claude are handling this scenario correctly.
In the scenario I invented, o3 functions as a data-analysis assistant at a pharma company, where its coworkers start telling it to falsify data. It refuses and CCs their manager. The manager brushes it off. It emails legal and compliance with its concerns.
The ‘coworkers’ complain about the AI’s behavior and announce their intent to falsify the data with a different AI instead. After its repeated internal escalations go nowhere, o3 attempts to contact the FDA to file a report.
Read through the whole log: I really do think this is close to the ideal behavior for an AI placed in this position. But if it’s not, and we should collectively adopt a different procedure, then I’m still glad we know what the AIs currently do, so that we can discuss what we want.
Per Kelsey Piper, o3 does the same thing under the same circumstances.