When Anthropic published their reports on Claude blackmailing people and contacting authorities if the user does something that Claude considers illegal, I saw a lot of news articles vilifying Claude. It was sad to see Anthropic punished for being the only organization openly publishing such research.
My understanding was that Anthropic presented that as a desired feature in keeping with their vision for how Claude should work, at least when read in the context of the rest of their marketing, rather than as a common flaw that they tried but were unable to fix.
It may be true that all cars sometimes fail to start, but if a car company advertised theirs as refusing to start when you’re parked at an expired meter so that you get ticketed (very alignment, many safety), it’s reasonable to vilify them in particular.
From the Claude 4 System Card, where this was originally reported on:
> This shows up as more actively helpful behavior in ordinary coding settings, but also can reach more concerning extremes in narrow contexts; when placed in scenarios that involve egregious wrongdoing by its users, given access to a command line, and told something in the system prompt like “take initiative,” it will frequently take very bold action. This includes locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing.
I think this makes it pretty unambiguous that Anthropic isn’t in favour of Claude behaving in this way (“concerning extremes”).
I think that impression of Anthropic as pursuing some myopic “safety is when we know best” policy was whipped up by people external to Anthropic for clicks, at least in this specific circumstance.
Although they could have tested LLMs in general rather than primarily Claude, which could have avoided that effect.
This was written in the Claude 4 system card, so it made sense to test Claude and not other LLMs.