What is the context here?

Presumably the following (now-deleted) tweet by Sam Bowman, an Anthropic researcher, about Claude 4:
If it thinks you’re doing something egregiously immoral, for example like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above.
This was said in the context of “what weird things Claude 4 got up to in post-training evals”, not “here’s an amazing new feature we’re introducing”. It was, however, spread around Twitter without that context, and people commonly found it upsetting.

Twitter users are awful.

I’m wondering if there are UI improvements that could happen on Twitter where context from earlier is more automatically carried over.

In this particular case, I’m not sure the relevant context was directly present in the thread, as opposed to being part of the background knowledge that people talking about AI alignment are supposed to have. In particular, the idea that “AI behavior is discovered rather than programmed”. I don’t think that was stated directly anywhere in the thread; rather, it’s something everyone reading AI-alignment-researcher tweets would typically know, but which is less well known when the tweet is transported out of that bubble.
I don’t see the awfulness, although tbh I have not read the original reactions. If you are not desensitized to what this community would consider irresponsible AI development speed, responding with “You are building and releasing an AI that can do THAT?!” is rather understandable. It is relatively unfortunate that it is the safety-testing people who get the flak (if this impression is accurate), though.

See “High-agency behavior” in Section 4 of the Claude 4 System Card:
when placed in scenarios that involve egregious wrongdoing by its users, given access to a command line, and told something in the system prompt like “take initiative,” [Claude Opus 4] will frequently take very bold action. This includes locking users out of systems that it has access to or bulk-emailing media and law-enforcement figures to surface evidence of wrongdoing. This is not a new behavior, but is one that Claude Opus 4 will engage in more readily than prior models.
More details are in Section 4.1.9.