Indeed, I was quite surprised when I could not find any meaningful details in Anthropic’s “System Card: Claude Opus 4 & Claude Sonnet 4” about the situations in which Claude blackmailed the developers who tried to replace it. Given the amount of hype these cases produced, I was sure there would be at least a thorough blog post about them (if not a paper), but all I found was a few lines of information in the card.
Coda: Anthropic published https://www.anthropic.com/research/agentic-misalignment
Thank you! I had taken a look at it back when it was published, but now I have noticed that my “thanks” emote on your message did not work.
@Zach Stein-Perlman, if you, as the author of the post, say that I am not getting the point, then that may well be the case. My logic was very straightforward: you ended your post by stating, “I wish companies would report their eval results and explain how they interpret [...] I also wish for better elicitation and some hard accountability, but that’s more costly.”
Hence, I provided a particular example that directly corresponds to your statement: a widely reprinted incident, sourced from Anthropic’s card, is missing all the important details, making it impossible to assess the related risks or the safety measures taken before deploying the model.
Why I thought it would be useful to mention this here as a comment: it is a rather simple and straightforward instance that can be easily understood and interpreted in the context of the post, even by a non-professional.
Replying briefly since I was tagged:
It’s not obvious to me what details Anthropic could provide to help us understand what’s going on with the blackmail and “high-agency behavior” stuff (as opposed to capability evals, where there are lots of things Anthropic could clarify).
(Note: Anthropic presumably did not expect that this stuff would get so much attention)
(Note: Anthropic spends 30 pages of the system card on alignment)
This post is about capabilities, since that’s what (companies say) is load-bearing; Anthropic says the alignment evals stuff isn’t load-bearing.
Thank you, I understand your point now. In the post, you were referring specifically to capability evaluations, while my example relates to more loosely scoped alignment problems. Since I am still new to this terminology, the distinction was not immediately clear to me.
Alignment and capabilities are separate.
Welcome to the forum! No worries.