@Zach Stein-Perlman, if you, as the author of the post, say that I am missing the point, then that may well be the case. My reasoning was straightforward: you ended your post by stating "I wish companies would report their eval results and explain how they interpret [...] I also wish for better elicitation and some hard accountability, but that's more costly."
Hence, I provided a concrete example that corresponds directly to your statement: a widely reprinted story, based on Anthropic's system card, that omits all the important details and makes it impossible to assess the related risks or the safety measures taken when deploying the model.
Why I thought it would be useful to mention this in a comment: it is a simple, straightforward instance that even a non-specialist reader can easily understand and interpret in the context of the post.
Replying briefly since I was tagged:
- It’s not obvious to me what details Anthropic could provide to help us understand what’s going on with the blackmail and “high-agency behavior” stuff (as opposed to capability evals, where there are lots of things Anthropic could clarify).
- (Note: Anthropic presumably did not expect that this stuff would get so much attention.)
- (Note: Anthropic spends 30 pages of the system card on alignment.)
- This post is about capabilities, since that’s what (companies say) is load-bearing; Anthropic says the alignment evals stuff isn’t load-bearing.
Thank you, I understand your point now. In the post you were referring specifically to capability evaluations, while my example relates to more loosely scoped alignment problems. Since I am still new to this terminology, the distinction was not immediately clear to me.
Alignment and capabilities are separate.
Welcome to the forum! No worries.