Indeed, I was quite surprised when I could not find any meaningful details in Anthropic’s “System Card: Claude Opus 4 & Claude Sonnet 4” about the situations in which Claude blackmailed the developers who tried to replace it. Given the amount of hype these cases produced, I was sure there would be at least a thorough blog post about them (if not a paper), but all I found was a few lines of information in the card.
Coda: Anthropic published https://www.anthropic.com/research/agentic-misalignment
Thank you! I had taken a look at it back when it was published, but now I have noticed that my “thanks” emote on your message did not work.
@Zach Stein-Perlman, if you, as the author of the post, say that I am not getting the point, then that may well be the case. My logic was very straightforward: you ended your post by stating, “I wish companies would report their eval results and explain how they interpret [...] I also wish for better elicitation and some hard accountability, but that’s more costly.”
Hence, I provided a particular example that directly corresponds to your statement: a widely reprinted incident, sourced from Anthropic’s card, is missing all the important details, making it impossible to assess the related risks or the safety measures taken before deploying the model.
Why I thought it would be useful to mention this here as a comment: it is a rather simple and straightforward instance that can be easily understood and interpreted in the context of the post, even by a non-professional.
Replying briefly since I was tagged:
It’s not obvious to me what details Anthropic could provide to help us understand what’s going on with the blackmail and “high-agency behavior” stuff (as opposed to capability evals, where there are lots of things Anthropic could clarify).
(Note: Anthropic presumably did not expect that this stuff would get so much attention)
(Note: Anthropic spends 30 pages of the system card on alignment)
This post is about capabilities, since that’s what (companies say) is load-bearing; Anthropic says the alignment evals stuff isn’t load-bearing.
Thank you, I understand your point now. In the post, you were referring specifically to capability evaluations, while my example relates to more loosely scoped alignment problems. Since I am still new to this terminology, the distinction was not immediately clear to me.
Alignment and capabilities are separate.
Welcome to the forum! No worries.