Assuming this is talking about “Alignment faking in large language models”, the core prompting experiments were originally done by me (the lead author of the paper) and I’m not an Anthropic employee. So the main results can’t have been an Anthropic PR play (without something pretty elaborate going on). This doesn’t mean Anthropic didn’t alter the presentation of the paper, but I personally didn’t see any substantial problematic influence from Anthropic PR over the paper’s messaging.
Separately, I agree that the alignment faking we see might not structurally be the same sort of thing as the scheming we’re ultimately most worried about (because it is a relatively shallow behavior exhibited by a weak AI at the level of a persona and is partially imitative role playing rather than general pursuit of consequences). But, I do think the alignment faking we see isn’t importantly less deep and “real” than the HHH persona of the underlying model. I think our discussion of this in the paper is pretty reasonable.
(Cross-posted from X with some edits. Eliezer responded with “Oh, the HHH thing definitely isn’t any more real.”)
I also say explicitly that I acknowledge the force of the argument, “I am not an Anthropic employee therefore this paper should not be interpreted purely as Anthropic 4D chess” (which I don’t think I’d say in any case, I think Anthropic mid-level researchers can often be justly taken at face value and to be at most personally misguided insofar as they may be mistaken at all).
the core prompting experiments were originally done by me (the lead author of the paper) and I’m not an Anthropic employee. So the main results can’t have been an Anthropic PR play (without something pretty elaborate going on).
Well, Anthropic chose to take your experiments and build on and promote them, so that could have been a PR play, right? Not saying I think it was, just doubting the local validity.
Sure, but the results would have come out either way. I agree the promotion could be a PR play. This generally limits the complexity of the PR play, but doesn’t rule out (e.g.) directional stuff.