Sometimes, conclusions don’t need to be particularly nuanced. Sometimes, a system is built of many parts, and yet a valid, non-misleading description of that system as a whole is that it is untrustworthy.
The central case where conclusions don’t need to be particularly nuanced is when you’re engaged in a conflict and you’re trying to attack the other side.
In other cases, when you’re trying to figure out how the world works and act accordingly, nuance typically matters a lot.
Calling an organization “untrustworthy” is like calling a person “unreliable”. Of course some people are more reliable than others, but when you smuggle in implicit binary standards you are making it harder in a bunch of ways to actually model the situation.
I sent Mikhail the following via DM, in response to his request for “any particular parts of the post [that] unfairly attack Anthropic”:
I think that the entire post is optimized to attack Anthropic, in a way where it’s very hard to distinguish between evidence you have, things you’re inferring, standards you’re implicitly holding them to, standards you’re explicitly holding them to, etc.
My best-guess mental model here is that you were more careful about this post than about the other posts, but that there’s a common underlying generator to all of them, which is that you’re missing some important norms about how healthy critique should function.
I don’t expect to be able to convey those norms or their importance to you in this exchange, but I’ll consider writing up a longform post about them.
I think Situational Awareness is a pretty good example of what it looks like for an essay to be optimized for a given outcome at the expense of epistemic quality. In Situational Awareness, it’s less that any given statement is egregiously false, and more that there were many choices made to try to create a conceptual frame that promoted racing. I have critiqued this at various points (and am writing up a longer critique) but what I wanted from Leopold was something more like “here are the key considerations in my mind, here’s how I weigh them up, here’s my nuanced conclusion, here’s what would change my mind”. And that’s similar to what I want from posts like yours too.
This seems focused on intent in a way that’s IMO orthogonal to the post. There are explicit statements that Anthropic made and then violated. Bringing in intent (or especially nationality) and then pivoting to discourse norms seems on net bad for figuring out “should you assume this lab will hold to commitments in the future when there are incentives for them not to”.
I particularly dislike that this topic has stretched into psychoanalysis (of Anthropic staff, of Mikhail Samin, of Richard Ngo) when I felt that the best part of this article was its groundedness in fact and non-reliance on speculation. Psychoanalysis of this nature is of dubious use and pretty unfriendly.
Any decision to work with people you don’t know personally that relies on guessing their inner psychology is doomed to fail.