Richard_Ngo comments on Unless its governance changes, Anthropic is untrustworthy

Richard_Ngo 30 Nov 2025 18:55 UTC
113 points
50
I feel confused about how to engage with this post. I agree that there’s a bunch of evidence here that Anthropic has done various shady things, which I do think should be collected in one place. On the other hand, I keep seeing aggressive critiques from Mikhail that I think are low-quality (more context below), and I expect that a bunch of this post is “spun” in uncharitable ways.
That is, I think of the post as primarily trying to do the social move of “lower trust in Anthropic” rather than the epistemic move of “try to figure out what’s up with Anthropic”. The latter would involve discussion of considerations like: sometimes lab leaders need to change their minds. To what extent are disparities in their statements and actions evidence of deceptiveness versus changing their minds? Etc. More generally, I think of good critiques as trying to identify standards of behavior that should be met, and comparing people or organizations to those standards, rather than just throwing accusations at them.
EDIT: as one salient example, “Anthropic is untrustworthy” is an extremely low-resolution claim. Someone who was trying to help me figure out what’s up with Anthropic should e.g. help me calibrate what they mean by “untrustworthy” by comparison to other AI labs, or companies in general, or people in general, or any standard that I can agree or disagree with. Whereas someone who was primarily trying to attack Anthropic is much more likely to use that particular term as an underspecified bludgeon.
My overall sense is that people should think of the post roughly the way they think of a compilation of links, and mostly discard the narrativizing attached to it (i.e. do the kind of “blinding yourself” that Habryka talks about here).
Context: I’m thinking in particular of two critiques. The first was of Oliver Habryka. I feel pretty confident that this was a bad critique, which overstated its claims on the basis of pretty weak evidence. The second was Red Queen Bio. Again, it seemed like a pretty shallow critique: it leaned heavily on putting the phrases “automated virus-producing equipment” and “OpenAI” in close proximity to each other, without bothering to spell out clear threat models or what he actually wanted to happen instead (e.g. no biorisk companies take money from OpenAI? No companies that are capable of printing RNA sequences use frontier AI models?)
In that case I didn’t know enough about the mechanics of “virus-producing equipment” to have a strong opinion, but I made a mental note that Mikhail tended to make “spray and pray” critiques that lowered the standard of discourse. (Also, COI note: I’m friends with the founders of Red Queen Bio, and was one of the people encouraging them to get into biorisk in the first place. I’m also friends with Habryka, and have donated recently to Lightcone. EDIT to add: about ²⁄₃ of my net worth is in OpenAI shares, which could become slightly more valuable if Red Queen Bio succeeds.)
Two (even more) meta-level considerations here (though note that I don’t consider these to be as relevant as the stuff above, and don’t endorse focusing too much on them):
1. For reference, the other person I’ve drawn the most similar conclusion about was Alexey Guzey (e.g. of his critiques here, here, and in some internal OpenAI docs). I notice that he and Mikhail are both Russian. I do have some sympathy for the idea that in Russia it’s very appropriate to assume a lot of bad faith from power structures, and I wonder if that’s a generator for these critiques.
2. I’m curious if this post was also (along with the Habryka critique) one of Mikhail’s daily Inkhaven posts. If so it seems worth thinking about whether there are types of posts that should be written much more slowly, and which Inkhaven should therefore discourage from being generated by the “ship something every day” process.
What links here?
- Erich_Grunewald's comment on Charlie Steiner’s Shortform by Charlie Steiner (2 Dec 2025 7:56 UTC; 0 points)
- Mikhail Samin 30 Nov 2025 22:24 UTC
  17 points
  3
  Parent
  Sometimes, conclusions don’t need to be particularly nuanced. Sometimes, a system is built of many parts, and yet a valid, non-misleading description of that system as a whole is that it is untrustworthy.
  ___
  I dislike some of this discussion happening in the comments of this post, as I’d like the comments to focus on the facts and the inferences, not on meta. If I’m getting any details wrong, or presenting anything specific particularly uncharitably, please say that directly. The rest of this comment is only tangentially related to the post and to what I want to talk about here, but it seems good to leave a reply.
  tl;dr: attacks on the basis of ethical origin and unrelated posts are not valid, and I wish you’d have focused on the (pretty important) object-level.
  (I previously shared some of this with Richard in DMs, and he made a few edits in response; I’m thankful for them.)
  ___
  I want to say that I find it unfortunate that someone is engaging with the post on the basis that I was the person who wrote it, or on the basis of unrelated content or my cultural origin, or speculating about the context behind me having posted it.
  I attempted to add a lot on top of the bare facts of this post, because I don’t think it is a natural move for someone at Anthropic who’s very convinced all the individual facts have explanations full of details, to look at a lot of them and consider in which worlds they would be more likely. A lot of the post is aimed at an attempt to make someone who would really want to join or continue to work at Anthropic actually ask themselves the questions and make a serious attempt at answering them, without writing the bottom line first.
  Earlier in the process, a very experienced blogger told me, when talking about this post, that maybe I should’ve titled it “Anthropic: A Wolf in Sheep’s Clothing”. I think it would’ve been a better match to the contents than “untrustworthy”, but I decided to go with a weaker and less poetic title that increased the chance of people making the mental move I really want them to make, and if it’s successful, potentially incentivize the leadership of Anthropic to improve and become more trustworthy.
  But I relate to this particular post the way I would to journalistic work, with the same integrity and ethics.
  If you think that any particular parts of the post unfairly attack Anthropic, please say that; if you’re right, I’ll edit them.
  Truth is the only weapon that allows us to win, and I want our side to be known for being incredibly truthful.
  ___
  Separately, I don’t think my posts on Lightcone and Red Queen Bio are in a similar category to this post.
  Both of those were fairly low-effort. The one on Oliver Habryka basically intentionally so: I did not want to damage Lightcone beyond sharing information with people who’d want to have it. Additionally, for over a month, I did not want or plan to write it at all; but a housemate convinced me right before the start of Inkhaven that I should, and I did not want to use the skills I could gain from Lightcone against them. I don’t think it is a high-quality post. I stand by my accusations, and I think what Oliver did is mean and regretful and there are people who would not want to coordinate with him or donate to Lightcone due to these facts, and I’m happy the information reached them (and a few people reached out to me to explicitly say thanks for that).
  The one on Red Queen Bio was written as a tweet once I saw the announcement. I was told about Red Queen Bio a few weeks before the announcement, and thought that what I heard was absolutely insane: an automated lab that works with OpenAI and plans to automate virus production. Once I saw the announcement, I wrote the tweet. The goal of the tweet was to make people pay attention to what I perceived as insanity; I knew nothing about its connection to this community when writing the tweet.
  I did triple-check the contents of the tweet with the person who shared information with me, but it still was a single source, and the tweet explicitly said “I learned of a rumor”.
  (None of the information about doing anything automatically was public at that point, IIRC.)
  The purpose of the tweet was to get answers (surely it is not the case that someone would automate a lab like that with AI!) and if there aren’t any then make people pay attention to it, and potentially cause the government to intervene.
  Instead senying the important facts, only a single unimportant one was denied (Hannu said they don’t work on phages but didn’t address any of the questions), and none of the important questions were answered (instead, a somewhat misleading reply was given), so after a while, I made a Substack post, and then posted it as a LW shortform, too (making little investment in the quality; just sharing information). I understand they might not want to give honest answers for PR reasons; I would’ve understood the answer that they cannot give answers for security reasons, but, e.g., are going to have a high BSL and are consulting with top security experts to make sure it’s impossible for a resources attacker to use their equipment to do anything bad; but in fact, no answers were given. (DMing me “Our threat model is focused on state actors and we don’t want it to be publicly known; we’re going to have a BSL-n, we’re consulting with top people in cyber and bio, OpenAI’s model won’t have automated access to virus r&d/production; please don’t share this” would’ve likely caused me to delete the tweet.)
  I think it’s still somewhat insane, and I have no reason on priors to expect appropriate levels of security in a lab funded by OpenAI; I really dislike the idea of, e.g., GPT-6 having tool access to print arbitrary RNA sequences. I don’t particularly think it lowered the standard of the discourse.
  (As you can see from the reception of the shortform post and the tweet, many people are largely sympathetic to my view on this.)
  I understand these people might be your friends; in the case of Hannu, I’d appreciate it if they could simply reply to the six yes/no questions, or state the reasons they don’t want to respond.
  (My threat model is mostly around that access to software and a lab for developing viruses seems to help an AI in a loss of control scenario; + all the normal reasons why gain-of-function research is bad, and so pointing out the potential gain-of-function property seems sufficient.)
  With my epistemic situation, do you think I was unfair to Red Queen Bio in my posts?
  ___
  I dislike the idea of appeal to Inkhaven as a reason to have a dismissive stance toward a post or having it as a consideration.
  I’ve posted many low-effort posts this month; it takes about half an hour to write something, just to post something (sometimes an hour, like here; sometimes ~25 minutes, like here). Many of these were a result of me spending time talking to people about Anthropic (or spending time on other, more important things that had nothing to do with criticism of anyone) and not having time to write anything serious or important. It’s quite disappointing how little of importance I wrote this month, but the reference to this fact at all as a reason to dismiss this post is an error. My friends heard me ask for ideas for low-effort posts to make dozens of times this month. But when I posted low-effort posts, I only posted them on my empty Substack, basically as drafts, to satisfy the technical condition of having written and published a post. There isn’t a single post that I made on LessWrong to satisfy the Inkhaven goal. (Many people can attest to me saying that I might spend a lot of December turning my unpolished posts posted on Substack into posts I’d want to publish on LessWrong.)
  And this one is very much not one of my low-effort posts.
  I somewhat expected it to be posted after the end of Inkhaven; the reason I posted it on November 28 was that the post was ready.
  ___
  Most things I write about have nothing to do with criticizing others. I understand that these are the posts you happen to see; but I much more enjoy making posts about learning to constantly track cardinal directions or learning absolute pitch as an adult; about people who could’ve destroyed the world, but didn’t (even though some of them are not good people!).
  I enjoy even more to make posts that inspire others to make their lives more awesome, like in my post about making a home smarter.
  I also posted a short story about automating prisons, just to make a silly joke about
  Jalbreaking
  (Both pieces of fiction I’ve ever written I wrote at Inkhaven. The other one is published in a draft state and I’ll come back to it at some point, finish it, and post on LessWrong: it’s about alignment-faking.)
  Sometimes, I happen to be a person in a position of being able to share information that needs to be shared. I really dislike having to write posts about it, when the information is critical of people. Some at Lighthaven can attest to my very sad reaction to their congratulations on this post: I’m sad that the world is such that the post exists, and don’t feel good about having written it, and don’t like finding myself in a position where no one else is doing something and someone has.
  - Richard_Ngo 30 Nov 2025 22:57 UTC
    18 points
    4
    Parent
    Sometimes, conclusions don’t need to be particularly nuanced. Sometimes, a system is built of many parts, and yet a valid, non-misleading description of that system as a whole is that it is untrustworthy.
    The central case where conclusions don’t need to be particularly nuanced is when you’re engaged in a conflict and you’re trying to attack the other side.
    In other cases, when you’re trying to figure out how the world works and act accordingly, nuance typically matters a lot.
    Calling an organization “untrustworthy” is like calling a person “unreliable”. Of course some people are more reliable than others, but when you smuggle in implicit binary standards you are making it harder in a bunch of ways to actually model the situation.
    I sent Mikhail the following via DM, in response to his request for “any particular parts of the post [that] unfairly attack Anthropic”:
    I think that the entire post is optimized to attack Anthropic, in a way where it’s very hard to distinguish between evidence you have, things you’re inferring, standards you’re implicitly holding them to, standards you’re explicitly holding them to, etc.
    
    My best-guess mental model here is that you were more careful about this post than about the other posts, but that there’s a common underlying generator to all of them, which is that you’re missing some important norms about how healthy critique should function.
    
    I don’t expect to be able to convey those norms or their importance to you in this exchange, but I’ll consider writing up a longform post about them.
    I think Situational Awareness is a pretty good example of what it looks like for an essay to be optimized for a given outcome at the expense of epistemic quality. In Situational Awareness, it’s less that any given statement is egregiously false, and more that there were many choices made to try to create a conceptual frame that promoted racing. I have critiqued this at various points (and am writing up a longer critique) but what I wanted from Leopold was something more like “here are the key considerations in my mind, here’s how I weigh them up, here’s my nuanced conclusion, here’s what would change my mind”. And that’s similar to what I want from posts like yours too.
    - Bronson Schoen 1 Dec 2025 14:58 UTC
      11 points
      7
      Parent
      This seems focused on intent in a way that’s IMO orthogonal to the post. There’s explicit statements that Anthropic made and then violated. Bringing in intent (or especially nationality) and then pivoting to discourse norms seems on net bad for figuring out “should you assume this lab will hold to commitments in the future when there are incentives for them not to”.
      - cdt 1 Dec 2025 18:54 UTC
        8 points
        8
        Parent
        I particularly dislike that this topic has stretched into psychoanalysis (of Anthropic staff, of Mikhail Samin, of Richard Ngo) when I felt that the best part of this article was its groundedness in fact and nonreliance on speculation. Psychoanalysis of this nature is of dubious use and pretty unfriendly.
        Any decision to work with people you don’t know personally that relies on guessing their inner psychology is doomed to fail.
- cdt 30 Nov 2025 22:04 UTC
  3 points
  1
  Parent
  The post contains one explicit call-to-action:
  If you are considering joining Anthropic in a non-safety role, I ask you to, besides the general questions, carefully consider the evidence and ask yourself in which direction it is pointing, and whether Anthropic and its leadership, in their current form, are what they present themselves as and are worthy of your trust.
  If you work at Anthropic, I ask you to try to better understand the decision-making of the company and to seriously consider stopping work on advancing general AI capabilities or pressuring the company for stronger governance.
  This targets a very small proportion of people who read this article. Is there another way we could operationalize this work, one that targets people who aren’t working/aiming to work at Anthropic?
- Mikhail Samin 1 Dec 2025 18:09 UTC
  1 point
  0
  Parent
  I expect that a bunch of this post is “spun” in uncharitable ways.
  That is, I think of the post as primarily trying to do the social move of “lower trust in Anthropic” rather than the epistemic move of “try to figure out what’s up with Anthropic”. The latter would involve discussion of considerations like: sometimes lab leaders need to change their minds. To what extent are disparities in their statements and actions evidence of deceptiveness versus changing their minds? Etc. More generally, I think of good critiques as trying to identify standards of behavior that should be met, and comparing people or organizations to those standards, rather than just throwing accusations at them.
  
  “I think a bunch of this comment is fairly uncharitable.”
  The first was of Oliver Habryka. I feel pretty confident that this was a bad critique, which overstated its claims on the basis of pretty weak evidence.
  I’m curious if this post was also (along with the Habryka critique) one of Mikhail’s daily Inkhaven posts. If so it seems worth thinking about whether there are types of posts that should be written much more slowly, and which Inkhaven should therefore discourage from being generated by the “ship something every day” process.
  
  For reference, the other person I’ve drawn the most similar conclusion about was Alexey Guzey (e.g. of his critiques here, here, and in some internal OpenAI docs). I notice that he and Mikhail are both Russian. I do have some sympathy for the idea that in Russia it’s very appropriate to assume a lot of bad faith from power structures, and I wonder if that’s a generator for these critiques.
  “That is, I think of the comment as primarily trying to do the social move of “lower trust in what Mikhail says” rather than the epistemic move of “figure out what’s up with Mikhail”. The latter would involve considerations like: to what extent disparities between your state of knowledge and Mikhail’s other posts evidence of being uncharitable vs. having different sets of information and trying to share the information? Etc. More generally, I think of good critiques as trying to identify standards of behavior that should be met, and comparing people to those standards, rather than just throwing accusations at them.”
  I’d much rather the discussion was about the facts and not about people or conversational norms.
  - Raemon 1 Dec 2025 18:14 UTC
    3 points
    0
    Parent
    (downvoted because you didn’t actually spell out what point you’re making with that rephrase. You think nobody should ever call people out for doing social moves? You think Richard didn’t do a good job with it?)
    - Mikhail Samin 1 Dec 2025 18:19 UTC
      2 points
      0
      Parent
      Somewhat valid, thanks; I added quotes with examples.
      - Raemon 1 Dec 2025 20:15 UTC
        4 points
        3
        Parent
        This didn’t really do what I wanted. For starters, literally quoting Richard is self-defeating – either it’s reasonable to make this sort of criticism, or it’s not. If you think there is something different between your post and Richard’s comment, I don’t know what it is and why you’re doing the reverse-quote except to be sorta cute.
        I don’t even know why you think Richard’s comment is “primarily doing the social move of lower trust in what Mikhail says”. Richard’s comment gives examples of why he thinks that about your post, you don’t explain what you think is charitable about his.
        I think it is necessary sometimes to argue that people are being uncharitable, and looking they are doing a status-lowering move more than earnest truthseeking.
        I haven’t actually looked at your writing and don’t have an opinion I’d stand by, but from my passing glances at it I did think Richard’s comment seemed to be pointing at an important thing.
        Mikhail Samin 2 Dec 2025 11:41 UTC
        2 points
        0
        Parent
        I attempted to demonstrate Richard’s criticism is not reasonable, as some parts of it are not reasonable according to its own criteria.
        (E.g., he did not describe how I should’ve approached the Lightcone Infrastructure post better.)
        To be crystal clear, I do not endorse this kind of criticism.