Mikhail Samin comments on Mikhail Samin’s Shortform

Mikhail Samin 9 Jun 2025 23:29 UTC
89 points
39
Nobody at Anthropic can point to a credible technical plan for actually controlling a generally superhuman model. If it’s smarter than you, knows about its situation, and can reason about the people training it, this is a zero-shot regime.
The world, including Anthropic, is acting as if “surely, we’ll figure something out before anything catastrophic happens.”
That is unearned optimism. No other engineering field would accept “I hope we magically pass the hardest test on the first try, with the highest stakes” as an answer. Just imagine if flight or nuclear technology were deployed this way. Now add having no idea what parts the technology is made of. We’ve not developed fundamental science about how any of this works.
As much as I enjoy Claude, it’s ordinary professional ethics in any safety-critical domain: you shouldn’t keep shipping SOTA tech if your own colleagues, including the CEO, put double-digit chances on that tech causing human extinction.
You’re smart enough to know how deep the gap is between current safety methods and the problem ahead. Absent dramatic change, this story doesn’t end well.
In the next few years, the choices of a technical leader in this field could literally determine not what the future looks like, but whether we have a future at all.
If you care about doing the right thing, now is the time to get more honest and serious than the prevailing groupthink wants you to be.
- Buck 10 Jun 2025 1:19 UTC
  61 points
  28
  Parent
  I think it’s accurate to say that most Anthropic employees are abhorrently reckless about risks from AI (though my guess is that this isn’t true of most people who are senior leadership or who work on Alignment Science, and I think that a bigger fraction of staff are thoughtful about these risks at Anthropic than other frontier AI companies). This is mostly because they’re tech people, who are generally pretty irresponsible. I agree that Anthropic sort of acts like “surely we’ll figure something out before anything catastrophic happens”, and this is pretty scary.
  I don’t think that “AI will eventually pose grave risks that we currently don’t know how to avert, and it’s not obvious we’ll ever know how to avert them” immediately implies “it is repugnant to ship SOTA tech”, and I wish you spelled out that argument more.
  I agree that it would be good if Anthropic staff (including those who identify as concerned about AI x-risk) were more honest and serious than the prevailing Anthropic groupthink wants them to be.
- Fabien Roger 10 Jun 2025 9:17 UTC
  26 points
  18
  Parent
  What if someone at Anthropic thinks P(doom|Anthropic builds AGI) is 15% and P(doom|some other company builds AGI) is 30%? Then the obvious alternatives are to do their best to get governments / international agreements to make everyone pause or to make everyone’s AI development safer, but it’s not completely obvious that this is a better strategy because it might not be very tractable. Additionally, they might think these things are more tractable if Anthropic is on the frontier (e.g. because it does political advocacy, AI safety research, and deploys some safety measures in a way competitors might want to imitate to not look comparatively unsafe). And they might think these doom-reducing effects are bigger than the doom-increasing effects of speeding up the race.
  You probably disagree with P(doom|some other company builds AGI) - P(doom|Anthropic builds AGI) and with the effectiveness of Anthropic advocacy/safety research/safety deployments, but I feel like this is a very different discussion from “obviously you should never build something that has a big chance of killing everyone”.
  (I don’t think most people at Anthropic think like that, but I believe at least some of the most influential employees do.)
  Also my understanding is that technology is often built this way during deadly races where at least one side believes that them building it faster is net good despite the risks (e.g. deciding to fire the first nuke despite thinking it might ignite the atmosphere, …).
  - Mikhail Samin 11 Jun 2025 15:00 UTC
    13 points
    16
    Parent
    If this is their belief, they should state it and advocate for the US government to prevent everyone in the world, including them, from building what has a double-digit chance of killing everyone. They’re not doing that.
  - Charbel-Raphaël 11 Jun 2025 11:32 UTC
    11 points
    6
    Parent
    P(doom|Anthropic builds AGI) is 15% and P(doom|some other company builds AGI) is 30% --> You need to add to this the probability that Anthropic is first and that the other companies are not going to create AGI if Anthropic already created it. this is by default not the case
    - Fabien Roger 11 Jun 2025 14:05 UTC
      2 points
      0
      Parent
      I agree, the net impact is definitely not the difference between these numbers.
      Also I meant something more like P(doom|Anthropic builds AGI first).I don’t think people are imagining that the first AI company to achieve AGI will have an AGI monopoly forever. Instead some think it may have a large impact on what this technology is first used for and what expectations/regulations are built around it.
- Zach Stein-Perlman 10 Jun 2025 0:08 UTC
  11 points
  13
  Parent
  It would be easier to argue with you if you proposed a specific alternative to the status quo and argued for it. Maybe “[stop] shipping SOTA tech” is your alternative If so: surely you’re aware of the basic arguments for why Anthropic should make powerful models; maybe you should try to identify cruxes.
  - habryka 10 Jun 2025 3:33 UTC
    50 points
    21
    Parent
    Separately from my other comment: It is not the case that the only appropriate thing to do when someone is going around killing your friends and your family and everyone you know is to “try to identify cruxes”.
    It’s eminently reasonable for people to just try to stop whatever is happening, which includes intention for social censure, convincing others, and coordinating social action. It is not my job to convince Anthropic staff they are doing something wrong. Indeed, the economic incentives point extremely strongly towards Anthropic staff being the hardest to convince of true beliefs here. The standard you invoke here seems pretty crazy to me.
    What links here?
    Knight Lee's comment on Consider chilling out in 2028 by Valentine (23 Jun 2025 6:02 UTC; 2 points)
    - Neel Nanda 10 Jun 2025 9:42 UTC
      24 points
      6
      Parent
      It is not clear to me that Anthropic “unilaterally stopping” will result in meaningfully better outcomes than the status quo, let alone that it would be anywhere near the best way for Anthropic to leverage its situation.
      - Vaniver 11 Jun 2025 6:22 UTC
        26 points
        10
        Parent
        I do think there’s a Virtue of Silence problem here.
        Like—I was a ML expert who, roughly ten years ago, decided to not advance capabilities and instead work on safety-related things, and when the returns to that seemed too dismal stopped doing that also. How much did my ‘unilateral stopping’ change things? It’s really hard to estimate the counterfactual of how much I would have actually shifted progress; on the capabilities front I had several ‘good ideas’ years early but maybe my execution would’ve sucked, or I would’ve been focused on my bad ideas instead. (Or maybe me being at the OpenAI lunch table and asking people good questions would have sped the company up by 2%, or w/e, independent of my direct work.)
        How many people are there like me? Also not obvious, but probably not that many. (I would guess most of them ended up in the MIRI orbit and I know them, but maybe there are lurkers—one of my friends in SF works for generic tech companies but is highly suspicious of working for AI companies, for reasons roughly downstream of MIRI, and there might easily be hundreds of people in that boat. But maybe the AI companies would only actually have wanted to hire ten of them, and the others objecting to AI work didn’t actually matter.)
      - Thane Ruthenis 10 Jun 2025 22:08 UTC
        26 points
        7
        Parent
        It is not clear to me that Anthropic “unilaterally stopping” will result in meaningfully better outcomes than the status quo
        I think that just Anthropic, OpenAI, and DeepMind stopping would plausibly result in meaningfully better outcomes than the status quo. I still see no strong evidence that anyone outside these labs is actually pursuing AGI with anything like their level of effectiveness. I think it’s very plausible that everyone else is either LARPing (random LLM startups), or largely following their lead (DeepSeek/China), or pursuing dead ends (Meta’s LeCun), or some combination.
        The o1 release is a good example. Yes, everyone and their grandmother was absent-mindedly thinking about RL-on-CoTs and tinkering with relevant experiments. But it took OpenAI deploying a flashy proof-of-concept for everyone to pour vast resources into this paradigm. In the counterfactual where the three major labs weren’t there, how long would it have taken the rest to get there?
        I think it’s plausible that if only those three actors stopped, we’d get +5-10 years to the timelines just from that. Which I expect does meaningfully improve the outcomes, particularly in AI-2027-style short-timeline worlds.
        So I think getting any one of them to individually stop would be pretty significant, actually (inasmuch as it’s a step towards “make all three stop”).
        Vaniver 11 Jun 2025 6:24 UTC
        23 points
        14
        Parent
        I think more than this, when you look at the labs you will often see the breakthru work was done by a small handful of people or a small team, whose direction was not popular before their success. If just those people had decided to retire to the tropics, and everyone else had stayed, I think that would have made a huge difference to the trajectory. (What does it look like if Alec Radford had decided to not pursue GPT? Maybe the idea was ‘obvious’ and someone else gets it a month later, but I don’t think so.)
      - habryka 10 Jun 2025 19:30 UTC
        14 points
        12
        Parent
        I see no principle by which I should allow Anthropic to build existentially dangerous technology, but disallow other people from building it. I think the right choice is for no lab to build it. I am here not calling for particularly much censure of Anthropic compared to all labs, and my guess is we can agree that in aggregate building existentially dangerous AIs is bad and should face censure.
      - Ben Pace 10 Jun 2025 16:17 UTC
        3 points
        3
        Parent
        If you are killing me and my friends because you think it better that you do the killing than someone else, then actually I will still ask you to stop, because I draw a hard line around killing me and my friends. Naturally, I have a similar line around developing tech that will likely kill me and my friends.
        Neel Nanda 10 Jun 2025 22:56 UTC
        7 points
        3
        Parent
        I think this would fail Anthropic’s ideological Turing test. For example, they might make arguments like: by being a frontier lab, they can push for impactful regulation in a way they couldn’t if they weren’t; they can set better norms and demonstrate good safety practices that get adopted by others; or they can conduct better safety research that they could not do without access to frontier models. It’s totally reasonable to disagree with this, or argue that their actions so far (e.g., lukewarm support and initial opposition to SB 1047) show that they are not doing this, but I don’t think these arguments are, in principle, ridiculous.
        Mikhail Samin 14 Jun 2025 11:26 UTC
        8 points
        4
        Parent
        Yeah, sorry, I think it’s just very tricky for me to pass Anthropic’s ITT, because to imitate Anthropic, I would need to be concurrently saying stuff like “by being a frontier lab, we can push for impactful regulation”, typing stuff like “this bill will impose multi-million dollar fines for minor, technical violations, representing a risk to smaller companies” about a NY bill with requirements only for $100m+ training runs that would not impose multi-million dollar fine for minor violations, and misleading a part of me about Dario’s role (he is the Anthropic’s politics and policy lead and was a lot more involved in SB 1047 than many at Anthropic think).
        It’s generally harder to pass ITT of an entity that lies to itself and others than to point out why it is incoherent and ridiculous.
        In my mind, a good predictor of Anthropic’s actions is something in the direction of “a bunch of Sam Altmans stuck with potentially unaligned employees (who care about x-risk), going hard on trying to win the race”.
        What links here?
        Neel Nanda's comment on Mikhail Samin’s Shortform by Mikhail Samin (14 Jun 2025 18:59 UTC; 2 points)
        Neel Nanda 14 Jun 2025 15:37 UTC
        2 points
        0
        Parent
        I disagree, but this doesn’t feel like a productive discussion, so I’ll leave things there
        
        Do you have a source for Anthropic comments on the NY bill? I couldn’t find them and that one is news to me
        Mikhail Samin 14 Jun 2025 15:42 UTC
        4 points
        0
        Parent
        A bill passed two chambers of New York State legislature. It incorporated a lot of feedback from this community. This bill’s author actually talked about it as a keynote speaker at an event organized by FAR at the end of May.
        There’s no good theory of change for Anthropic compatible with them opposing and misrepresenting this bill. If you work at Anthropic on AI capabilities, you should stop.
        From Jack Clark:
        We’ve given some feedback to this bill, like we do with many bills both at federal and state level. Despite improvements, we continue to have some concerns
        (Many such cases!)
        - RAISE is overly broad/unclear in some of its key definitions which makes it difficult to know how to comply
        - If the state believes there is a compliance deficiency in a lab’s safety plan, it’s not clear you’d get an opportunity to correct it before enforcement kicks in
        - Definition of ‘safety incident’ is extremely broad/unclear and the turnaround time is v short (72 hours!). This could make for lots of unnecessary over-reporting that distracts you from actual big issues
        - It also appears multi-million dollar fines could be imposed for minor, technical violations—this represents a real risk to smaller companies
        As we’ve been saying for some time (last year: anthropic.com/news/the-case-…) and recently (nytimes.com/2025/06/05/opi…) we think it’d be great if there could be a federal transparency standard
        If there isn’t anything at the federal level, we’ll continue to engage on bills at the state level—but as this thread highlights, this stuff is complicated.
        Any state proposals should be narrowly focused on transparency and not overly prescriptive. Ideally there would be a single rule for the country.
        Here’s what the bill’s author says in response:
        Jack, Anthropic has repeatedly stressed the urgency and importance of the public safety threats it’s addressing, but those issues seem surprisingly absent here.
        Unfortunately, there’s a fair amount in this thread that is misleading and/or inflammatory, especially “multi-million dollar fines could be imposed for minor, technical violations—this represents a real risk to smaller companies.”
        An army of lobbyists are painting RAISE as a burden for startups, and this language perpetuates that falsehood. RAISE only applies to companies that are spending over $100M on compute for the final training runs of frontier models, which is a very small, highly-resourced group.
        In addition, maximum fines are typically only applied by courts for severe violations, and it’s scaremongering to suggest that the largest penalties will apply to minor infractions.
        The 72 hour incident reporting timeline is the same as the cyber incident reporting timeline in the financial services industry, and only a short initial report is required.
        AG enforcement + right to cure is effectively toothless, could lead to uneven enforcement, and seems like a bad idea given the high stakes of the issue.
        Ben Pace 11 Jun 2025 2:41 UTC
        4 points
        6
        Parent
        I’m not saying that it’s implausible that the consequences might seem better. I’m stating it’s still morally wrong to race toward causing a likely extinction-level event as that’s a pretty schelling place for a deontological lines against action.
        Neel Nanda 11 Jun 2025 8:18 UTC
        11 points
        1
        Parent
        Ah. In that case we just disagree about morality. I am strongly in favour of judging actions by their consequences, especially for incredibly high stakes actions like potential extinction level events. If an action decreases the probability of extinction I am very strongly in favour of people taking it.
        
        I’m very open to arguments that the consequences would be worse, that this is the wrong decision theory, etc, but you don’t seem to be making those?
        Ben Pace 11 Jun 2025 23:13 UTC
        24 points
        23
        Parent
        I too believe we should ultimately judge things based on their consequences. I believe that having deontological lines against certain actions is something that leads humans to make decisions with better consequences, partly because we are bounded agents that cannot well-compute the consequences of all of our actions.
        For instance, I think you would agree that it would be wrong to kill someone in order to prevent more deaths, today here in the Western world. Like, if an assassin is going to kill two people, but says if you kill one then he won’t kill the other, if you kill that person you should still be prosecuted for murder. It is actually good to not cross these lines even if the local consequentialist argument seems to check out. I make the same sort of argument for being first in the race toward an extinction-level event. Building an extinction-machine is wrong, and arguing you’ll be slightly more likely to pull back first does not stop it from being something you should not do.
        I think when you look back at a civilization that raced to the precipice and committed auto-genocide, and ask where the lines in the sand should’ve been drawn, the most natural one will be “building the extinction machine, and competing to be first to do so”. So it is wrong to cross this line, even for locally net positive tradeoffs.
        What links here?
        Ben Pace's comment on Mikhail Samin’s Shortform by Mikhail Samin (12 Jun 2025 0:57 UTC; 2 points)
        Neel Nanda 14 Jun 2025 18:18 UTC
        15 points
        3
        Parent
        I think this just takes it up one level of meta. We are arguing about the consequences of a ruleset. You are arguing that your ruleset has better consequences, while others disagree. And so you try to censure these people—this is your prerogative, but I don’t think this really gets you out of the regress of people disagreeing about what the best actions are.
        
        Engaging with the object level of whether your proposed ruleset is a good one, I feel torn.
        
        For your analogy of murder, I am very pro-not-murdering people, but I would argue this is convergent because it is broadly agreed upon by society. We all benefit from it being part of the social contract, and breaking that erodes the social contract in a way that harms all involved. If Anthropic unilaterally stopped trying to build AGI, I do not think this would significantly affect other labs, who would continue their work, so this feels disanalogous.
        
        And it is reasonable in extreme conditions (e.g. when those prohibitions are violated by others acting against you) to abandon standard ethical prohibitions. For example, I think it was just for Allied soldiers to kill Nazi soldiers in World War II. I think having nuclear weapons is terrible and questionable but I generally don’t support countries unilaterally abandoning their nuclear weapons, leaving them vulnerable to other nuclear-armed nations. Obviously, there are many disanalogies, but my point is that you need to establish how much a given deontological prohibition makes sense in unusual situations, rather than just appealing to moral intuition.
        
        I’m not here to defend Anthropic’s actions on the object level—they are not acting as I would in their situation, but they may have sound reasons. But they are not acting badly enough that I confidently assume bad faith. They have had positive effects, like their technical research and helping RSPs become established, though I disagree with some of their policy positions.
        
        Another disanalogy between this and murder is that there are multiple AGI labs, and only one needs to cause human extinction. If Anthropic ceased to exist, other labs would continue this work. I’d argue that Anthropic is accelerating development by researching capabilities and intensifying commercial pressure, and this is bad. But when arguing about acceleration’s harm, we must weigh it against Anthropic’s potential good—this becomes more of an apples-to-apples comparison rather than a clear deontological violation.
        Ben Pace 14 Jun 2025 19:45 UTC
        10 points
        7
        Parent
        If Anthropic unilaterally stopped trying to build AGI, I do not think this would significantly affect other labs, who would continue their work, so this feels disanalogous.
        Not a crux for either of us, but I disagree. When is the last time that someone shut down a multi-billion dollar profit arm of a company due to ethics, and especially due to the threat of extinction? If Anthropic announced they were ceasing development / shutting down because they did not want to cause an extinction-level event, this would have massive ramifications through society as people started to take this consequence more seriously, and many people would become more scared, including friends of employees at the other companies and more of the employees themselves. This would have massive positive effects.
        For your analogy of murder, I am very pro-not-murdering people, but I would argue this is convergent because it is broadly agreed upon by society. We all benefit from it being part of the social contract, and breaking that erodes the social contract in a way that harms all involved.
        This implies one should never draw lines in the sand about good/bad behavior if society has not reached consensus on it. In contrast, I think it is good to not do many behaviors even if your society has not yet reached consensus on it. For instance, if a government has not yet regulated that language-models shouldn’t encourage people to kill themselves, and then language models do and 1000s of ppl die (NB: this is a fictional example), this isn’t ethically fine just because it wasn’t illegal. I think we should act in ways that we believe will make sense as policies even before they have achieved consensus, and this is part of what makes someone engaged in ethics rather than in simply “doing what you are told”.
        You bring up Nazism. I think that it was wrong to go along with Nazism even though the government endorsed it. Surely there are ethical lines against causing an extinction-level event even if your society has not come to a consensus on where those lines are yet. And even if we never achieve consensus, everyone should still attempt to figure out the answer and live by it, rather than give up on having such ethical lines.
        I’m not here to defend Anthropic’s actions on the object level—they are not acting as I would in their situation, but they may have sound reasons. But they are not acting badly enough that I confidently assume bad faith. They have had positive effects, like their technical research and helping RSPs become established, though I disagree with some of their policy positions.
        Habryka wrote about how the bad-faith comment was a non-sequiter in another thread. I will here say that the “I’m not here to defend Anthropic’s actions on the object level” doesn’t make sense to me. I am saying they should stop racing, and you are saying they should not, and we are exchanging arguments for this, currently coming down to the ethics of racing toward an extinction-level event and whether there are deontological lines against doing that. I agree that you are not attempting to endorse all the details of what they are doing beyond that, but I believe you are broadly defending their IMO key object-level action of doing multi-billion dollar AI capabilities research and building massive industry momentum.
        You are arguing that your ruleset has better consequences, while others disagree. And so you try to censure these people—this is your prerogative, but I don’t think this really gets you out of the regress of people disagreeing about what the best actions are.
        It reads to me that you’re just talking around the point here. I said that people shouldn’t race toward extinction-level threats for deontological reasons, you said we should evaluate the direct consequences, I said deontological reasons are endorsed by a consequentialist framework so we should analyze it deontologically, and now you’re saying that I’m conceding the initial point that we should be doing the consequentialist analysis. No, I’m saying we should do a deontological analysis, and this is in conflict with you saying we should just judge based on the direct consequences that we know how to estimate.
        I’d argue that Anthropic is accelerating development by researching capabilities and intensifying commercial pressure, and this is bad. But when arguing about acceleration’s harm, we must weigh it against Anthropic’s potential good—this becomes more of an apples-to-apples comparison rather than a clear deontological violation.
        You keep trying to engage me in this consequentialist analysis, and say that sometimes (e.g. during times of war) the deontological rules can have exceptions, but you have not argued for why this is an exception. If people around you in society start committing murder, would you then start murdering? If people around you started lying, would you then start lying? I don’t think so. Why then, if people around you are racing to an extinction-level event, does the obvious rule of “do not race toward an extinction-level event” get an exception? Other people doing things that are wrong (even if they get away with it!) doesn’t make those things right.
        Expand this thread
        Neel Nanda 14 Jun 2025 21:49 UTC
        6 points
        1
        Parent
        The point I was trying to make is that, if I understood you correctly, you were trying to appeal to common sense morality that deontological rules like this are good on consequentialist grounds. I was trying to give examples why I don’t think this immediately follows and you need to actually make object level arguments about this and engage with the counter arguments. If you want to argue for deontological rules, you need to justify why those rules
        
        I am not trying to defend the claim that I am highly confident that what Anthropic is doing is ethical and net good for the world, but I am trying to defend the claim that there are vaguely similar plans to Anthropics that I would predict are net good in expectation, e.g., becoming a prominent actor then leveraging your influence to push for good norms and good regulations. Your arguments would also imply that plans like that should be deontologically prohibited and I disagree.
        
        I don’t think this follows from naive moral intuition. A crucial disanalogy with murder is that if you don’t kill someone, the counterfactual is that the person is alive. While if you don’t race towards AGI, the counterfactual is that maybe someone else makes it and we die anyway. This means that we need to be engaging in discussion about the consequences of there being another actor pushing for this, the consequences of other actions this actor may take, and how this all nets out, which I don’t feel like you’re doing.
        
        I expect AGI to be either the best or worse thing that has ever happened, and this means that important actions will typically be high variance, with major positive or negative consequences. Declining to engage in things with the potential for high negative consequences severely restricts your action space. And given that it’s plausible that there’s a terrible outcome even if we do nothing, I don’t think the act-omission distinction applies
        Ben Pace 14 Jun 2025 22:30 UTC
        6 points
        4
        Parent
        I am not trying to defend the claim that I am highly confident that what Anthropic is doing is ethical and net good for the world, but I am trying to defend the claim that there are vaguely similar plans to Anthropics that I would predict are net good in expectation, e.g., becoming a prominent actor then leveraging your influence to push for good norms and good regulations. Your arguments would also imply that plans like that should be deontologically prohibited and I disagree.
        Thank you for clarifying, I think I understand now. I’m hearing you’re not arguing in defense of Anthropic’s specific plan but in defense of there being some part of the space of plans being good that involve racing to build something that has a (say) >20% chance of causing an extinction-level event, that Anthropic may or may not fall into.
        A crucial disanalogy with murder is that if you don’t kill someone, the counterfactual is that the person is alive. While if you don’t race towards AGI, the counterfactual is that maybe someone else makes it and we die anyway.
        This isn’t disanalagous. As I have already said in this thread, you are not allowed to murder someone even if someone else is planning to murder them. If you find out multiple parties are going to murder Bob, you are not now allowed to murder Bob in a way that is slightly less likely to be successful.
        Crucially it is not to be assumed that we will build AGI in the next 1-2 decades. If the countries of the world decided to ban training runs of a particular size, because we don’t want to take this sort of extinction-level risk, then it would not happen. Assuming this out of the hypothesis space will get you into bad ethical territory. Suppose a military general says “War is inevitable, the only question is how fast it’s over when it starts and how few deaths there are.” This general would never take responsibility for instigating. Similarly if you assume with certainty that AGI will be developed risking in next few decades, you absolve yourself of all responsibility for being the one who does so.
        Declining to engage in things with the potential for high negative consequences severely restricts your action space.
        I think you are failing to understand the concept of deontology by replacing “breaks deontological rules” with “highly negative consequences”. Deontology doesn’t say “you can tell a lie if it saves you from telling two lies later” or “lying is wrong unless you get a lot of money for it”. It says “don’t tell lies”. There are exceptional circumstances for all rules, but unless you’re in an exceptional circumstance, you treat them as rules, and don’t treat violations as integers to be traded against each other.
        
        When the stakes get high it is not time to start lying, cheating, killing, or unilaterally betting the extinction of the human race. If it is for someone, then they simply can’t be trusted to follow these ethical principles when it matters.
        Neel Nanda 15 Jun 2025 0:50 UTC
        9 points
        2
        Parent
        
        Thank you for clarifying, I think I understand now. I’m hearing you’re not arguing in defense of Anthropic’s specific plan but in defense of there being some part of the space of plans being good that involve racing to build something that has a (say) >20% chance of causing an extinction-level event, that Anthropic may or may not fall into.
        
        Yes that is correct
        
        This isn’t disanalagous. As I have already said in this thread, you are not allowed to murder someone even if someone else is planning to murder them. If you find out multiple parties are going to murder Bob, you are not now allowed to murder Bob in a way that is slightly less likely to be successful.
        
        I disagree. If a patient has a deadly illness then I think it is fine for a surgeon to perform a dangerous operation to try to save their life. I think the word murder is obfuscating things and suggest we instead talk in terms of “taking actions that may lead to death”, which I think is more analogous—hopefully we can agree Anthropic won’t intentionally cause human extinction. I think it is totally reasonable to take actions that net decrease someone’s probability of dying, while introducing some novel risks.
        
        I think you are failing to understand the concept of deontology by replacing “breaks deontological rules” with “highly negative consequences”. Deontology doesn’t say “you can tell a lie if it saves you from telling two lies later” or “lying is wrong unless you get a lot of money for it”. It says “don’t tell lies”. There are exceptional circumstances for all rules, but unless you’re in an exceptional circumstance, you treat them as rules, and don’t treat violations as integers to be traded against each other.
        
        I think we’re talking past each other. I understood you as arguing “deontological rules against X will systematically lead to better consequences than trying to evaluate each situation carefully, because humans are fallible”. I am trying to argue that your proposed deontological rule does not obviously lead to better consequences as an absolute rule. Please correct me if I have misunderstood.
        
        I am arguing that “things to do with human extinction from AI, when there’s already a meaningful likelihood” are not a domain where ethical prohibitions like “never do things that could lead to human extinction” are productive. For example, you help run LessWrong, which I’d argue has helped raise the salience of AI x-risk, which plausibly has accelerated timelines. I personally think this is outweighed by other effects, but that’s via reasoning about the consequences. Your actions and Anthropic’s feel more like a difference in scale than a difference in kind.
        
        Assuming this out of the hypothesis space will get you into bad ethical territory
        
        I am not arguing that AI x-risk is inevitable, in fact I’m arguing the opposite. AI x-risk is both plausible and not inevitable. Actions to reduce this seem very valuable. Actions that do this will often have side effects that increase risk in other ways. In my opinion, this is not sufficient cause to immediately rule them out.
        
        Meanwhile, I would consider anyone pushing hard to make frontier AI to be highly reckless if they were the only one who could cause extinction, and they could unilaterally stop—this is a way to unilaterally bring risk to zero, which is better than any other action. But Anthropic has no such action available, and so I want them to take the actions that reduce risk as much as possible. And there are arguments for proceeding and arguments for stopping.
        Ben Pace 15 Jun 2025 19:17 UTC
        8 points
        6
        Parent
        > As I have already said in this thread, you are not allowed to murder someone even if someone else is planning to murder them. If you find out multiple parties are going to murder Bob, you are not now allowed to murder Bob in a way that is slightly less likely to be successful.
        I disagree. If a patient has a deadly illness then I think it is fine for a surgeon to perform a dangerous operation to try to save their life. I think the word murder is obfuscating things and suggest we instead talk in terms of “taking actions that may lead to death”, which I think is more analogous—hopefully we can agree Anthropic won’t intentionally cause human extinction. I think it is totally reasonable to take actions that net decrease someone’s probability of dying, while introducing some novel risks.
        This is simplifying away key details.
        If you go up to a person with a deadly illness and non-consensually do a dangerous surgery on them, this is wrong. If you kill them via this, their family has a right to sue you / prosecute you for murder. Once again, simply because some bad outcome is likely, you do not have ethical mandate to now go and cause it yourself. Deontology is typically about forbidding classes of action that on net make the world worse even when locally you have a good reason. Talking about “taking actions that lead to death” explicitly obfuscates the mechanism. I know you won’t endorse this once I point it out, but under this strictly-consequentialist framework “blogging on LessWrong about extinction-risk from AI” and “committing murder” are just two different “actions that lead to death” and neither can be thought of as having different deontological lines drawn. On the contrary, “don’t commit murder” and “don’t build a doomsday machine” are simple and natural deontological rules, whereas “don’t build a blogging platform with unusually high standards for truthseeking” is not.
        I am trying to argue that your proposed deontological rule does not obviously lead to better consequences as an absolute rule. Please correct me if I have misunderstood.
        I am not trying to argue for an especially novel deontological rule… “building a doomsday machine” is wrong. It’s a far greater sin than murder. I think you’d do better to think of the AI companies as more like competing political factions each of whom’s base is very motivated toward committing a genocide against their neighbors. If your political faction commits a genocide; and you were merely a top-200 ranked official who didn’t particularly want a genocide, you still bear moral responsibility for it even though you only did paperwork and took meetings and maybe worked in a different department. Just because there are two political factions whose bases are uncomfortably attracted to the idea of committing genocide does not now make it ethically clear for you to make a third one that hungers for genocide but has wiser people in charge.
        I am not advocating for some new interesting deontological rule. I am arguing that the obvious rule against building a doomsday machine applies here straightforwardly. Deontological violations don’t stop being bad just because other people are committing them. You cannot commit murder just because other people do, and you cannot build a doomsday machine just because other people are. You generally shouldn’t build doomsday machines even if you have a good reason. To argue against this you should show why deontological rules break down, and then apply it to this case, but the doctor example you gave doesn’t show that, because by-default you aren’t actually allowed to non-consensually do risky surgeries on people even if it makes sense on the consequentialist calculus.
        Neel Nanda 15 Jun 2025 19:58 UTC
        14 points
        3
        Parent
        I continue to feel like we’re talking past each other, so let me start again. We both agree that causing human extinction is extremely bad. If I understand you correctly, you are arguing that it makes sense to follow deontological rules, even if there’s a really good reason breaking them seems locally beneficial, because on average, the decision theory that’s willing to do harmful things for complex reasons performs badly.
        
        The goal of my various analogies was to point out that this is not actually a fully correcct statement about common sense morality. Common sense morality has several exceptions for things like having someone’s consent to take on a risk, someone doing bad things to you, and innocent people being forced to do terrible things.
        
        Given that exceptions exist, for times when we believe the general policy is bad, I am arguing that there should be an additional exception stating that: if there is a realistic chance that a bad outcome happens anyway, and you believe you can reduce the probability of this bad outcome happening (even after accounting for cognitive biases, sources of overconfidence, etc.), it can be ethically permissible to take actions that have side effects around increasing the probability of the bad outcome in other ways.
        
        When analysing the reasons I broadly buy the deontological framework for “don’t commit murder”, I think there are some clear lines in the sand, such as maintaining a valuable social contract, and how if you do nothing, the outcomes will be broadly good. Further, society has never really had to deal with something as extreme as doomsday machines, which makes me hesitant to appeal to common sense morality at all. To me, the point where things break down with standard deontological reasoning is that this is just very outside the context where such priors were developed and have proven to be robust. I am not comfortable naively assuming they will generalize, and I think this is an incredibly high stakes thing where far and away the only thing I care about is taking the actions that will actually, in practice, lead to a lower probability of extinction.
        
        Regarding your examples, I’m completely ethically comfortable with someone making a third political party in a country where the population has two groups who both strongly want to cause genocide to the other. I think there are many ways that such a third political party could reduce the probability of genocide, even if it ultimately comprises a political base who wants negative outcomes.
        
        Another example is nuclear weapons. From a certain perspective, holding nuclear weapons is highly unethical as it risks nuclear winter, whether from provoking someone else or from a false alarm on your side. While I’m strongly in favour of countries unilaterally switching to a no-first-use policy and pursuing mutual disarmament, I am not in favour of countries unilaterally disarming themselves. By my interpretation of your proposed ethical rules, this suggests countries should unilaterally disarm. Do you agree with that? If not, what’s disanalogous?
        
        COVID-19 would be another example. Biology is not my area of expertise, but as I understand it, governments took actions that were probably good but risked some negative effects that could have made things worse. For example, widespread use of vaccines or antivirals, especially via the first-doses-first approach, plausibly made it more likely that resistant strains would spread, potentially affecting everyone else. In my opinion, these were clearly net-positive actions because the good done far outweighed the potential harm.
        
        You could raise the objection that governments are democratically elected while Anthropic is not, but there were many other actors in these scenarios, like uranium miners, vaccine manufacturers, etc., who were also complicit.
        
        Again, I’m purely defending the abstract point of “plans that could result in increased human extinction, even if by building the doomsday machine yourself, are not automatically ethically forbidden”. You’re welcome to critique Anthropic’s actual actions as much as you like. But you seem to be making a much more general claim.
        Ben Pace 16 Jun 2025 0:23 UTC
        21 points
        24
        Parent
        If I understand you correctly, you are arguing that it makes sense to follow deontological rules, even if there’s a really good reason breaking them seems locally beneficial, because on average, the decision theory that’s willing to do harmful things for complex reasons performs badly.
        Hm… I would say that one should follow deontological rules like “don’t lie” and “don’t steal” and so on because we fail to understand or predict the knock-on consequences. For instance they can get the world into a much worse equilibrium of mutual liars/stealers, for instance, in ways that are hard to predict. And being a good person can get the world into a much better equilibrium of mutually-honorable people in ways that are hard to predict. And also because, if it does screw up in some hard to predict way, then when you look back, it will often be the easiest line in the sand to draw.
        
        For instance, if SBF is wondering at what point he could have most reliably intervened on his whole company collapsing and ruining the reputation of things associated with it, he might talk about certain deals he made or strategic plays with Binance or the US Govt, for he is not a very ethical person; I would talk about not taking customer deposits.
        If and when we get to an endgame where tons of AI systems are sociopathically lying and stealing money and ultimately killing the humans, I suspect people of SBF’s mindset again to talk about how the US and China should’ve played things, or how Musk should’ve played OpenAI, and how Amodei should’ve done played with DC. And I will talk about not racing to develop the unaligned AI systems in the first place.
        To me, the point where things break down with standard deontological reasoning is that this is just very outside the context where such priors were developed and have proven to be robust. I am not comfortable naively assuming they will generalize, and I think this is an incredibly high stakes thing where far and away the only thing I care about is taking the actions that will actually, in practice, lead to a lower probability of extinction.
        I don’t really know why you think that this generalization can’t be made to things we’ve not seen before. So many things I experience haven’t been seen before in history. How many centuries have we had to develop ethical intuitions for how to write on web forums? There are still answers to these questions, and I can identify ethical and unethical behaviors, as can you (e.g. sockpuppeting, doxing, brigading, etc). There can be ethical lines in novel situations, not only historically common ones.
        Another example is nuclear weapons. From a certain perspective, holding nuclear weapons is highly unethical as it risks nuclear winter, whether from provoking someone else or from a false alarm on your side. While I’m strongly in favour of countries unilaterally switching to a no-first-use policy and pursuing mutual disarmament, I am not in favour of countries unilaterally disarming themselves. By my interpretation of your proposed ethical rules, this suggests countries should unilaterally disarm. Do you agree with that? If not, what’s disanalogous?
        I am not sure what I would propose if I believed Nuclear Winter was a serious existential threat; it seems plausible to me that the ethical thing would be to unilaterally disarm. I suspect that at the very least if I were a country I would openly and aggressively campaign for mutual disarmament. (If any AI capabilities company openly campaigned for making it illegal to develop AI then I suspect I would consider that plausibly quite ethical).
        I’m purely defending the abstract point of “plans that could result in increased human extinction, even if by building the doomsday machine yourself, are not automatically ethically forbidden”.
        To be clear, I think you’re defending a somewhat stronger claim. You write further up thread:
        I am not trying to defend the claim that I am highly confident that what Anthropic is doing is ethical and net good for the world, but I am trying to defend the claim that there are vaguely similar plans to Anthropics that I would predict are net good in expectation, e.g., becoming a prominent actor then leveraging your influence to push for good norms and good regulations. Your arguments would also imply that plans like that should be deontologically prohibited and I disagree.
        My current stance is that all actors currently in this space are doing things prohibited by basic deontology. This is not merely an unfortunate outcome, but is a grave sin, for they are building doomsday machines, likely the greatest evil that we will ever experience in our history (regardless of if they are successful). So I want to emphasize that the boundary here is not between “better and worse plans” but between “moral murky and morally evil plans”. Insofar as you commit a genocide or worse, history should remember your names as people of shame who we must take pain never to repeat. Insofar as you played with the idea, thought you could control it, and failed, then history should still think of you this way.
        I believe we disagree over where the deontological lines are, given you are defending “vaguely similar plans to Anthropic’s”. Perhaps you could point to where you think they are? Presumably you think that a Larry Page style “this is just the next stage in evolution” indifference to human extinction AI-project would be morally wrong?
        Here’s two lines that I think might cross into being acceptable [edit: or rather, “only morally murky”] from my perspective.
        I think it might be appropriate to risk building a doomsday machine if, loudly and in-public, you told everyone “I AM BUILDING A POTENTIAL DOOMSDAY MACHINE, AND YOU SHOULD SHUT MY INDUSTRY DOWN. IF YOU DON’T THEN I WILL RIDE THIS WAVE AND ATTEMPT TO IMPROVE IT, BUT YOU REALLY SHOULD NOT LET ANYONE DO WHAT I AM DOING.” And was engaged in serious lobbying and advertising efforts to this effect.
        I think it could possibly be acceptable to build an AI capabilities company if you committed to never releasing or developing any frontier capabilities AND if all employees also committed not to leave and release frontier capabilities elsewhere AND you were attempting to use this to differential improve society’s epistemics and awareness of AI’s extinction-level threat. Though this might still cause too much economic investment into AI as an industry, I’m not sure.
        I of course do not think any current project looks superficially like these.
        Neel Nanda 16 Jun 2025 1:11 UTC
        1 point
        −1
        Parent
        
        Here’s two lines that I think might cross into being acceptable from my perspective.
        
        I think it might be appropriate to risk building a doomsday machine if, loudly and in-public, you told everyone “I AM BUILDING A POTENTIAL DOOMSDAY MACHINE, AND YOU SHOULD SHUT MY INDUSTRY DOWN. IF YOU DON’T THEN I WILL RIDE THIS WAVE AND ATTEMPT TO IMPROVE IT, BUT YOU REALLY SHOULD NOT LET ANYONE DO WHAT I AM DOING.” And was engaged in serious lobbying and advertising efforts to this effect.
        
        I think it could possibly be acceptable to build an AI capabilities company if you committed to never releasing or developing any frontier capabilities AND if all employees also committed not to leave and release frontier capabilities elsewhere AND you were attempting to use this to differential improve society’s epistemics and awareness of AI’s extinction-level threat. Though this might still cause too much economic investment into AI as an industry, I’m not sure.
        
        I of course do not think any current project looks superficially like these.
        
        Okay, after reading this it seems to me that we broadly do agree and are just arguing over price. I’m arguing that it is permissible to try to build a doomsday machine if there are really good reasons to believe it is net good for the probability of doomsday. It sounds like you agree, and give two examples of what “really good reasons” could be. I’m sure we disagree on the boundaries of where the really good reasons lie, but I’m trying to defend the point that you actually need to think about the consequences.
        
        What am I missing? Is it that you think these two are really good reasons, not because of the impact on the consequences, but because of the attitude/framing involved?
        ryan_greenblatt 16 Jun 2025 3:57 UTC
        17 points
        18
        Parent
        I’m not Ben, but I think you don’t understand. I think explaining what you are doing loudly in public isn’t like “having a really good reason to believe it is net good” is instead more like asking for consent.
        
        Like you are saying “please stop me by shutting down this industry” and if you don’t get shut down, that it is analogous to consent: you’ve informed society about what you’re doing and why and tried to ensure that if everyone else followed a similar sort of policy we’d be in a better position.
        
        (Not claiming I agree with Ben’s perspective here, just trying to explain it as I understand it.)
        Neel Nanda 16 Jun 2025 11:40 UTC
        4 points
        0
        Parent
        Ah! Thanks a lot for the explanation, that makes way more sense, and is much weaker than what I thought Ben was arguing for. Yeah this seems like a pretty reasonable position, especially “take actions where if everyone else took them we would be much better off” and I am completely fine with holding Anthropic to that bar. I’m not fully sold re the asking for consent framing, but mostly for practical reasons—I think there’s many ways that society is not able to act constantly, and the actions of governments on many issues are not a reflection of the true informed will of the people, but I expect there’s some reframe here that I would agree with.
        habryka 16 Jun 2025 16:45 UTC
        2 points
        0
        Parent
        and is much weaker than what I thought Ben was arguing for.
        I don’t think Ryan (or I) was intending to imply a measure of degree, so my guess is unfortunately somehow communication still failed. Like, I don’t think Ryan (or Ben) are saying “it’s OK to do these things you just have to ask for consent”. Ryan was just trying to point out a specific way in which things don’t bottom out in consequentialist analysis.
        If you end up walking away with thinking that Ben believes “the key thing to get right for AI companies is to ask for consent before building the doomsday machine”, which I feel like is the only interpretation of what you could mean by “weaker” that I currently have, then I think that would be a pretty deep misunderstanding.
        Neel Nanda 16 Jun 2025 18:50 UTC
        4 points
        0
        Parent
        OK, I’m going to bow out of the conversation at this point, I’d guess further back and forth won’t be too productive. Thanks all!
        Ben Pace 17 Jun 2025 6:49 UTC
        4 points
        2
        Parent
        There is something important to me in this conversation about not trusting one’s consequentialist analysis when evaluating proposals to violate deontological lines, and from my perspective you still haven’t managed to paraphrase this basic ethical idea or shown you’ve understood it, which I feel a little frustrated over. Ah well. I still have been glad of this opportunity to argue it through, and I feel grateful to Neel for that.
        Mikhail Samin 14 Jun 2025 19:31 UTC
        5 points
        5
        Parent
        I actually agree with Neel that, in principle, an AI lab could race for AGI while acting responsibly and IMO not violating deontology.
        Releasing models exactly at the level of their top competitor, immediately after the competitor’s release and a bit cheaper; talking to the governments and lobbying for regulation; having an actually robust governance structure and not doing a thing that increases the chance of everyone dying.
        This doesn’t describe any of the existing labs, though.
        habryka 14 Jun 2025 18:55 UTC
        4 points
        0
        Parent
        But they are not acting badly enough that I confidently assume bad faith
        I like a lot of your comment, but this feels like a total non-sequitur. Did anyone involved in this conversation say that Anthropic was acting under false pretenses? I don’t think anyone brought up concerns that rest on assumptions of bad faith (though to be clear, Anthropic employees have mostly told me I should assume something like bad faith from Anthropic as an institution, and that people should try to hold it accountable the same way any other AI lab, and to not straightforwardly trust statements Anthropic makes without associated commitments, so I do think I would assume bad faith, but it mostly just feels besides the point in this discussion).
        Expand this thread
        Neel Nanda 14 Jun 2025 18:59 UTC
        2 points
        0
        Parent
        Ah, sorry, I was thinking of Mikhail’s reply here, not anything you or Ben said in this conversation https://www.lesswrong.com/posts/BqwXYFtpetFxqkxip/mikhail-samin-s-shortform?commentId=w2doi6TzjB5HMMfmx
        
        But yeah, I’m happy to leave that aside, I don’t think it’s cruxy
        habryka 14 Jun 2025 19:01 UTC
        2 points
        0
        Parent
        Makes sense! I hadn’t read that subthread, so was additionally confused.
        Mikhail Samin 14 Jun 2025 19:35 UTC
        2 points
        −3
        Parent
        it was just for Allied soldiers to kill Nazi soldiers in World War II
        Killing anyone who hasn’t done anything to lose deontological protection is wrong and clearly violates deontology.
        As a Nazi soldier, you lose deontological protection.
        There are many humans who are not even customers of any of the AI labs; they clearly have not lost deontological protection, and it’s not okay to risk killing them without their consent.
        Expand this thread
        Neel Nanda 15 Jun 2025 11:39 UTC
        6 points
        5
        Parent
        I disagree with this as a statement about war, I’m sure a bunch of Nazi soldiers were conscripted, did not particularly support the regime, and were participating out of fear. Similarly, malicious governments have conscripted innocent civilians and kept them in line through fear in many unjust wars throughout history. And even people who volunteered may have done this due to being brainwashed by extensive propaganda that led to them believing they were doing the right thing. The real world is messy and strict deontological prohibitions break down in complex and high stakes situations, where inaction also has terrible consequences—I strongly disagree with a deontological rule that says countries are not about to defend themselves against innocent people forced to do terrible things
        Mikhail Samin 15 Jun 2025 18:06 UTC
        2 points
        0
        Parent
        My deontology prescribes not to join a Nazi army regardless of how much fear you’re in. It’s impossible to demand of people to be HPMOR!Hermione, but I think this standard works fine for real-world situations.
        (While I do not wish any Nazi soldiers death, regardless of their views or reasons for their actions. There’s a sense in which Nazi soldiers are innocent regardless of what they’ve done; none of them are grown up enough to be truly responsible for their actions. Every single death is very sad, and I’m not sure there has ever been even a single non-innocent human. At the same time, I think it’s okay to kill Nazi soldiers (unless they’re in a process of surrenderring, etc.) or lie to them, and they don’t have deontological protection.)
        You’re arguing it’s okay to defend yourself against innocent people forced to do terrible things. I agree with that, and my deontology agrees with that.
        At the same time, killing everyone because otherwise someone else could’ve killed them with a higher chance = killing many people who aren’t ever going to contribute to any terrible things. I think, and my deontology thinks, that this is not okay. Random civilians are not innocent Nazi soldiers; they’re simply random innocent people. I ask of Anthropic to please stop working towards killing them.
        Neel Nanda 15 Jun 2025 18:24 UTC
        3 points
        0
        Parent
        And do you feel this way because you believe that the general policy of obeying such deontological prohibitions will on net result in better outcomes? Or because you think that even if there were good reason to believe that following a different policy would lead to better empirical outcomes, your ethics say that you should be deontologically opposed regardless?
        Mikhail Samin 15 Jun 2025 19:12 UTC
        5 points
        2
        Parent
        I think the general policy of obeying such deontological rules leads to better outcomes; this is the reason for having deontology in the first place. (I agree with that old post on what to do when it feels like there’s a good reason to believe that following a different policy would lead to better outcomes.)
        habryka 15 Jun 2025 19:15 UTC
        4 points
        1
        Parent
        (Just as a datapoint, while largely agreeing with Ben here, I really don’t buy this concept of deontological protection of individuals. I think there are principles we have about when it’s OK to kill someone, but I don’t think the lines we have here route through individuals losing deontological protection.
        Killing a mass murderer while he is waiting for trial is IMO worse than killing a civilian in collateral damage as part of taking out an active combatant, because it violates and messes with different processes, which don’t generally route through individuals “losing deontological protection” but instead are more sensitive to the context the individuals are in)
        Mikhail Samin 16 Jun 2025 8:17 UTC
        2 points
        0
        Parent
        Locally: can you give an example of when it’s okay to kill someone who didn’t lose deontological protection, where you want to kill them because of the causal impact of their death?
        Ben Pace 17 Jun 2025 18:00 UTC
        4 points
        2
        Parent
        To me the issue goes the other way. The idea of “losing deontological protection” suggests I’m allowed to ignore deontological rules when interacting with someone. But that seems obviously crazy to me. For instance I think there’s a deontological injunction against lying, but just because someone lies doesn’t now mean I’m allowed to kill them. It doesn’t even mean I’m allowed to lie to them. I think lying to them would still be about as wrong as it was before, not a free action I can take whenever I feel like it.
        habryka 17 Jun 2025 18:29 UTC
        3 points
        2
        Parent
        I mean, a very classical example that I’ve seen a few times in media is shooting a civilian who is about to walk into a minefield in which multiple other civilians or military members are located. It seems tragic but obviously the right choice to shoot them if they don’t heed your warning.
        IDK, I also think it’s the right choice to pull the lever in the trolley problem, though the choice becomes less obvious the more it involves active killing as opposed to literally pulling a lever.
        Knight Lee 23 Jun 2025 6:53 UTC
        7 points
        −6
        Parent
        Sorry for replying to a dead thread but,
        Murder implies an intent to kill someone.
        Suppose I hire a hitman to kill you. But suppose there already are 3 hitmen trying to kill you, and I’m hoping my hitman would reach you first, and I know that my hitman has really bad aim. Once the first hitman reaches you and starts shooting, the other hitmen will freak out and run away, so I’m hoping you’re more likely to survive.
        I have no other options for saving you, since the only contact I have is a hitman, and he’s very bad at English and doesn’t understand any instructions except trying to kill someone.
        In this case, you can argue to the court that my plan to save you was retarded. But you cannot concede that my plan actually was a good idea consequentially, but deontologically unethical. Since I didn’t intend to kill anyone.
        Deontology only kicks in when your plan involves making someone die, or greatly increasing the chance someone dies.
        Mikhail Samin 23 Jun 2025 15:34 UTC
        7 points
        1
        Parent
        I feel like this it’s actually a great analogy! The only difference is that if your hitman starts shooting and doesn’t kill anyone, you get infinite gold.
        You know that in real life you go to police instead of hiring a hitman, right?
        And I claim that it’s really not okay to hire a hitman who might lower the chance of the person ending up dead, especially when your brain is aware of the infinite gold part.
        The good strategy for anyone in that situation to follow is to go to the police or go public and not hire any additional hitmen.
        Expand this thread
        Knight Lee 24 Jun 2025 2:36 UTC
        3 points
        2
        Parent
        Yeah, it’s less deontologically bad than murder but I admit it’s still not completely okay.
        PS: Part of the reason I used the unflattering hitman analogy is because I’m no longer as optimistic about Anthropic’s influence.
        They routinely describe other problems (e.g. winning the race against China to defend democracy) with the same urgency as AI Notkilleveryoneism.
        The only way to believe that AI Notkilleveryoneism is still Anthropic’s main purpose, is to hope that,
        They describe a ton of other problems with the same urgency as AI Notkilleveryoneism, but that is only due to political necessity.
        At the same time, their apparent concern for AI Notkilleveryoneism is not just a political maneuver, but significantly more genuine.
        This “hope” is plausible since the people in charge of Anthropic prefer to live, and consistently claimed to have high P(doom).
        But it’s not certain, and there is circumstantial evidence suggesting this isn’t the case (e.g. their lobbying direction, and how they’re choosing people for their board of directors).
        Maybe $^{50 %}$ this hope is just cope :(
        Ben Pace 23 Jun 2025 16:50 UTC
        6 points
        0
        Parent
        I don’t agree that deontology is about intent. Deontology is about action. Deontology is about not hiring hitmen to kill someone even if you have a really good reason, and even if your intent is good. Deontology is substantially about schelling lines of action where everything gets hard to predict and goes bad after you commit it.
        I imagine that your incompetent hitman has only like a 50% chance of succeeding, whereas the others have ~100%, that seems deontologically wrong to me.
        It seems plausible that what you mean to say by the hypothetical is that he has 0% chance.
        I admit this is more confusing and I’m not fully resolved on this.
        I notice I am confused about how you can get that epistemic state in real life.
        I observe that society will still prosecute you for attempted murder if you buy a hitman off the dark web, even one with a clearly incompetent reputation for ⁰⁄₁₀ kills or whatever.
        I think society’s ability to police this line is not as fine grained as you’re imagining, and so you should not buy incompetent hitmen in order to not kill your friend, unless you’re willing to face the consequences.
        Expand this thread
        Knight Lee 24 Jun 2025 1:57 UTC
        1 point
        0
        Parent
        To be honest I couldn’t resist writing the comment because I just wanted to share the silly thought :/
        Now that I think about it, it’s much more complicated. Mikhail Samin is right that the personal incentive of reaching AGI first really complicates the good intentions. And while a lot of deontology is about intent, it’s hyperbole to say that deontology is just intent.
        I think if your main intent is to save someone (and not personal gain), and your plan doesn’t require or seek anyone’s death, then it is deontologically much less bad than evil things like murder. But it may still be too bad for you to do, if you strongly lean towards deontology rather than consequentialism. Even if the court doesn’t find you guilty of first degree murder, it may still find you guilty of… some… things.
        One might argue that the enormous scale (risking everyone’s death instead of only one person), makes it deontologically worse. But I think the balance does not shift in favor of deontology and against consequentialism as we increase the scale (it might even shift a little in favor of consequentialism?).
        MondSemmel 12 Jun 2025 8:25 UTC
        3 points
        0
        Parent
        That’s fair, but the deontological argument doesn’t work for anyone building the extinction machine who is unconvinced by x-risk arguments, or deludes themselves that it’s not actually an extinction machine, or that extinction is extremely unlikely, or that the extinction machine is the only thing that can prevent extinction (as in all the alignment via AI proposals) etc. etc.
        Mikhail Samin 12 Jun 2025 13:34 UTC
        6 points
        4
        Parent
        This is not the case for many at Anthropic.
        Ben Pace 12 Jun 2025 8:30 UTC
        6 points
        5
        Parent
        True; in general, many people who behave poorly do not know that they do so.
        sjadler 13 Jun 2025 7:28 UTC
        1 point
        0
        Parent
        Plugging that I wrote a post which quotes Anthropic execs at length describing their views on race to the top: https://open.substack.com/pub/stevenadler/p/dont-rely-on-a-race-to-the-top (and yes agreed with Neel’s summary)
        sunwillrise 10 Jun 2025 16:31 UTC
        5 points
        2
        Parent
        I suppose if you think it’s less likely there will be killing involved if you’re the one holding the overheating gun than if someone else is holding it, that hard line probably goes away.
        Ben Pace 10 Jun 2025 18:35 UTC
        6 points
        4
        Parent
        Just because someone else is going to kill me, doesn’t mean we don’t have an important societal norm against murder. You’re not allowed to kill old people just because they’ve only got a few years left, or kill people with terminal diseases.
        sunwillrise 10 Jun 2025 18:47 UTC
        7 points
        2
        Parent
        I don’t see how that at all addresses the analogy I made.
        Ben Pace 11 Jun 2025 1:18 UTC
        5 points
        0
        Parent
        I am not quite sure what an overheating gun refers to, I am guessing the idea is that it has some chance of going off without being fired.
        Anyhow, if that’s accurate, it’s acceptable to decide to be the person holding an overheating gun, but it’s not acceptable to (for example) accept a contract to assassinate someone so that you get to have the overheating gun, or to promise to kill slightly fewer people with the gun than the next guy. Like, I understand consequentially fewer deaths happen, but our society has deontological lines against committing murder even given consequentialist arguments, which are good. You’re not allowed to commit murder even if you have a good reason.
        MondSemmel 10 Jun 2025 18:13 UTC
        2 points
        0
        Parent
        I fully expect we’re doomed, but I don’t find this attitude persuasive. If you don’t want to be killed, you advocate for actions that hopefully result in you not being killed, whereas this action looks like it just results in you being killed by someone else. Like you’re facing a firing squad and pleading specifically with just one of the executioners.
        Ben Pace 10 Jun 2025 18:37 UTC
        1 point
        0
        Parent
        I just want to clarify that Anthropic doesn’t have the social authority of a governmental firing squad to kill people.
    - MondSemmel 10 Jun 2025 9:01 UTC
      8 points
      2
      Parent
      For me the missing argument in this comment thread is the following: Has anyone spelled out the arguments for how it’s supposed to help us, even incrementally, if one AI lab (rather than all of them) drops out of the AI race? Suppose whichever AI lab is most receptive to social censure could actually be persuaded to drop out; don’t we then just end in an Evaporative Cooling of Group Beliefs situation where the remaining participants in the race are all the more intransigent?
      - sunwillrise 10 Jun 2025 14:25 UTC
        11 points
        4
        Parent
        Has anyone spelled out the arguments for how it’s supposed to help us, even incrementally, if one AI lab (rather than all of them) drops out of the AI race?
        An AI lab dropping out helps in two ways:
        timelines get longer because the smart and accomplished AI capabilities engineers formerly employed by this lab are no longer working on pushing for SOTA models/no longer have access to tons of compute/are no longer developing new algorithms to improve performance even holding compute constant. So there is less aggregate brainpower, money, and compute dedicated to making AI more powerful, meaning the rate of AI capability increase is slowed. With longer timelines, there is more time for AI safety research to develop past its pre-paradigmatic stage, for outreach effort to mainstream institutions to start paying dividends in terms of shifting public opinion at the highest echelons, for AI governance strategies to be employed by top international actors, and for moonshots like uploading or intelligence augmentation to become more realistic targets.
        race dynamics become less problematic because there is one less competitor other top labs have to worry about, so they don’t need to pump out top models quite as quickly to remain relevant/retain tons of funding from investors/ensure they are the ones who personally end up with a ton of power when more capable AI is developed.
        I believe these arguments, frequently employed by LW users and alignment researchers, are indeed valid. But I believe their impact will be quite small, or at the very least meaningfully smaller than what other people on this site likely envision.
        And since I believe the evaporative cooling effects you’re mentioning are also real (and quite important), I indeed conclude pushing Anthropic to shut down is bad and counterproductive.
        ProgramCrafter 10 Jun 2025 14:51 UTC
        1 point
        0
        Parent
        the smart and accomplished AI capabilities engineers formerly employed by this lab are no longer working on pushing for SOTA models/no longer have access to tons of compute/are no longer developing new algorithms to improve performance
        For that to be case, instead of engineers entering another company, we should suggest other tasks. There are very questionable technologies shipped indeed (for example, social media with automatic recommendation algorithms) but someone would have to connect the engineers to the tasks.
      - MichaelDickens 11 Jun 2025 23:45 UTC
        6 points
        2
        Parent
        I agree with sunwillrise but I think there is an even stronger argument for why it would be good for an AI company to drop out of the race. It is a strong jolt that has a good chance of waking up the world to AI risk. It sends a clear message:
        
        We were paper billionaires and we were on track to be actual billionaires, but we gave that up because we were too concerned that the thing we were building could kill literally everyone. Other companies are still building it. They should stop too.
        
        I don’t know exactly what effect that would have on public discourse, but the effect would be large.
        MondSemmel 12 Jun 2025 8:17 UTC
        1 point
        −3
        Parent
        Larger than the OpenAI board fiasco? I doubt it.
        MichaelDickens 12 Jun 2025 16:40 UTC
        2 points
        0
        Parent
        A board firing a CEO is a pretty normal thing to happen, and it was very unclear that the firing had anything to do with safety concerns because the board communicated so little.
        
        A big company voluntarily shutting down because its product is too dangerous is (1) a much clearer message and (2) completely unprecedented, as far as I know.
        
        In my ideal world, the company would be very explicit that they are shutting down specifically because they are worried about AGI killing everyone.
      - Ben Pace 12 Jun 2025 0:57 UTC
        2 points
        0
        Parent
        I make the case here for stopping based on deontological rather than consequentialist reasons.
    - faul_sname 10 Jun 2025 4:20 UTC
      7 points
      6
      Parent
      My understanding was that LessWrong, specifically, was a place where bad arguments are (aspirationally) met with counterarguments, not with attempts to suppress them through coordinated social action. Is this no longer the case, even aspirationally?
      - habryka 10 Jun 2025 5:00 UTC
        10 points
        6
        Parent
        I think it would be bad to suppress arguments! But I don’t see any arguments being suppressed here. Indeed, I see Zack as trying to create a standard where (for some reason) arguments about AI labs being reckless must be made directly to the people who are working at those labs, and other arguments should not be made, which seems weird to me. The OP seems to me like it’s making fine arguments.
        I don’t think it was ever a requirement for participation on LessWrong to only ever engage in arguments that could change the minds of the specific people who you would like to do something else, as opposed to arguments that are generally compelling and might affect those people in indirect ways. It’s nice when it works out, but it really doesn’t seem like a tenet of LessWrong.
        faul_sname 10 Jun 2025 5:10 UTC
        2 points
        0
        Parent
        Ah, I had (incorrectly) interpreted “It’s eminently reasonable for people to just try to stop whatever is happening, which includes intention for social censure, convincing others, and coordinating social action” as being an alternative to engaging at all with the arguments of people who disagree with your positions here, rather than an alternative to having that standard in the outside world with people who are not operating under those norms.
    - Zach Stein-Perlman 10 Jun 2025 3:50 UTC
      2 points
      2
      Parent
      Sure, censure among people who agree with you is a fine thing for a comment to do. I didn’t read Mikhail’s comment that way because it seemed to be asking Anthropic people to act differently (but without engaging with their views).
      - habryka 10 Jun 2025 3:57 UTC
        14 points
        13
        Parent
        It’s OK to ask people to act differently without engaging with your views! If you are stabbing my friends and family I would like you to please stop, and I don’t really care about engaging with your views. The whole point of social censure is to ask people to act differently even if they disagree with you, that’s why we have civilization and laws and society.
  - habryka 10 Jun 2025 3:31 UTC
    32 points
    15
    Parent
    I think Anthropic leadership should feel free to propose a plan to do something that is not “ship SOTA tech like every other lab”. In the absence of such a plan, seems like “stop shipping SOTA tech” is the obvious alternative plan.
    Clearly in-aggregate the behavior of the labs is causing the risk here, so I think it’s reasonable to assume that it’s Anthropic’s job to make an argument for a plan that differs from the other labs. At the moment, I know of no such plan. I have some vague hopes, but nothing concrete, and Anthropic has not been very forthcoming with any specific plans, and does not seem on track to have one.
    - Vaniver 11 Jun 2025 6:28 UTC
      9 points
      2
      Parent
      I think Anthropic leadership should feel free to propose a plan to do something that is not “ship SOTA tech like every other lab”. In the absence of such a plan, seems like “stop shipping SOTA tech” is the obvious alternative plan.
      Note that Anthropic, for the early years, did have a plan to not ship SOTA tech like every other lab, and changed their minds. (Maybe they needed the revenue to get the investment to keep up; maybe they needed the data for training; maybe they thought the first mover effects would be large and getting lots of enterprise clients or w/e was a critical step in some of their mid-game plans.) But I think many plans here fail once considered in enough detail.
    - Stephen McAleese 12 Jun 2025 8:26 UTC
      2 points
      0
      Parent
      Anthropic’s responsible scaling policy does mention pausing scaling if the capabilities of their models exceeds their best safety methods:
      
      “We have designed the ASL system to strike a balance between effectively targeting catastrophic risk and incentivising beneficial applications and safety progress. On the one hand, the ASL system implicitly requires us to temporarily pause training of more powerful models if our AI scaling outstrips our ability to comply with the necessary safety procedures. But it does so in a way that directly incentivizes us to solve the necessary safety issues as a way to unlock further scaling, and allows us to use the most powerful models from the previous ASL level as a tool for developing safety features for the next level.”
      
      I think OP and others in the thread are wondering why Anthropic doesn’t stop scaling now given the risks. I think the reason why is that in practice doing so would create a lot of problems:
      
      How would Anthropic fund their safety research if Claude is no longer SOTA and becomes less popular?
      Is Anthropic supposed to learn from and test only models at current levels of capability and how does it learn about future advanced model behaviors? I haven’t heard a compelling argument for how we could solve superalignment by studying much less advanced models. Imagine trying to align GPT-4 or o3 by only studying and testing GPT-2 from 2019. In reality, future models will probably have lots of unknown unknowns and emergent properties that are difficult or impossible to predict in advance. And then there’s all the social consequences of AI like misuse which are difficult to predict in advance.
      
      Although I’m skeptical that alignment can be solved without a lot of empirical work on frontier models I still think it would better if AI progress were slower.
      - Mikhail Samin 12 Jun 2025 13:40 UTC
        4 points
        4
        Parent
        I don’t expect Anthropic to stick to any of their policies when competitive pressure means they have to train and deploy and release or be left behind. None of their commitments are of a kind they wouldn’t be able to walk back on.
        Anthropic accelerates capabilities more than safety; they don’t even support regulation, with many people internally being misled about Anthropic’s efforts. None of their safety efforts meaningfully contributed to solving any of the problems you’d have to solve to have a chance of having something much smarter than you that doesn’t kill you.
        I’d be mildly surprised if there’s a consensus at Anthropic that they can solve superalignment. The evidence they’re getting shows, according to them, that we live in an alignment-is-hard world.
        If any of these arguments are Anthropic’s, I would love for them to say that out loud.
  - Mikhail Samin 10 Jun 2025 6:09 UTC
    7 points
    0
    Parent
    I’ve generally been aware of/can come up with some arguments; I haven’t heard them in detail from anyone at Anthropoid and would love for Anthropic to write up the plan that includes reasoning why shipping SOTA models helps humanity survive instead of doing the opposite thing.
    The last time I saw Anthropic’s claimed reason for existing, it later became an inspiration for
- eggsyntax 10 Jun 2025 21:36 UTC
  4 points
  2
  Parent
  I’m confused about why you’re pointing to Anthropic in particular here. Are they being overoptimistic in a way that other scaling labs are not, in your view?
  - Mikhail Samin 11 Jun 2025 7:22 UTC
    7 points
    2
    Parent
    Unlike other labs, Anthropic is full of people who care and might leave capabilities work or push for the leadership to be better. It’s a tricky place to be in: if you’re responsible enough, you’ll hear more criticism than less responsible actors, because criticism can still change what you’re doing.
    Other labs are much less responsible, to be clear. There’s it a lot (I think) my words here can do about that, though.
    - eggsyntax 11 Jun 2025 16:25 UTC
      2 points
      0
      Parent
      Got it. It might be worth adding something like that to the post, which in my opinion reads as if it’s singling out Anthropic as especially deserving of criticism.
- DirectedEvolution 10 Jun 2025 7:15 UTC
  4 points
  2
  Parent
  I understand your argument and it has merit, but I think the reality of the situation is more nuanced.
  Humanity has long build buildings and bridges without access to formal engineering methods for predicting the risk of collapse. We might regard it as unethical to build such a structure now without using the best practically available engineering knowledge, but we do not regard it as having been unethical to build buildings and bridges historically due to the lack of modern engineering materials and methods. They did their best, more or less, with the resources they had access to at the time.
  AI is a domain where the current state of the art safety methods are in fact being applied by the major companies, as far as I know (and I’m completely open to being corrected on this). In this respect, safety standards in the AI field are comparable to those of other fields. The case for existential risk is approximately as qualitative and handwavey as the case for safety, and I think that both of these arguments need to be taken seriously, because they are the best we currently have. It is disappointing to see the cavalier attitude with which pro-AI pundits dismiss safety concerns, and obnoxious to see the overly confident rhetoric deployed by some in the safety world when they tweet about their p(doom). It is a weird and important time in technology, and I would like to see greater open-mindedness and thoughtfulness about the ways to make progress on all of these important issues.
- Viliam 10 Jun 2025 14:33 UTC
  3 points
  0
  Parent
  No other engineering field would accept “I hope we magically pass the hardest test on the first try, with the highest stakes” as an answer.
  Perhaps the answer is right there, in the name. The future Everett branches where we still exist will indeed be the ones where we have magically passed the hardest test on the first try.
  - Mikhail Samin 10 Jun 2025 16:10 UTC
    5 points
    0
    Parent
    Branches like that don’t have a lot of reality-fluid and lost most of the value of our lightcone; you’re much more likely to find yourself somewhere before that.
- Mikhail Samin 10 Jun 2025 21:14 UTC
  2 points
  1
  Parent
  Does “winning the race” actually give you a lever to stop disaster, or does it just make Anthropic the lab responsible for the last training run?
  Does access to more compute and more model scaling, with today’s field understanding, truly give you more control—or just put you closer to launching something you can’t steer? Do you know how to solve alignment given even infinite compute?
  Is there any sign, from inside your lab, that safety is catching up faster than capabilities? If not, every generation of SOTA increases the gap, not closes it.
  “Build the bomb, because if we don’t, someone worse will.”
  Once you’re at the threshold where nobody knows how to make these systems steerable or obedient, it doesn’t matter who is first—you still get a world-ending outcome.
  If Anthropic, or any lab, ever wants to really make things go well, the only winning move is not to play, and try hard to make everyone not play.
  If Anthropic was what it imagines itself being, it would build robust field-wide coordination and support regulation that would be effective globally, even if it means watching over your shoulder for colleagues and competitors across the world.
  If everyone justifies escalation as “safety”, there is no safety.
  In the end, if the race leads off a cliff, the team that runs fastest doesn’t “win”: they just get there first. That’s not leadership. It’s tragedy.
  If you truly care about not killing everyone, will have to be a point—maybe now—where some leaders stop, even if it costs, and demand a solution that doesn’t sacrifice the long-term for a financial gain due to having a model slightly better than those of your competitors.
  Anthropic is in a tricky place. Unlike other labs, it is full of people who care. The leadership has to adjust for that.
  That makes you one of the few people in history who has the chance to say “no” to the spiral to the end of the world and demand of your company to behave responsibly.
  
  (note: many of these points are AI-generated by a model with 200k tokens of Arbital in its context; though heavily edited.)