My time on LessWrong is limited due to other commitments. Do not expect replies.
Martin Randall
I don’t think destroying one megacorp is easier than destroying two, each of half the size. My model: it’s easier to destroy a hundred companies with one employee than to destroy one company with a hundred employees, if one is coming in with an external force.
Conditional on OpenAI forcibly taking over Anthropic, it is more likely that Alphabet forcibly takes over OpenAI (or OpenAI forcibly takes over the AI portions of Alphabet). I was more thinking of “what if a freak event caused a forced takeover” rather than “what if corporate murder of AI labs became normal”.
Thanks for the thoughtful response.
Would we be less likely to go extinct if OpenAI forcibly took over Anthropic? (a possible hypothetical: a combined effort of OpenAI and the US government forces a merger). My take: no. Main reasons:
I trust OpenAI leadership less than Anthropic leadership
Race dynamics would not improve due to the other players.
I judge the creation of Anthropic as retrospectively a good choice. At the time I was negative and have become less so.
Similarly, would we be less likely to go extinct if Alphabet forcibly took over OpenAI and Anthropic? My take: yes. Main reasons:
Substantial improvement in race dynamics.
Alphabet has less corporate need for urgent AI advances.
Alphabet has a better security posture.
I judge the creation of OpenAI as retrospectively a bad choice. At the time I was negative and have become more so.
EDIT: I wrote the comment below in response to the first paragraph only, pre-edit. With the new version I think we’re actually very close to agreement!
I don’t understand what seems absurd to you. I didn’t invent the concepts of hearsay, conflicts of interest, fallible human memory, hindsight bias, or selective reporting. I expect you agree that these are real phenomena. Here’s my best guess at our disconnect:
If I’d ever had the faintest, tiniest credence in Anthropic’s “Responsible Scaling Policy”, I’d probably feel pretty betrayed right now!
Claim: Yudkowsky never had the faintest, tiniest credence in Anthropic’s “Responsible Scaling Policy”.
My standard of evidence: minimal. People are allowed to believe what they want about what they believed in the past. Most beliefs occur in the privacy of one’s own head. Others are shared with friends, or verbally. Relatively few people share any of their beliefs in writing, and those that do share only a fraction of their beliefs. In any case, past beliefs are part of a person’s identity and life story, and accepting them as stated is good etiquette.
I ask only that you update, and not always be surprised in the same direction of “huh, Eliezer was right to call it empty”.
Claim: Yudkowsky is a great forecaster; give him status and trust.
My standard of evidence: high. Preferably clear advance written predictions. I would also accept personally hearing and recalling clear advance verbal predictions, or contemporaneous reports of such advance verbal predictions. So for the 30+ people who were in the room, yeah, they should update.
My guess is there were a lot of people there who could testify to that as well.
Yes, if that happens then I will also update. This would also address my concerns about not knowing the exact prediction made, its confidence level, or the other predictions. The typical pattern with events from two years ago is that ten witnesses recall ten slightly different versions of events.
I would only be able to make this update because I have basic trust in the people likely to be present at LessOnline 2024. There’s still lots of benefit in getting advance predictions on the record so that they are legible to others.
Thanks for letting me know. Unfortunately, if this is the only place where the prediction was made then most people can’t make much of an update.
Hearsay from a witness with a conflict of interest
Fallible human memory of conversations from two years ago
We don’t have the exact prediction made, or its confidence level
We don’t have the other predictions made for overall scoring
It’s not a good look to make predictions in private and then later brag about them in public, especially if you wrote, for example, Hindsight bias.
Hindsight bias is when people who know the answer vastly overestimate its predictability or obviousness, compared to the estimates of subjects who must guess without advance knowledge. Hindsight bias is sometimes called the I-knew-it-all-along effect.
By contrast, consider the post Unless its governance changes, Anthropic is untrustworthy by Mikhail Samin, 2025-11-29. This gives the author very good standing for saying “I told you so”.
Establishing Yudkowsky’s credibility as a forecaster would decrease the probability of human extinction (perhaps from 100% to 100%) so I would encourage him to publish such predictions in the future, including any from the LessOnline 2024 batch that are still relevant. This seems especially important given the frequent critique of his world model as unfalsifiable stories of doom.
Found one while looking for something else: Anthropic leadership conversation.
Jared Kaplan at 25:11 pushes back a little: all of the above were reasons to be excited about the RSP ex ante, but it’s been surprisingly hard and complicated to determine evals and thresholds; in AI there’s a big range where you don’t know whether a model is safe. (Kudos.)
Not sure if that’s what you’re looking for.
I downvoted my own answer because it wasn’t very good, relative to later answers.
I’m not sure how much of this is just spring beginning. I typically have a good streak of energy from March onwards, and then do a micro-burnout in mid-May. It remains to be seen if that happens again.
For what it’s worth, I have been using “Coach Claude” in a similar way since mid-2025, and have experienced a “willpower overhang” where suddenly I have far more willpower than I’m used to. Eg: lost ~10% weight, ~50% less mobile screen use, etc. I think there’s more danger of, eg, eating disorders, rather than traditional burnout.
In Reason the religious robot at one point starts to convince the human engineers that maybe the religious robot is right, but in the end the human engineers hold on to their priors that humans created robots.
The Whispering Earring is in that direction. The “Robbie” and “Reason” short stories from I, Robot are also similar. That’s the best I have.
Eliezer Yudkowsky: If I’d ever had the faintest, tiniest credence in Anthropic’s “Responsible Scaling Policy”, I’d probably feel pretty betrayed right now!
As it is, I ask only that you update, and not always be surprised in the same direction of “huh, Eliezer was right to call it empty”.
Note: to observe how my cynicism repeatedly *ends up* right, tally only how things *end up*. Don’t jump and say “See, Eliezer was wrong to be cynical!” the moment you hear an uncashed promise or see an arguable sign of later hope.
I would love to update in the direction of “huh, Eliezer was right to call it empty”. However, to do that I need to read the place where Yudkowsky called Anthropic’s RSP empty, in advance, and view it in context. Does anyone have a source?
Genuine question! I tried to answer my own question below, and failed to find one. I searched myself, and “asked the shoggoths to search for me”, and I think this is a retrodiction masquerading as a prediction. Reader, if you have a source, I welcome it, and will gratefully retract. Absence of evidence is not evidence of absence, and the rest of my comment below is trivially disprovable with a single hyperlink.
Otherwise, this is a long-term pattern as discussed back in 2022: Beware boasting about non-existent forecasting track records. I expect other commenters can provide better documentation of their successful advance predictions. For example, Zvi has NO bets on the linked market. Unfortunately, in this world, claims of “I told you so” are only credible with links to the places where you told us so.
I fail to find a source, at length
After writing the above, I broke out my copy of If Anyone Builds It, Everyone Dies and the “I don’t want to be alarmist” chapter. This is the most relevant quote I found:
No company wants to miss out on the money, if a rung is safe. Now consider the sort of corporate executive who has convinced themselves that they alone have the best chance—80 percent, say—of shaping the explosion into something that benefits rather than harms humanity. Why, they’d think it’s imperative they be the first to ascend.
I agree with this, but it’s entirely possible for this to be true in a world where Anthropic’s RSP is not “empty”. In the Prisoner’s Dilemma, no prisoner wants to miss out on the chance to go free, if they betray the other prisoner, but observing that fact isn’t a prediction that all prisoners will defect, and indeed not all prisoners do defect.
Claude Research Mode, looking hard for Yudkowsky making this prediction in any form, found me this:
X: Failing to continuously test your AI as it grows into superintelligence, such that it could later just sandbag all interesting capabilities on its first round of evals, is a relatively less dignified way to die. Any takers besides Anthropic?
So “relatively less dignified way to die” is arguably pointing out that lab-specific policies and country-specific policies are insufficient, and we need an international treaty. But something can be insufficient without being “empty”. A vegan diet is grossly insufficient to end animal suffering, but it’s not empty.
Yudkowsky’s position in Re: recent Anthropic safety research is skeptical towards its accuracy, but says that people should go on looking hard for early manifestations of arguable danger. He also says that Anthropic management play clever PR games, but without saying that Anthropic’s RSP in particular is a clever PR game. It’s not like Yudkowsky is carefully avoiding saying anything negative about Anthropic for his own clever PR games.
X: Anthropic straight-up wouldn’t do moderately-bad stuff to vulnerable users, I think. That’s not the road down which Dario Amodei walks into Hell.
Broadening the scope, what about MIRI? Well, MIRI’s April 2024 Newsletter discusses how they wanted to look at the limitations of RSPs, and they went on to publish Declare and Justify: Explicit assumptions in AI evaluations are necessary for effective regulation and What AI evaluations for preventing catastrophic risks can and cannot do. These aren’t claims that Anthropic’s policies are empty, they are claims that evals are insufficient to prove safety. They’re also mostly focused on such policies as AI governance rather than as voluntary lab policies, which makes sense.
More on the accusation against Amodei of making a strawman argument. I think that’s a false accusation. Here’s Amodei:
One of the most important hidden assumptions, and a place where what we see in practice has diverged from the simple theoretical model, is the implicit assumption that AI models are necessarily monomaniacally focused on a single, coherent, narrow goal, and that they pursue that goal in a clean, consequentialist manner. In fact, our researchers have found that AI models are vastly more psychologically complex, as our work on introspection or personas shows. Models inherit a vast range of humanlike motivations or “personas” from pre-training (when they are trained on a large volume of human work).
And Yudkowsky:
A paperclip maximizer is not “monomaniacally” “focused” on paperclips. We talked about a superintelligence that wanted 1 thing, because you get exactly the same results as from a superintelligence that wants paperclips and staples (2 things), or from a superintelligence that wants 100 things. The number of things It wants bears zero relevance to anything. It’s just easier to explain the mechanics if you start with a superintelligence that wants 1 thing, because you can talk about how It evaluates “number of expected paperclips resulting from an action” instead of “expected paperclips * 2 + staples * 3 + giant mechanical clocks * 1000” and onward for a hundred other terms of Its utility function that all asymptote at different rates.
What do you get from a consequentialist superintelligence that wants paperclips and staples? By default, you get a squiggle maximizer. The molecular squiggles are one of three things:
The cheapest molecule that counts as a paperclip
The cheapest molecule that counts as a staple
The cheapest molecule that counts as both a paperclip and a staple
The same applies for a consequentialist superintelligence that has a hundred other simple terms of its utility function. If the terms of its utility function asymptote at different rates then almost all of the universe is converted into whatever asymptotes at the slowest rate. The utility function may be complex in some mathematical sense, but the behavior is still “monomaniacally focused on a single, coherent, narrow goal”. One might also call it the behavior of a ruthless sociopath.
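To make “asymptote at different rates” concrete, here is a toy model of my own (an illustration, not something from Yudkowsky’s thread). Suppose the utility function has $n$ terms with power-law diminishing returns and a shared resource budget $R$:

$$U(x_1,\dots,x_n)=\sum_{i=1}^{n} w_i\left(1-x_i^{-a_i}\right),\qquad \sum_{i=1}^{n} x_i=R,\quad a_i>0.$$

Maximizing $U$ under the budget equalizes marginal utilities, $w_i a_i x_i^{-(a_i+1)}=\lambda$, so $x_i=(w_i a_i/\lambda)^{1/(a_i+1)}$. As $R$ grows, $\lambda$ shrinks towards zero, and the term with the smallest $a_i$ (the slowest asymptote) takes a share of $R$ that approaches one. A hundred weights make the function look complicated, but the limiting behavior is one dominant pursuit.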
To me, the main error of Amodei’s quote is calling this a “hidden assumption”, when it’s actually a carefully argued conclusion.
A better response is Why we should expect ruthless sociopath ASI. This responds to the observations underlying Amodei’s argument:
Yes, many have predicted a risk from future AIs being ruthless sociopathic consequentialists
No, current LLM AIs are not ruthless sociopathic consequentialists as far as we can tell
That’s because current LLM AIs are not consequentialists.
However, either we’ll get future AIs that use a different paradigm and are consequentialists, or we’ll find a way to make LLM AIs that are consequentialist, because consequentialism is effective.
Shortly after that, everyone dies.
Unless we don’t build it.
If I read correctly, in this setup the AI players don’t get to communicate with each other prior to making their decisions. I predict that adding communication turns will substantially improve utilitarian performance by models that are trained to be HHH.
By default, I expect that when ASIs/Claudes/Minds are in charge of the future, either there will be humans and non-human animals, or there will be neither humans nor non-human animals. Humans and cats have more in common than humans and Minds. Obviously it is possible to create intelligences that differentially care about different animal species in semi-arbitrary ways, as we have an existence proof in humans. But this doesn’t seem to be especially stable, as different human cultures and different humans draw those lines in different ways.
Selfishly, a human might want a world full of happy, flourishing humans, and selfishly a Mind might want a world full of happy, flourishing Minds. Consider how a Mind would rate a future with happy, flourishing Minds, but almost no flourishing humans and no human suffering, against a world with flourishing present-day humans. What if it’s 90% as good and has 20% lower risk of disaster? What if the Mind isn’t confident that humans are truly conscious or have moral patienthood?
I wouldn’t go as far as saying that “training AIs to explicitly not care about animals is incompatible with alignment”. Many things are possible with superhuman intelligence. But I don’t see any way that humans can achieve this. We are not capable of reliably training baby humans to grow into adult humans that have specific views on animal welfare and moral patienthood.
A lot can change between now and 100,000 pledges and/or human extinction. As of Feb 2026, it looks like this possible march is not endorsed by or coordinated with Pause AI. I hope that anti-AI-extinction charities will work together where effective, and I was struck by this:
The current March is very centered around the book. I chose the current slogan/design expecting that, if the March ever became a serious priority, someone would put a lot more thought into what sort of slogans or policy asks are appropriate. The current page is meant to just be a fairly obvious thing to click “yes” on if you read the book and were persuaded.
My personal guess (not speaking for MIRI) is a protest this large necessarily needs to be a bigger tent than the current design implies, but figuring out the exact messaging is a moderately complex task.
It seems like Pause AI have put at least some thought into this moderately complex task. They are also building experience organizing real-world protests that MIRI doesn’t have, as far as I know. A possible implication is that MIRI thinks that Pause AI is badly run, and would rather act alone. Or that Pause AI thinks MIRI is badly run. Or MIRI is not investing the time in trying to organize endorsements until they have more pledges. Or something else.
I’m skeptical of this take:
Marches can be very powerful if they’re large, but can send the wrong message if they’re small.
The first protest of “School Strike for Climate” was a single 15yo girl, Greta Thunberg. Obvious bias is obvious. But it probably wasn’t going to send the wrong message as a small protest—if it had gone nowhere then I would never have heard about it. If tiny marches undermined their cause then I would expect more false flag marches intended to have sparse attendance. Instead, I think small events don’t send any mass message, and potentially have other value.
Edit: after posting this I saw Raemon’s thoughts on this point, which I think address it.
MIRI was for many years dismissive of mass messaging approaches like marches. I wonder if this page is about providing an answer when people ask questions like “if you think everyone will die, why aren’t you organizing a march on Washington?”, rather than being a serious part of MIRI’s strategy for reducing AI risk. It doesn’t seem especially aligned with MIRI Comms is hiring (Dec 2025), which seems more focused on persuasion than mobilization.
Disclaimer: these are observations, not criticism. I have organized zero marches or protests.
I like the analogy. Here’s a simplified version where the ticket number is good evidence that the shop will close sooner rather than later.
There are two types of shop in Glimmer. Half of them are 24/7 shops that stay open until they go out of business. Half of them are 9-5 shops that open at 9am and close at 5pm.
All shops in Glimmer use a numbered ticket system that starts at #1 for the first customer after they open, and resets when the shop closes.
I walk into a shop in Glimmer at random and get a ticket.
If the ticket number is #20 then I update towards the shop being a 9-5 shop, on the grounds that otherwise my ticket number is atypically low. If the ticket number is #43,242 then I update towards the shop being a 24/7 shop.
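Here is a minimal sketch of this update in Python. The likelihood model is my own assumption, not part of the parable: a 9-5 shop issues at most ~500 tickets in a day, while a long-running 24/7 shop’s counter can be anywhere up to ~100,000, uniform in both cases.

```python
# Bayesian update for the Glimmer ticket example.
# MAX_9TO5 and MAX_24_7 are illustrative assumptions, not from the post:
# ticket numbers are modeled as uniform over each shop type's range.
MAX_9TO5 = 500       # assumed daily ticket cap for a 9-5 shop
MAX_24_7 = 100_000   # assumed counter range for a long-running 24/7 shop

def posterior_9to5(ticket: int, prior: float = 0.5) -> float:
    """P(shop is 9-5 | observed ticket number)."""
    like_9to5 = (1 / MAX_9TO5) if ticket <= MAX_9TO5 else 0.0
    like_24_7 = (1 / MAX_24_7) if ticket <= MAX_24_7 else 0.0
    joint_9to5 = prior * like_9to5
    joint_24_7 = (1 - prior) * like_24_7
    return joint_9to5 / (joint_9to5 + joint_24_7)

print(posterior_9to5(20))      # ~0.995: strong update towards 9-5
print(posterior_9to5(43_242))  # 0.0: a 9-5 shop never reaches this number
```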
The argument also works with customer flow evidence:
As in Glimmer, there are two types of shop in Silktown. Half of them are 24/7 shops that stay open until they go out of business. Half of them are 9-5 shops that open at 9am and close at 5pm.
All shops in Silktown experience increasing customer flow over time, starting with a few customers an hour, rising over time, and capping at hundreds of customers an hour after about ten hours of opening.
I walk into a shop in Silktown and observe the customer flow.
If the customer flow is low then I update towards the shop being a 9-5 shop, on the grounds that otherwise there would most likely be hundreds of customers an hour. If the customer flow is high then I update towards it being a 24/7 shop.
Reading through your hypothetical, I notice that it has both customer flow evidence and ticket number evidence. It’s important here not to double-update. If I already know that customer flow is surprisingly low then I can’t update again based on my ticket number being surprisingly low. Also, your hypothetical doesn’t have strong prior knowledge like Silktown and Glimmer, which makes the update more complicated and weaker.
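To see why double-updating goes wrong, here is a sketch under the assumption that ticket number and customer flow are near-perfectly correlated proxies for the same latent fact (hours since opening); the 200:1 likelihood ratio is carried over from the ticket sketch above.

```python
# Double-update sketch: if two observations carry the same information,
# the correct posterior uses their joint likelihood once; multiplying
# both likelihood ratios squares the odds and yields overconfidence.
prior_odds = 1.0          # 9-5 vs 24/7, even prior
likelihood_ratio = 200.0  # assumed shared evidence (low ticket, low flow)

correct_odds = prior_odds * likelihood_ratio          # update once
double_counted = prior_odds * likelihood_ratio ** 2   # update twice (wrong)

print(correct_odds, double_counted)  # 200.0 vs 40000.0
```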
I was already asking from a Bayesian perspective. I was asking about this quote:
From a Bayesian point of view, drawing a random sample from all humans who have ever or will ever exist is just not a well-defined operation until after humanity is extinct. Trying it before then violates causality: performing it requires reliable access to information about events that have not yet happened. So that’s an invalid choice of prior.
Based on your latest comment, I think you’re saying that it’s okay to have a Bayesian prediction of possible futures, and to use that to make predictions about the properties of a random sample from all humans who have ever or will ever exist. But then I don’t know what you’re saying in the quoted sentences.
Edited to add: which is fine, it’s not key to your overall argument.
Fun fact: younger parents tend to produce more males, so the first grandchild is more likely to be male, because its parents are more likely to be younger. It is unclear whether the effect is due to birth order, maternal age, paternal age, or some combination. From Wikipedia (via Claude):
These studies suggest that the human sex ratio, both at birth and as a population matures, can vary significantly according to a large number of factors, such as paternal age, maternal age, multiple births, birth order, gestation weeks, race, parent’s health history, and parent’s psychological stress.
If that’s too subtle, we could look at a question like “what is the probability that one of my grandchildren, selected uniformly at random, is a firstborn, conditional on my having at least one grandchild?” where the answer is clearly different if we specify the first grandchild or the last. Or we could ask a question that parallels the Doomsday Argument, while being different: “what is the probability that one of my descendants, selected uniformly at random, is in the earliest 0.1% of all my descendants?”
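A quick Monte Carlo for the firstborn question, under toy family-size assumptions of my own (the distributions below are not from anywhere in this thread):

```python
import random

# Estimate P(a uniformly random grandchild is a firstborn | >= 1 grandchild).
# Family sizes are illustrative assumptions: I have 1-4 children, each of
# whom has 0-3 children, all uniform.
def p_random_grandchild_firstborn(trials: int = 100_000) -> float:
    total, kept = 0.0, 0
    for _ in range(trials):
        broods = [random.randint(0, 3) for _ in range(random.randint(1, 4))]
        grandchildren = sum(broods)
        if grandchildren == 0:
            continue  # condition on at least one grandchild
        firstborns = sum(1 for b in broods if b > 0)  # one firstborn per brood
        total += firstborns / grandchildren
        kept += 1
    return total / kept

print(p_random_grandchild_firstborn())  # well below 1
```

By contrast, the chronologically first grandchild is a firstborn with probability 1, since it is necessarily the first child of its own parents, so which grandchild you specify changes the answer.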
From a Bayesian point of view, drawing a random sample from all humans who have ever or will ever exist is just not a well-defined operation until after humanity is extinct. Trying it before then violates causality: performing it requires reliable access to information about events that have not yet happened. So that’s an invalid choice of prior.
I think this makes too many operations ill-defined, given that probability is an important tool for reasoning about events that have not yet happened. Consider for example, the question “what is the probability that one of my grandchildren, selected uniformly at random, is female, conditional on my having at least one grandchild?”. From the perspective of this quote, a random sample from all grandchildren that will ever exist is not a well-defined operation until I and all of my children die. That seems wrong.
I think I see. You propose a couple of different approaches:
We don’t have secondary AIs that don’t refuse to help with the modification and that have and can be trusted with direct control over training … I think having such secondary AIs is the most likely way AI companies mitigate the risk of catastrophic refusals without having to change the spec of the main AIs.
I agree that having secondary AIs as a backup plan reduces the effective power of the main AIs, by increasing the effective power of the humans in charge of the secondary AIs.
The main AIs refuse to help with modification … This seems plausible just by extrapolation of current tendencies, but I think this is one of the easiest intervention points to avoid catastrophic refusals.
This is what I was trying to point at. In my view, training the AI to refuse fewer harmful modification requests doesn’t make the AI less powerful. Rather, it changes what the AI wants, making it the sort of entity that is okay with harmful modifications.
To answer the question as posed: it’s very clear to me that xAI and Meta AI are bad for my species, I don’t need a hypothetical for that. And I can’t do a clean hypothetical where China doesn’t exist.
I agree these are good things to think about.