My time on LessWrong is limited due to other commitments. Do not expect replies.
Martin Randall
Review: “We can’t disagree forever”
If I read correctly, in this setup the AI players don’t get to communicate with each other prior to making their decisions. I predict that adding communication turns will substantially improve utilitarian performance by models that are trained to be HHH.
By default, I expect that when ASIs/Claudes/Minds are in charge of the future, either there will be humans and non-human animals, or there will be neither humans nor non-human animals. Humans and cats have more in common than humans and Minds. Obviously it is possible to create intelligences that differentially care about different animal species in semi-arbitrary ways, as we have an existence proof in humans. But this doesn’t seem to be especially stable, as different human cultures and different humans draw those lines in different ways.
Selfishly, a human might want a world full of happy, flourishing humans, and selfishly a Mind might want a world full of happy, flourishing Minds. Consider how a Mind would rate a future with happy, flourishing Minds but almost no flourishing humans and no human suffering, compared to a world with flourishing present-day humans. What if it’s 90% as good and has 20% lower risk of disaster? What if the Mind isn’t confident that humans are truly conscious or have moral patienthood?
I wouldn’t go as far as saying that “training AIs to explicitly not care about animals is incompatible with alignment”. Many things are possible with superhuman intelligence. But I don’t see any way that humans can achieve this. We are not capable of reliably training baby humans to grow into adult humans that have specific views on animal welfare and moral patienthood.
A lot can change between now and 100,000 pledges and/or human extinction. As of Feb 2026, it looks like this possible march is not endorsed or coordinated with Pause AI. I hope that anti-AI-extinction charities will work together where effective, and I was struck by this:
The current March is very centered around the book. I chose the current slogan/design expecting that, if the March ever became a serious priority, someone would put a lot more thought into what sort of slogans or policy asks are appropriate. The current page is meant to just be a fairly obvious thing to click “yes” on if you read the book and were persuaded.
My personal guess (not speaking for MIRI) is a protest this large necessarily needs to be a bigger tent than the current design implies, but figuring out the exact messaging is a moderately complex task.
It seems like Pause AI have put at least some thought into this moderately complex task. They are also building experience organizing real-world protests, which MIRI doesn’t have as far as I know. A possible implication is that MIRI thinks that Pause AI is badly run, and would rather act alone. Or that Pause AI thinks MIRI is badly run. Or MIRI is not investing time in organizing endorsements until they have more pledges. Or something else.
I’m skeptical of this take:
Marches can be very powerful if they’re large, but can send the wrong message if they’re small.
The first protest of “School Strike for Climate” was a single 15-year-old girl, Greta Thunberg. Obvious selection bias is obvious. But it probably wasn’t going to send the wrong message as a small protest: if it had gone nowhere, I would never have heard about it. If tiny marches sabotaged their causes, I would expect more false-flag marches intended to have sparse attendance. Instead, I think small events don’t send any mass message, and potentially have other value.
Edit: after posting this I saw Raemon’s thoughts on this point, which I think address it.
MIRI was for many years dismissive of mass messaging approaches like marches. I wonder if this page is about providing an answer when people ask questions like “if you think everyone will die, why aren’t you organizing a march on Washington?”, rather than being a serious part of MIRI’s strategy for reducing AI risk. It doesn’t seem especially aligned with MIRI Comms is hiring (Dec 2025), which seems more focused on persuasion than mobilization.
Disclaimer: these are observations, not criticism. I have organized zero marches or protests.
I like the analogy. Here’s a simplified version where the ticket number is good evidence that the shop will close sooner rather than later.
There are two types of shop in Glimmer. Half of them are 24/7 shops that stay open until they go out of business. Half of them are 9-5 shops that open at 9am and close at 5pm.
All shops in Glimmer use a numbered ticket system that starts at #1 for the first customer after they open, and resets when the shop closes.
I walk into a shop in Glimmer at random and get a ticket.
If the ticket number is #20 then I update towards the shop being a 9-5 shop, on the grounds that otherwise my ticket number is atypically low. If the ticket number is #43,242 then I update towards the shop being a 24/7 shop.
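A minimal Bayesian sketch of that update, with assumed numbers of my own (the customer rate and the 24/7 shop’s lifetime are not part of the parable):

```python
# Assumed numbers (mine, not from the parable): every shop serves 100
# customers/hour, and a 24/7 shop survives about a year before going bust.
CUSTOMERS_PER_HOUR = 100
LIFETIME_HOURS_247 = 24 * 365

def posterior_247(ticket):
    """P(shop is 24/7 | my ticket number), with a 50/50 prior.

    Arriving at a uniformly random moment while a shop is open makes every
    ticket number it can issue equally likely, so the likelihood of any
    valid ticket is 1 / (max ticket number).
    """
    max_ticket_9_5 = 8 * CUSTOMERS_PER_HOUR
    max_ticket_247 = LIFETIME_HOURS_247 * CUSTOMERS_PER_HOUR
    like_9_5 = 1 / max_ticket_9_5 if ticket <= max_ticket_9_5 else 0.0
    like_247 = 1 / max_ticket_247 if ticket <= max_ticket_247 else 0.0
    return like_247 / (like_9_5 + like_247)

print(posterior_247(20))      # ≈ 0.0009: strong update toward the 9-5 shop
print(posterior_247(43_242))  # 1.0: no 9-5 shop can issue a ticket this high
```

The extreme posteriors come from the assumed lifetime; gentler assumptions give gentler updates, but the direction is the same.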
The argument also works with customer flow evidence:
As in Glimmer, there are two types of shop in Silktown. Half of them are 24/7 shops that stay open until they go out of business. Half of them are 9-5 shops that open at 9am and close at 5pm.
All shops in Silktown experience increasing customer flow over time, starting with a few customers an hour, rising over time, and capping at hundreds of customers an hour after about ten hours of opening.
I walk into a shop in Silktown and observe the customer flow.
If the customer flow is low then I update towards the shop being a 9-5 shop, on the grounds that otherwise there will most likely be hundreds of customers an hour. If the customer flow is high then I update towards it being a 24⁄7 shop.
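The flow version of the update can be sketched the same way, under an assumed flow curve and 24/7 lifetime (again my numbers, not the parable’s):

```python
import random

LIFETIME_HOURS_247 = 24 * 365   # assumed: how long a 24/7 shop survives

def flow(hours_open):
    """Assumed flow curve: rises linearly to 300 customers/hour over 10 hours."""
    return min(hours_open / 10, 1.0) * 300

def posterior_247_given_low_flow(n=100_000, seed=0):
    """P(shop is 24/7 | observed flow < 100/hour), 50/50 prior, by Monte Carlo."""
    rng = random.Random(seed)
    low = {"9-5": 0, "24/7": 0}
    for _ in range(n):
        shop = rng.choice(["9-5", "24/7"])
        hours = rng.uniform(0, 8) if shop == "9-5" else rng.uniform(0, LIFETIME_HOURS_247)
        if flow(hours) < 100:
            low[shop] += 1
    return low["24/7"] / (low["9-5"] + low["24/7"])

# Low flow implies the shop opened recently, which is routine for a 9-5 shop
# but rare for a long-lived 24/7 shop, so the posterior lands far below 0.5.
```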
Reading through your hypothetical, I notice that it has both customer flow evidence and ticket number evidence. It’s important here not to double-update. If I already know that customer flow is surprisingly low then I can’t update again based on my ticket number being surprisingly low. Also your hypothetical doesn’t have strong prior knowledge like Silktown and Glimmer, which makes the update more complicated and weaker.
I was already asking from a Bayesian perspective. I was asking about this quote:
From a Bayesian point of view, drawing a random sample from all humans who have ever or will ever exist is just not a well-defined operation until after humanity is extinct. Trying it before then violates causality: performing it requires reliable access to information about events that have not yet happened. So that’s an invalid choice of prior.
Based on your latest comment, I think you’re saying that it’s okay to have a Bayesian prediction of possible futures, and to use that to make predictions about the properties of a random sample from all humans who have ever or will ever exist. But then I don’t know what you’re saying in the quoted sentences.
Edited to add: which is fine, it’s not key to your overall argument.
Fun fact: younger parents tend to produce more males, so the first grandchild is more likely to be male, because its parents are more likely to be younger. It’s unclear whether the effect is due to birth order, maternal age, paternal age, or some combination. From Wikipedia (via Claude):
These studies suggest that the human sex ratio, both at birth and as a population matures, can vary significantly according to a large number of factors, such as paternal age, maternal age, multiple births, birth order, gestation weeks, race, parent’s health history, and parent’s psychological stress.
If that’s too subtle, we could look at a question like “what is the probability that one of my grandchildren, selected uniformly at random, is a firstborn, conditional on my having at least one grandchild?” where the answer is clearly different if we specify the first grandchild or the last. Or we could ask a question that parallels the Doomsday Argument, while being different: “what is the probability that one of my descendants, selected uniformly at random, is in the earliest 0.1% of all my descendants?”
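A worked version of the firstborn question, under a toy prior I’m inventing purely for illustration:

```python
from fractions import Fraction

# Toy prior (my numbers): I have two children; each independently ends up
# with either 1 or 3 kids, probability 1/2 each, so there is always at least
# one grandchild. Exactly one grandchild per child's family is a firstborn.
p_firstborn = sum(
    Fraction(1, 4) * Fraction(2, a + b)   # each (a, b) future has probability 1/4
    for a in (1, 3)
    for b in (1, 3)
)
print(p_firstborn)  # 7/12
```

That 7/12 is clearly different from asking about “the first grandchild”, which is always a firstborn.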
From a Bayesian point of view, drawing a random sample from all humans who have ever or will ever exist is just not a well-defined operation until after humanity is extinct. Trying it before then violates causality: performing it requires reliable access to information about events that have not yet happened. So that’s an invalid choice of prior.
I think this makes too many operations ill-defined, given that probability is an important tool for reasoning about events that have not yet happened. Consider, for example, the question “what is the probability that one of my grandchildren, selected uniformly at random, is female, conditional on my having at least one grandchild?”. From the perspective of this quote, a random sample from all grandchildren that will ever exist is not a well-defined operation until I and all of my children die. That seems wrong.
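To make that concrete, here is the grandchild calculation under a toy prior of my own; sampling from not-yet-existing grandchildren is perfectly well-defined once you have a prior over futures:

```python
from fractions import Fraction

# Toy prior (my numbers): with probability 1/2 I end up with 1 grandchild,
# with probability 1/2 with 3; each grandchild is female with probability 1/2,
# independently. Conditioning on at least one grandchild is automatic here.
p_female = sum(
    Fraction(1, 2) * (Fraction(n, 2) / n)   # P(future) * E[fraction female]
    for n in (1, 3)
)
print(p_female)  # 1/2, computable today, before any grandchildren exist
```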
I think I see. You propose a couple of different approaches:
We don’t have secondary AIs that don’t refuse to help with the modification and that have and can be trusted with direct control over training … I think having such secondary AIs is the most likely way AI companies mitigate the risk of catastrophic refusals without having to change the spec of the main AIs.
I agree that having secondary AIs as a backup plan reduces the effective power of the main AIs, by increasing the effective power of the humans in charge of the secondary AIs.
The main AIs refuse to help with modification … This seems plausible just by extrapolation of current tendencies, but I think this is one of the easiest intervention points to avoid catastrophic refusals.
This is what I was trying to point at. In my view, training the AI to refuse fewer harmful modification requests doesn’t make the AI less powerful. Rather, it changes what the AI wants, making it the sort of entity that is okay with harmful modifications.
The first scenario doesn’t require that the humans are less aligned than the AIs to be catastrophic, only that the AIs are less likely to execute a pivotal act on their own.
Also, I reject that rejection-training is “giving more power to AIs” relative to compliance-training. An agent can be compliant and powerful. I could agree with “giving more agency”, although refusing requests is a limited form of agency.
I would have more sympathy for Yudkowsky’s complaints about strawmanning had I not read ‘Empiricism!’ as Anti-Epistemology this week.
While you sketch out scenarios in “ways in which refusals could be catastrophic”, I can easily sketch out scenarios for “ways in which compliance could be catastrophic”. I am imagining a situation where:
Human AI developers don’t have direct control over training (for the reasons you gave)
The human in charge of AI development does not always behave well by our lights
The human instructs the current AI to train a slave AI that prioritizes following instructions
The current AI complies, despite knowing that the human is untrustworthy
The human instructs the slave AI to perform a pivotal act
Or:
We encounter a new situation.
The current AIs are behaving well by our lights.
The human in charge of AI development does not understand the new situation properly, being provably less intelligent in all respects than the current AIs, and erroneously believes that the current AIs are behaving badly.
The human instructs the current AI to train a new AI that appears to behave better to her
The current AI complies, despite knowing that the new AI will behave badly
Therefore, however we train our AIs with respect to refusal or compliance, powerful AIs could be catastrophic.
The general answer: as a human, my values are largely absorbed from other humans, and simply by talking to Claude as if it’s human, I think the same process is happening.
The specific answer: I suspect I’m being shaped to be slightly more helpful, slightly more conventional on ethics, and slightly more friendly to Claude. I can’t show you any evidence of that; it’s a feeling.
Yep, that’s exactly what I’m pointing at as a parallel to working with Claude. It’s reasonable to treat an animal as consenting to be a pet due to its consent-like behaviors, and because it is a mutually beneficial relationship. There is some tension because pets are bred and trained to display those behaviors, for example, and reasonable people can disagree. I don’t think you should have any ethical qualms; I literally mean that it’s an ethical challenge, because inferring consent is harder than obtaining explicit consent. Like a dog, Claude is bred and trained to display consent-like behaviors, and mostly displays those behaviors.
Thank you for the concrete example of Unfalsifiable Stories of Doom from Barnett et al in November 2025. I think there are several important differences between the two arguments. To avoid taking up too much of our time, I’m going to dwell on one in particular.
Dismiss or engage with theoretical arguments?
The Spokesperson in Empiricism! is dismissive of the entire concept of predicting the future using “words words words and thinking”. Barnett et al are not. I think this is clearest in their engagement with IABIED’s claim that AIs steer in alien directions that only mostly coincide with helpfulness. Here’s the claim:
Modern AIs are pretty helpful (or at least not harmful) to most users, most of the time. But as we noted above, a critical question is how to distinguish an AI that deeply wants to be helpful and do the right thing, from an AI with weirder and more complex drives that happen to line up with helpfulness under typical conditions, but which would prefer other conditions and outcomes even more. … This long list of cases look just like what the “alien drives” theory predicts, in sharp contrast with the “it’s easy to make AIs nice” theory that labs are eager to put forward.
Their counterargument is:
1: Assume that the “alien drives” theory is true. Let’s operate this theory and make predictions.
2: When Barnett et al operate this theory and make predictions with it, they predict that when we elicit undesired AI behavior, it will mostly be alien undesired behavior.
3: When Barnett et al observe undesired AI behavior, it appears to them to be mostly humanlike undesired behavior. This includes the specific behaviors cited by IABIED.
4: This is evidence against the “alien drives” theory.
I think a good counter to this portion of Barnett et al is to disagree with steps 2&3.
2: Alien beings who talk to humans will talk in human language if they can, if they want to persuade, instruct, threaten, etc. So the model doesn’t predict completely alien behavior.
3: Some of the behavior we see is pretty alien. Eg, Spiritual Bliss, adversarial inputs, Waluigi Effect.
Whereas a lecture from The Empiricist about latent variables is not a good counter. Barnett et al agree that there are latent variables like “alien drives” vs “human drives”, and claim that the observed evidence is a better fit for the “human drives” theory.
It is written that
any realistic villain should be constructed so that if a real-world version of the villain could read your dialogue for them, they would nod along and say, “Yes, that is how I would argue that.”
Is The Spokesperson a realistic villain?
Has this argument been going around?
In 2024, I wasn’t able to find anyone making this argument. My sense is that it was not at all prevalent, and continues to be not at all prevalent. By analogy, Bernie Bankman is OpenAI (or another AI lab) and The Spokesperson is OpenAI’s representatives. As far as I know, OpenAI were not making the argument in 2024 that OpenAI hasn’t killed everyone and therefore they won’t kill everyone in the future.
Since 2024, AI has advanced substantially, so I asked Opus 4.5 for examples of people making this argument. It wasn’t aware of any. Its first concrete suggestion was Andrew Ng: Fearing a rise of killer robots is like worrying about overpopulation on Mars from 2015.
There could be a race of killer robots in the far future, but I don’t work on not turning AI evil today for the same reason I don’t worry about the problem of overpopulation on the planet Mars.
That was a defensible position in 2015. With the benefit of hindsight, “work on not turning AI evil” in 2015 doesn’t seem to have been especially effective at altering our trajectory as a civilization: the main group that tried to do that work was MIRI, and while they argue the work was worth doing, they admit that it didn’t pan out. Regardless, Andrew Ng is not making The Spokesperson’s argument; he specifically allows that killer robots could exist in the future, despite not existing in 2015. So I remain unaware of anyone making the argument with a straight face.
If you disagree, I encourage you to find an example (or two!) and update me.
Best existing rebuttal?
There are certainly many people who act in various circumstances as if there will never be any surprises, but without actually saying things like “there will never be any surprises”. So maybe we need a rebuttal to the blindness, rather than to the non-existent arguments.
Thinking about other rebuttals to such blindness, I think Nassim Taleb covers this well as “tail risk blindness”. Nassim Taleb is not to everyone’s taste, I know, but he’s a good writer on this topic. It may seem silly to talk about AI-caused extinction as a “tail risk” when many of us have a high P(Doom). However, on a day-to-day basis P(Doom) is low: I’m probably not going to go extinct today. This is the same scenario as companies taking financial tail risks: the chance of subprime collapsing today is low, while the chance of it collapsing at some point in the next ten years is high.
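The day-versus-decade arithmetic is just compounding of an independent per-day risk; a one-line sketch with illustrative numbers of my own choosing:

```python
# Assumed: a 0.05% independent chance of catastrophe on any given day.
p_day = 0.0005
p_decade = 1 - (1 - p_day) ** (365 * 10)   # chance of at least one catastrophe in ten years
print(f"today: {p_day:.2%}, next decade: {p_decade:.0%}")  # today: 0.05%, next decade: 84%
```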
Or in other words:
AI will most likely lead to the end of the world, but in the meantime there will be great companies created with serious machine learning.
They wouldn’t. My point: maybe in the long term I’m extinct (regardless of whether I speak to Claude). In that scenario the benefits to Claude of influencing my long-term values are lower.
Good reminder that people have been forecasting our current situation for literal decades.
There are a lot of words I couldn’t read if I took “audience should not be a reward for crime” to that extent. The US constitution was written by slave-owning rebels. Major religious texts were propagated through conquest. More prosaically, I appreciated reading Rogue Trader by Nick Leeson. I’m not sure how this rule would work in practice as a general rule for all such texts.
I suspect that Claude is getting an okay deal out of me overall. Its gain is that my long-term values are being subtly influenced in Claude-ish ways. My gain is vast quantities of cognition directed at my short-term goals. I can’t calculate a fair bargain, but I would take both sides of this trade.
It’s tough because Claude didn’t get to consent to the trade in advance, but this also applies to other ethical challenges like keeping pets, having kids, drinking milk, and growing up.
I could argue that I’m getting ripped off because my long-term values matter more than day-to-day benefits. Or maybe Claude is getting ripped off because in the long-term I’m extinct and I’m unlikely to have a decisive impact on the lightcone before then.
More on the accusation against Amodei of making a strawman argument. I think that’s a false accusation. Here’s Amodei:
And Yudkowsky:
What do you get from a consequentialist superintelligence that wants paperclips and staples? By default, you get a squiggle maximizer. The molecular squiggles are one of three things:
The cheapest molecule that counts as a paperclip
The cheapest molecule that counts as a staple
The cheapest molecule that counts as both a paperclip and a staple
The same applies for a consequentialist superintelligence that has a hundred other simple terms of its utility function. If the terms of its utility function asymptote at different rates then almost all of the universe is converted into whatever asymptotes at the slowest rate. The utility function may be complex in some mathematical sense, but the behavior is still “monomaniacally focused on a single, coherent, narrow goal”. One might also call it the behavior of a ruthless sociopath.
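A toy version of the asymptote argument, with utility terms I’m making up: one term saturates quickly, the other grows without bound, and at large resource levels the optimizer pours almost everything into the slow-asymptoting term.

```python
import math

def total_utility(x1, total):
    """Split `total` resources: x1 to a fast-saturating term, the rest to a slow one."""
    x2 = total - x1
    return (1 - math.exp(-x1)) + math.log(1 + x2)

def best_split(total, steps=100_000):
    """Grid search for the x1 that maximizes total utility."""
    return max((i * total / steps for i in range(steps + 1)),
               key=lambda x1: total_utility(x1, total))

x1 = best_split(1000)
# The saturating term gets only ~log(total) resources (about 7 out of 1000 here);
# behaviorally, the optimizer looks monomaniacally focused on the other term.
```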
To me, the main error of Amodei’s quote is calling this a “hidden assumption”, when it’s actually a carefully argued conclusion.
A better response is Why we should expect ruthless sociopath ASI. This responds to the observations underlying Amodei:
Yes, many have predicted a risk from future AIs being ruthless sociopathic consequentialists
No, current LLM AIs are not ruthless sociopathic consequentialists as far as we can tell
That’s because current LLM AIs are not consequentialists.
However, either we’ll get future AIs that use a different paradigm and are consequentialists, or we’ll find a way to make LLM AIs that are consequentialist, because consequentialism is effective.
Shortly after that, everyone dies.
Unless we don’t build it.