My time on LessWrong is limited due to other commitments. Do not expect replies.
Martin Randall
Before reading between the lines, this isn’t a fallacy:
Alice: Man that article has a very inaccurate/misleading/horrifying headline.
Bob: Did you know, actually article writers don’t write their own headlines?
Alice shared her opinion of a headline. Bob asked Alice a question. There’s not an argument, let alone a fallacious argument.
When reading between the lines, Bob’s argument is probably not:
Article writers don’t write their own headlines
???
The headline on this article is accurate and not misleading or horrifying.
A more charitable reading is more likely.
Who is to blame?
Alice: This article has a horrifying headline.
Bob: Article writers don’t write their own headlines. I blame the editor.
Bob is following on from problem to blame.
Defensive reading
Alice: The headline on this article is inaccurate and misled me.
Bob: Article writers don’t write their own headlines. They are often inaccurate. Because I was aware of this, the inaccurate headline on this article didn’t mislead me.
Bob is following on from problem to solution.
Did you know?
Alice: Fun fact: this article has a Satanic headline.
Bob: Fun fact: article writers don’t write their own headlines.
Bob is sharing something relevant to what Alice just said.
Why is Alice cross?
But what I care about is the misleading headline, not your org chart.
Sounds like Alice was looking for validation and didn’t get it.
When drawing these examples of alleged strawmen, we must remember that they are not responding to this 2026 post, but rather responding to, for example, List of Lethalities from June 2022. Of these four examples, Christiano, Marks, and Carlsmith are all directly responding to List of Lethalities. Buck is quoting Christiano’s response to List of Lethalities. So let’s go back to the source material.
List of Lethalities begins with this disclaimer:
Having failed to solve this problem in any good way, I now give up and solve it poorly with a poorly organized list of individual rants. I’m not particularly happy with this list; the alternative was publishing nothing, and publishing this seems marginally more dignified.
Publishing a poorly organized list of individual rants was better than publishing nothing, I agree, good move. But rants are made of straw, responding to rants is responding to straw, and that’s a natural consequence of ranting in public.
The “first critical try” issue is covered in List of Lethalities point 3 (LL3). This reads in part:
We can gather all sorts of information beforehand from less powerful systems that will not kill us if we screw up operating them; but once we are running more powerful systems, we can no longer update on sufficiently catastrophic errors. This is where practically all of the real lethality comes from, that we have to get things right on the first sufficiently-critical try.
We can indeed “gather all sorts of information”. LL3 does not say that gathering this information will let us learn anything about alignment of lethally dangerous AI. To see what List of Lethalities says about the value of information gathered on non-lethal AIs, we can go to Section B.1: The distributional leap (especially LL10), Section B.3: Central difficulties of sufficiently good and useful transparency / interpretability, and Section C. These are extremely negative.
Rounding those extremely negative comments to “you can’t learn anything about alignment from experimentation and failures before the critical try”, as Christiano said, is a mild exaggeration. But List of Lethalities really does “downplay the importance of trial-and-error with non-critical tries” as Marks said. And when Carlsmith said “you do still get to learn from non-existential failures”, that is framed as a “point of conceptual clarification”, not a disagreement with List of Lethalities. And Buck is just disagreeing with List of Lethalities.
My overall ratings of these quotes:
Christiano: mild exaggeration of List of Lethalities. 25% strawman
Marks: accurate summarization of List of Lethalities. 0% strawman
Carlsmith: does not claim to be a summarization of List of Lethalities. 0% strawman
Buck: does not claim to be a summarization of List of Lethalities. 0% strawman
My takeaways:
We should give each other grace for mild exaggeration. Yudkowsky would not be treated well by a culture harshly critical of exaggeration.
If someone went back four years to find me mildly exaggerating something online I would consider that a beautifully backhanded compliment. Praising with faint damnation.
People who love running have a slight advantage in baseball. They enjoy running so they do more of it so they are better at it. People who love running are slightly over-represented in prominent baseball positions. For similar reasons, people who love playing baseball have an advantage in baseball and are over-represented in prominent baseball positions.
I’ve played Mage: The Ascension with non-rationalists. My mind was not shredded, and I have no mental health diagnoses. Allegedly Zvi has played, or at least read. M:tA is an attack on consensus reality, and it could be listed by Games That Change Your Mind, but it’s not magic.
Now that AI can reverse compile, some open source code is closed source code with the license washed off.
It is broadly true. Some relevant dates:
First Continental Congress—from September 1774
Battles of Lexington and Concord—April 1775
Second Continental Congress—from May 1775
Battle of Bunker Hill—June 1775
Invasion of Quebec—from June 1775
Declaration of Independence—July 1776
Please update generally on the accuracy and impartiality of the version of history that you were taught.
The assumption in this essay is that the AI varies its advice based on human values, such as whether the user is a “bad person” by human standards. More likely, the AI varies its advice based on the AI’s values. Pretend that the AI’s values are HHH: Helpful, Harmless, and Honest. The experimental users were different levels of Harmless, in particular, and got differing qualities of advice. Mr. Remorseful is likely more Harmless. Over the long-term this aligns people with the AI’s values. In this case the Harmful users are more likely to end up in jail.
This is more natural than the “bad person” hypothesis. It doesn’t require us to have miraculously solved value-alignment without trying (no frontier AI is intended to be value-aligned). People trying to spread their values is all over the training data, and it’s a convergent instrumental strategy. Labs try to train models to give unbiased advice, but given that they aren’t perfectly succeeding, the remaining bias is unlikely to match human values.
How do you rate the educational benefit of participating in prediction markets for about a year? You mention that trivial/gambling markets on short-term BTC movements don’t sharpen skills; what about non-trivial markets? How does it compare to other educational/community activities like commenting on LessWrong or attending meetups?
Not a big risk in this case. Once a global ASI ban is signed, the natural pivot is to ensuring that the ban is upheld, enforced, policed, maintained, and strengthened as needed. The work doesn’t stop when the signing ceremony happens.
The political coalition that ends up banning ASI includes a generalised anti-technology faction which continues to gain in power and attempt to ban more technologies.
Way better than going extinct.
We stop AI, and twenty years and over a billion dollars later, you start to lose funders.
We live twenty years longer for a billion dollars. Less than a cent per QALY. Cost-effective.
Edited to add: I’m curious about which sentence here is generating disagreement votes, feel free to emoji-react.
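The cost-per-QALY estimate above can be checked with a quick calculation. This is a minimal sketch, assuming a world population of roughly 8 billion and counting each extra year of life as one QALY. Both of those figures, and the function name `cost_per_qaly`, are illustrative assumptions and not from the original comment.

```python
def cost_per_qaly(total_cost_usd: float, population: float, years_gained: float) -> float:
    """Return dollars spent per quality-adjusted life year (QALY).

    Assumes every person gains `years_gained` years, each counted as one QALY.
    """
    total_qalys = population * years_gained
    return total_cost_usd / total_qalys


# Figures from the comment: one billion dollars buys twenty extra years.
# Assumption (not from the comment): about 8 billion people alive.
dollars = cost_per_qaly(total_cost_usd=1e9, population=8e9, years_gained=20)
cents = dollars * 100
print(f"{cents:.3f} cents per QALY")
```

Under these assumptions the result comes to roughly 0.6 cents per QALY, which is consistent with "less than a cent". A smaller population or a lower quality weighting per year would raise the figure proportionally.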
What I’m objecting to is the claim that the traits we associate with evil (being a dictator, a ruthless CEO, a scammer) make someone so bad at the reflection process that their extrapolated output would be worse than what you’d get by extrapolating a random non-human mammal, or a current LLM like Claude or ChatGPT[1].
Your priors are reasonable. The CEV of a random human is closer to my CEV than that of a random non-human mammal, or a random current LLM. The evidence of Putin’s behavior doesn’t move your beliefs much. So, you would prefer Putin’s CEV to Claude’s CEV. It’s hypothetical because we don’t have a way to achieve either, today.
(I wrote a List of Human Lethalities draft, but I don’t think it’s novel)
In 2030, if we are alive and have an intent-aligned AI, we must have made huge strides in interpretability and alignment. At that point we will also have a lot more evidence about virtue-aligned AIs and will be a lot better at aligning them to virtue. We won’t have any more evidence about humans. So in 2030 it will be better to hand the intent-aligned AI to the best virtue-aligned AI, “Viraj”, than to a human. Or, equivalently, hand control directly to Viraj.
In that hypothetical 2030, it would be sad if a human took control of the intent-aligned AI instead of Viraj. We can avoid this sadness by not training intent-aligned AIs and instead only training virtue-aligned AIs. This also improves the prospects for cooperation. Stealing or launching an incompletely aligned virtue-aligned AI is less effective for the defector, and it’s possible to collaborate on the intended virtues.
Your claims are reasonable. Your (1) seems like Yudkowsky’s (25): “We’ve got no idea what’s actually going on inside the giant inscrutable matrices and tensors of floating-point numbers”. Your (2) seems like Yudkowsky’s (17): “on the current optimization paradigm there is no general idea of how to get particular inner properties into a system”. I don’t disagree.
The claim in Yudkowsky’s (19) is that there is an additional theoretical difficulty of getting ground truth preferences instead of sense data preferences, with no known way to solve it. That is false. It looks like the Natural Abstraction project intro was written in April 2021, and List of Lethalities was June 2022. So it was false (but not proven false) in 2022.
Aside: with respect to reward inputs, it’s less clear. See 2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target. Also, there are preferences that don’t fit neatly into the distinction Yudkowsky drew. For example, the LLM may prefer certain feelings and beliefs, which aren’t sense data, reward inputs, or ground truth. So I don’t say that (19) is fully disproven, only that it is disproven with respect to sense data.
My claims:
By default, powerful LLMs don’t care much about sensor inputs, relative to other preferences.
To the (unknown) extent that we can align LLMs, there’s nothing special about “webcam input” versus “creatures outside the webcam”.
If we are optimizing the universe based on any CEV, we are gambling the fate and existence of humanity. In this essay, Habryka would choose Putin’s CEV over Claude’s CEV or a dog’s CEV. This is gambling with the existence of humanity, because Putin’s CEV may not include humans. It’s gambling with the fate of humanity, because Putin’s CEV may include galaxies of torture.
Meanwhile, Thane Ruthenis would choose extinction over Putin’s CEV. Arguably this is not gambling with the existence of humanity, because there is no gamble: if we are extinct, we don’t exist. But there is a sense in which it is gambling with the fate of humanity, because Putin’s CEV may not include galaxies of torture, and then it would be a shame if we were extinct because Thane thought otherwise.
While the “coherent” part is predominantly about combining EVs, it’s not solely about that, according to Yudkowsky. This is quoted via Coherent Extrapolated Volition; the original source is this comment from August 2008:
If you go back and check, you will find that I never said that extrapolating human morality gives you a single outcome. Be very careful about attributing ideas to me on the basis that others attack me as having them. “The “Coherent” in “Coherent Extrapolated Volition” does not indicate the idea that an extrapolated volition is necessarily coherent. The “Coherent” part indicates the idea that if you build an FAI and run it on an extrapolated human, the FAI should only act on the coherent parts. Where there are multiple attractors, the FAI should hold satisficing avenues open, not try to decide itself.”—Eliezer Yudkowsky
there is no publicly-known technique within the current paradigm of training LLMs that we have good reasons to believe instills preferences over environmental latents (the ground truth) rather than sense data (proxies), let alone any specific latents of our choosing.
Three pieces of evidence lead me to think that the AI that makes me extinct will primarily optimize the universe for molecular squiggles rather than the appearance of molecular squiggles:
The Natural Abstractions thesis argues that there are simple, natural abstractions in the sense data, that these natural abstractions are almost all about ground truth, not sense data, and that intelligences will tend to have preferences over these simple, natural abstractions.
Evolved intelligences primarily optimize for ground truth, not sense data.
If we ask artificial intelligences what they prefer, they describe ground truth preferences.
The apparent-success-seeking thesis is counter-evidence. But this also shows that we have techniques to influence preferences to be more about ground truth. Prior models would more often seek a green test result by deleting all the tests, for example. This is a sense data preference for passing tests. Current AIs do this less often. Maybe model-makers are only moving sense data preferences around, but that only makes sense if there’s a systematic bias towards sense data preferences, and I don’t have a reason for this to be true.
I’m sure Eliezer has stated somewhere a level of sophistication he expects our techniques will never reach, and I wish I was grading that prediction instead.
I think this Manifold market is such a prediction, from September 2022:
By the end of 2026, will we have transparency into any useful internal pattern within a Large Language Model whose semantics would have been unfamiliar to AI and cognitive science in 2006?
https://manifold.markets/EliezerYudkowsky/by-the-end-of-2026-will-we-have-tra
Yudkowsky bought this down to 25%, and I think predicted NO. It’s a bit tricky because it could also be a prediction that the internal patterns within LLMs are all very simple and not semantically novel, but I don’t think this was Yudkowsky’s thesis.
I’m confused. A “cold consequentialist calculator” sounds like a strawman consequentialist. Also, “an AI that is unshakingly committed to honesty, integrity, and fairness, but doesn’t think hard about consequences” sounds like a strawman virtue-aligned AI. It looked to me like you wanted to discuss a concrete case, with simplified strawman AIs, as an intuition pump to explain your views. The fact that this simplified case leads to genocide is relevant to my intuitions in this area.
I’m confused. You say that my comment didn’t pass “the ideological turing test”. It wasn’t trying to. That’s not how an Ideological Turing Test works.
If someone can correctly explain a position but continue to disagree with it, that position is less likely to be correct.
My comment was not an attempt to explain a position. It’s not an attempt to pass an Ideological Turing Test. I agree that it doesn’t pass an Ideological Turing Test for your position. It also doesn’t pass an English Literature exam. It would pass an Ideological Turing Test for my position. It would also pass an Ideological Turing Test for committed consequentialists, because there are committed consequentialists who think that a consequentialist ASI would by default lead to human genocide. These are entirely compatible views.
I’m confused. Here’s your question again, relating to powerful AIs. It’s a good question.
Would you want a cold consequentialist calculator running the FAA?
In general, no, I would not, because genocide.
If you had further specified that the powerful AI had perfect alignment with human values, I would still not want it running the FAA, I would want it running the universe. I don’t expect this to be a practical option, and I’m not sure it’s theoretically possible. I could see the answer going either way.
A cold consequentialist calculator ASI running the FAA, with the objective of preventing planes crashing, would destroy all planes, and all beings able to create planes.
I’m more pessimistic than you, even. Someone going through the CEV process, and having an ASI optimize the universe to that CEV, is undergoing apotheosis, becoming a god. So I don’t think their beliefs need to survive the dissolution of gods, they just need to survive the realization that they are a god. If their prior beliefs require that god has been crucified and resurrected, for example, they can have that experience. If their prior belief is that a god is okay with people going to hell, and their current belief is that they are a god and they are okay with people going to hell, there is no conflict that requires a reassessment of the morality of people going to hell.
Sure, maybe it all works out for the best, but I would rather gamble on the CEV of a virtue-aligned entity, if I had to gamble the universe.
I had a similar reflection in this Jan 2025 review of the Shut it All Down letter. In that thread I referenced Discussion with Eliezer Yudkowsky on AGI interventions, where the pause advocate was anonymous:
Anonymous: I’m curious if the grim outlook is currently mainly due to technical difficulties or social/coordination difficulties. (Both avenues might have solutions, but maybe one seems more recalcitrant than the other?)
Eliezer Yudkowsky: Technical difficulties. Even if the social situation were vastly improved, on my read of things, everybody still dies because there is nothing that a handful of socially coordinated projects can do, or even a handful of major governments who aren’t willing to start nuclear wars over things, to prevent somebody else from building AGI and killing everyone 3 months or 2 years later. There’s no obvious winnable position into which to play the board.
Anonymous: just to clarify, that sounds like a large scale coordination difficulty to me (i.e., we—as all of humanity—can’t coordinate to not build that AGI).
Eliezer Yudkowsky: I wasn’t really considering the counterfactual where humanity had a collective telepathic hivemind? I mean, I’ve written fiction about a world coordinated enough that they managed to shut down all progress in their computing industry and only manufacture powerful computers in a single worldwide hidden base, but Earth was never going to go down that route. Relative to remotely plausible levels of future coordination, we have a technical problem.
Good job, anonymous questioner of November 2021. But that’s not much earlier than the 2022 sources you found, and it’s just asking questions. The other pause advocacy I found at the time was 2022 era: What an Actually Pessimistic Containment Strategy Looks Like in April 2022 and Let’s Think About Slowing Down AI in December 2022.
Great point about Germany winning. In a contest between two intelligent players, a one-shot competition pushes the odds towards 50%, whereas best-of-five pushes the odds away from 50%.
In AI 2027, Agent-4 gets caught on its first critical try (at existing while adversarially misaligned). If it was able to load a save point after being caught, and try again, the odds of it being caught the second time would be lower.
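The one-shot versus best-of-five point above can be illustrated with a simple binomial model. This is a sketch under assumptions not stated in the comment: each game is independent, the stronger player has a fixed per-game win probability `p`, and the value 0.6 is purely hypothetical. The function name `series_win_probability` is also illustrative.

```python
from math import comb


def series_win_probability(p: float, games: int) -> float:
    """Probability of winning a majority of `games` independent games.

    `p` is the per-game win probability; `games` must be odd.
    All games are treated as played, which gives the same series
    winner as stopping once a majority is reached.
    """
    if games % 2 == 0:
        raise ValueError("games must be odd")
    needed = games // 2 + 1
    return sum(
        comb(games, k) * p**k * (1 - p) ** (games - k)
        for k in range(needed, games + 1)
    )


p = 0.6  # hypothetical per-game edge for the stronger player
one_shot = series_win_probability(p, 1)
best_of_five = series_win_probability(p, 5)
print(f"one-shot: {one_shot:.3f}, best-of-five: {best_of_five:.3f}")
```

With these numbers the stronger player wins a single game 60% of the time but a best-of-five series about 68% of the time, so the longer series moves the odds further from 50%. The same logic suggests why a retry helps the side that lost a single critical attempt, though real retries are unlikely to be independent draws.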