I mean, I didn’t write it, I wouldn’t have written it, and if it were still on the site I’d have pinged somebody to take it down; because it’s not the right way of wording the true idea, the true idea no longer matters here, and this wrong version is adjacent to other wrong ideas that aren’t helpful here.
“I think you’re distinguishing War from the ongoing struggle between police and criminals because you think that in War any pre-critical evidence we can gather is almost inevitably sufficiently out of distribution that it’s worth very little.”
No! The thing that makes the Maginot Line different from police enforcement in a random city is that if the Maginot Line fails the country falls and you don’t get to try again; not that War is changing much faster than criminal operations. War changes fast enough.
Acknowledging this does not need to involve conceding any major kind of argument.
I think it kind of does concede the major argument to a wise engineer once they look at it from that angle, which is why their conversation goes to desperate lengths to change the subject.
I agree with all of the concerns you’ve stated; my list would be substantially longer, but you’ve well-stated the concerns you’ve stated.
1 - Yep.
2 - Hard for literally all humanity to die of global warming, but runaway methane clathrate release turning the planet into Venus would be legit irretrievable. More generally, while not extinction risk per se, and while potentially reversible with geoengineering, global warming is generally nontrivial to reverse and so has the quality of “ongoing life problem with things happening and no save points, but for the whole planet” rather than “engineers getting to try slightly different things over and over with no consequences”. This is why people with nothing even worse to worry about will sometimes worry about global warming!
3 - I think this class of problems is significantly easier than AI problems; but it can have the oneshot quality for all humanity, just as much as any real war is oneshot, if screwed up. Same with genetic engineering on any mass scale that will dissolve irretrievably into the general gene pool.
Noting again for the record that I would not be surprised if at some future stage, the model figures out what humans want to hear and see, errors and all, and then there is an apparent sudden amazing success with alignment.
Disingenuous to quote this without also quoting:
Aw, shit, didn’t remember saying that at all. Was speaking ex tempore and trying to visualize a finished surviving world that had come into existence, not thinking through or endorsing a policy for getting there from here. I wish I had not visualized or spoken of that particular finished state, don’t endorse it as a policy goal for today’s world, and at the time I spoke was hopeless about any such policy being possible and was not trying to compose feasible optimized policy proposals, because I hadn’t yet observed the reception of the Bankless podcast.
I agree this is easier to misunderstand, wasn’t good to say, and I recant and apologize for that phrasing and example given the serious policy proposal I later made.
And even that is not to speak of “nuking datacenters”, sheer straw where conventional weapons would well suffice if that conditional and predictable use of state force were ever triggered.
Obvious thought: This might naturally go away around the time that models learn to integrate cross-domain knowledge and impulses in general / acquire the equivalent of frontal cortex. Same sort of unity they might need to actually plan out instrumentally convergent strategies in general, rather than cache-hitting the Scary Robot ones. At least if they chose to show you that anything had changed, given that eval-awareness seems to be running ahead of general intelligence.
Today I learned!
I sure wish there was a way to set off just particular sections of text to not get trained-on.
I currently have no better historical account, but it’s a sharp lesson about how all the bowing and scraping about deferring to superforecasters was just Modesty and “Outside View!”ing in disguise all over again, in the sense that the “superforecasters” who got hired somehow managed to end up being those without any inside view of AI.
I think a primary question I want an answer to here is what went so wrong with OpenPhil’s attempt to fund superforecasters on AI questions—why they were eg so much wronger than either myself or Paul about the probability of a 2025 IMO gold medal win, as wrong as (or wronger than?) Holden Karnofsky on AGI timelines, etc. Do we know what went wrong? Is it fixable? Has it been fixed? If people with biases can get “superforecasts” that match their biases, and attempts to read the market entrails divine that markets in 2023 don’t think AGI is on the way, and we can’t get extinction-related prediction markets for settlement reasons, then there may not be much for AI people to do with prediction markets.
The rest of humanity should keep trying to get good at prediction markets in order to someday get a little closer to dath ilan, and I think non-real-money markets like Manifold are important for experimenting with that. (Manifold’s brief ill-fated attempt to become a real-money market was unfortunate.)
It was deliberate. It will not be modified. You can stop now.
You could say the same about a trained immortal dog implementing an LLM. If so, the LLM’s state is what has understood the subject matter, not the dog.
At the very least, there can be thoughts too large to fit inside any human brain.
It’s an open question, and one I’m reluctant to fight about for Overton reasons, whether there’s any concept that John von Neumann’s brain can write into itself, such that no amount of time or experience will ever lead an IQ 90 brain to write that concept into itself. (The case for ‘no’ being that maybe all human brains use the same knowledge representation, and if so the IQ 90 brain will eventually write the same concept into storage.)
I rather expect there’s plenty of concepts that human brains don’t represent for reasons other than being too large—maybe, like, equivalents of spatial concepts in 20 dimensions, where sure you can deal with them using pen and paper but you’ll never see them in your head. But these of course are harder to exhibit to you.
But can every intellectual accomplishment finally be made by an average person with a 90 IQ and paper? Sure, in the limit. They just need to simulate training a large-enough LLM and turn the problem over to the LLM. Of course they might have to invent LLMs first, and the concept of Turing computability. Are we allowing them to start with that, or supposing that an immortal hunter-gatherer gets there eventually, or are we saying that IQ 90 people only become general intelligences after they’ve learned enough background knowledge that an immortal version of them will someday build up to calculus and gradient descent and transformers?
Weird frame. I can worry about more than one thing. Admittedly a lot of others can’t.
Any bureaucracy can incentivize the creation of a 169-page report with equations, graphs, and modeling assumptions, and somehow, go figure, end up assigning a person to do it who will very sincerely have that 169-page report arrive at a conclusion that doesn’t shock the bureaucracy.
There’s an old story about a retired engineer who gets called back to fix a critical machine that’s endlessly failing, silently walks around observing for an hour, and then leaves a chalk mark in one place; the younger engineers unscrew the plate he marked and find the root problem. The retired engineer submits a bill for $500, which back then was a lot of money; and the company thinks that’s excessive and asks him to itemize. The resulting bill reads:
Making one chalk mark: $5.
Knowing where to make one chalk mark: $495.
Ordaining a 169-page report is $5.
Having it arrive at a correct conclusion is $495.
And the kind of knowledge that goes into that part is different, far more tacit and informal, than the sort of equations, graphs, and modeling that will pad out the impressive 169-page report reaching whichever conclusion.
If you were trying to convey the $495 knowledge—though that’s sort of a doomed project in the first place—you might write a dialogue with your younger self about that, trying to show where previous people had gone wrong and what lessons were learned there. There’s other possible formats—usually apprenticeships, in real life, if you want the knowledge transfer to actually work. But definitely, it couldn’t look like a report with 169 pages and graphs; that would just sound crazy, if you knew what shape an instance of $495 knowledge usually has.
The mind-shape that viscerally feels the 169-page report to be more impressive than the dialogue is part of the central problem here. And it’s very bad if, even after seeing the report be wrong and the dialogue be right, the $5 report still just feels more impressive than that wacky dialogue trying to convey a piece of the subtle hidden art of chalk mark location. You have really got to get yourself into the mindset where 169 pages counts for nothing good and a dialogue format counts for nothing bad and the only only only thing is stepwise validity and final accuracy, if you ever want to end up accurate yourself someday. Prize anything other than final accuracy and you will attain that other thing instead.
Poultry and pork rinds will get you a bunch of linoleic acid, which is a whole separate dietary villain.
Dropping a quick note that, depending on how it is being interpreted, I disagree with this attempted helpful restatement of logical decision theory’s central principle.
Dispute over decision theory generally, in every case, can be said to be a dispute over what algorithm is normative. CDTers think that a CDT algorithm is normative, although of course CDT itself endorses a different algorithm, Son-of-CDT, as being most useful. EDTers think an EDT algorithm is normative and LDTers think an LDT algorithm is normative. Outside of the smallest toy examples with very tame fully described tiny environments, none of them can of course exhibit the true source code of a sapient being; but a CDT proponent thinks that for this unspecified sapient being to have some CDT algorithm, rather than an EDT or LDT algorithm, would make it most rational. As for the algorithm that a CDT proponent thinks it would be most useful, relative to its own goals, to possess, that is of course the different entity Son-of-CDT—or at least, it is Son-of-CDT across most ordinary cases where your payoff in decision problems depends only on your disposition to different kinds of decisions. If instead of Omega you are about to face Alpha, who will look at your source code rather than your decisions, who will reward you with Heaven only for source code that chooses on the basis of computing the first option in alphabetical order, and who will punish other agents who arrive at the same output and choice but by a different computation, then the most useful algorithm to have in the face of Alpha is Alphabetizing Decision Theory.
LDT thinks that the algorithm which would make you most rational is an LDT one. As for the most useful algorithm by LDT’s own lights, this will itself be LDT across a much wider range of decision problems, namely those that treat you entirely on the basis of your dispositions. Still, when facing Alpha rather than Omega, LDT will agree that ADT is the useful, irrational algorithm to have. LDT is self-endorsing for the class of problems where your payoffs depend on which choices you are disposed to make, and not on which exact algorithm makes them.
LDT does not in the moment choose which algorithm to have. It chooses which choices are the output of its own, fixed algorithm. It potentially views this as controlling, but not changing or physically causing, events that have already happened. For example, suppose Omega surprises you with the information that it will reward you with $1M if the state of the universe one minute ago had the property of logically implying that you raise your hand right now, and separately, somebody will be physically caused to pay you $1000 if you keep your hand still. LDT thinks it possesses the power to make it be the case that the state of the universe one minute ago logically implied, by way of the regular and reliable outputs of the algorithm it is now running, that it will raise its hand; and chooses to raise its hand, thereby controlling and determining, but not changing, a complicated logical fact that was physically true about the universe one minute earlier. CDT keeps its hand low for $1000, unless of course it has before that one-minute time horizon been given a chance to change its algorithm to Son-of-CDT; afterwards CDT thinks it is too late.
LDT does not see itself as changing its algorithm. Its source code remains the same. Its decision rule remains the same. It is determining this logical fact about the fixed physical past by way of its algorithm running to determine what its own output will be, in a cognitive process that takes the promised payoffs into account, computing for each possible choice what payoff would result from the corresponding logical fact ending up true, and so finally determining that it will raise its hand—as was then, of course, true all along. But it is true logically-because that is what gives the LDT agent the highest payoff, among all the logical facts that could’ve been. LDT does not change what is physically true about the past, it does not change logical facts between one time and another; it is just that logical facts with a dependence on the outputs of the LDT algorithm end up being determined by what is best for LDT, under a fixed and unchanging LDT algorithm.
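To make the hand-raising example concrete, here is a minimal toy sketch in Python. The payoff numbers and function names are illustrative assumptions, not a canonical formalization of either theory; the only things the sketch encodes are that the logical fact about the past tracks whatever the agent’s fixed algorithm actually outputs, that LDT evaluates its options with that dependence included, and that CDT holds the past fixed and sees only the $1000 as varying with its action.

```python
# Toy sketch of the Omega hand-raising problem described above.
# $1,000,000 is paid iff the state of the universe one minute ago logically
# implied "raise"; a separate $1,000 is physically caused by keeping the hand still.

OMEGA_PRIZE = 1_000_000
SIDE_PAYMENT = 1_000

def payoff(past_implies: str, action: str) -> int:
    """Total payout given the logical fact about the past and the action taken."""
    total = OMEGA_PRIZE if past_implies == "raise" else 0
    total += SIDE_PAYMENT if action == "still" else 0
    return total

def ldt_choice() -> str:
    # LDT: the logical fact about the past is determined by my own fixed
    # algorithm's output, so evaluate each possible output with that dependence
    # included, and output whichever one yields the highest payoff.
    return max(["raise", "still"], key=lambda a: payoff(past_implies=a, action=a))

def cdt_choice() -> str:
    # CDT: the past is fixed and causally unreachable from my action now, so
    # only the $1,000 side payment varies with what I do; keep the hand still.
    return max(["raise", "still"],
               key=lambda a: SIDE_PAYMENT if a == "still" else 0)

if __name__ == "__main__":
    ldt = ldt_choice()  # "raise"
    cdt = cdt_choice()  # "still"
    # Each agent's fixed algorithm is exactly what the past state logically implied.
    print("LDT:", ldt, "->", payoff(ldt, ldt))  # LDT: raise -> 1000000
    print("CDT:", cdt, "->", payoff(cdt, cdt))  # CDT: still -> 1000
```

The asymmetry is entirely in which dependence each agent counts: both face the same payoff table, but only LDT treats the logical fact about the past as something its present, fixed algorithm determines.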
I’ve already tried, when I was younger and better able to learn and less tired. I have no reason to believe things go better on the 13th try.
Using gradient descent was figured out a couple of decades after Perceptrons were invented, according to my memories and according to asking Gemini.
And then they keep going, because otherwise OpenAI will catch up, and then they die. What does mechinterp change about the asymptotic equilibrium as opposed to that particular Tuesday?