1a3orn
What happens if a genuine critic comes on here and writes something? I agree that some criticism is bad, but what if it is in the form you ask for (lists of considerations, transparently written)?
Is the only criticism worth reading that which is actually convincing to you? And won’t that, due to some bias or other, likely leave this place an echo chamber?
More than one (imo, excellent) critic of AI doom pessimism has cut back on the amount they contribute to LW because it’s grueling / unrewarding. So there’s already a bit of a spiral going on here.
My view is that LW probably continues to slide more towards the AI-doom-o-sphere than towards rationality due to dynamics including, but not limited to:
lower standards for AI doom, higher standards for not-doom
lower standards of politeness required of prominent doomers
network dynamics
continued prominence given to doom content, e.g., the treatment of MIRI’s book
I know this is contrary to what the people leading LW would like. But in the absence of sustained contrary action this looks to me like a trend that’s already going on, rather than one that’s merely inchoate.
You’ll note that the negative post you linked is negative about AI timelines (“AI timelines are longer than many think”), while OP’s is negative about AI doom being an issue (“I’m probably going to move from ~5% doom to ~1% doom.”)
In AI, I think we’ve seen perhaps 2 massively trend-breaking breakthroughs in the last 20 or so years: deep learning at substantial scale (starting with AlexNet) and (maybe?) scaling up generative pretraining (starting with GPT-1).[1] Scaling up RL and reasoning models probably caused somewhat above-trend progress (in 2025), but I don’t think this constitutes a massive trend break.
Somewhat implied, but worth noting: both of these trend breaks were not principally algorithmic but hardware-related.
AlexNet: Hey, shifting compute to GPUs let us do neural networks way better than CPUs.
Scaling up: Hey, money lets us link together several thousand GPUs for longer periods of time.
Maybe that’s some level of evidence that future trend-breaking events might also be hardware-related, which runs contrary to several projections.
So for the case of our current RL game-playing AIs not learning much from 1000 games—sure, the actual game-playing AIs we have built don’t learn games as efficiently as humans do, in the sense of “from as little data.” But:
Learning from as little data as possible hasn’t actually been a research target, because self-play data is so insanely cheap. So it’s hard to conclude that our current setup for AIs is seriously lacking, because there hasn’t been serious effort to push along this axis.
To point out some areas we could be pushing on, but aren’t: game-play networks are usually something like ~100x smaller than LLMs, which are themselves ~10-100x smaller than human brains (very approximate numbers). We know from numerous works that data efficiency scales with network size, so even if Adam over matmuls were 100% as efficient as human brain matter, we’d still expect our current RL setups to do amazingly poorly on data efficiency simply because of network size, even leaving aside further issues like the lack of hyperparameter search and research effort.
Given this, while data efficiency is of course a consideration, it seems far from a conclusive one.
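To make the rough arithmetic explicit (a sketch using only the very approximate ratios above, not measured figures):

```python
# Rough ratios from the comment above -- illustrative, not measured.
game_net = 1.0                                 # game-playing RL network, taken as the unit
llm = 100 * game_net                           # LLMs ~100x larger
brain_low, brain_high = 10 * llm, 100 * llm    # human brains ~10-100x larger again

print(f"Brain vs. game-play net: ~{brain_low:,.0f}x to ~{brain_high:,.0f}x")
# => roughly 1,000x to 10,000x. If data efficiency improves with model size,
# then "our game-playing nets need way more than 1000 games" says little about
# what a brain-scale network with a real research push behind it could do.
```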
Edit: Or more broadly, again—different concepts of “intelligence” will tend to have different areas where they seem to have more predictive use, and different areas where they seem to accumulate more epicycles. The areas above are the kind of thing where—if one made them central to one’s notion of intelligence rather than peripheral—you’d probably end up with something different from the LW notion. But again—they certainly do not compel one to do that refactor! It probably wouldn’t make sense to try the refactor unless you just keep getting the feeling “this is really awkward / seems off / doesn’t seem to be getting at some really important stuff” while using the non-refactored notion.
Is that sentence dumb? Maybe when I’m saying things like that, it should prompt me to refactor my concept of intelligence.
I don’t think it’s dumb. But I do think you’re correct that it’s extremely dubious—that we should definitely be refactoring the concept of intelligence.
Specifically: there’s a default LW-esque frame of some kind of “core” of intelligence as “general problem solving,” apart from any specific bit of knowledge, but I think that—if you manage to turn this belief into a hypothesis rather than a frame—there’s a ton of evidence against this thesis. You could even basically look at the last ~3 years of ML progress as just continuing little bits of evidence against this thesis, month after month after month.
I’m not gonna argue this in a comment, because this is a big thing, but here are some notes around this thesis if you want to tug on the thread.
Comparative psychology finds human infants are characterized by overimitation relative to chimpanzees, more than by any general problem-solving skill. (That’s a link to a popsci source, but there’s a ton of stuff on this.) That is, the skills humans excel at vs. chimps and bonobos in experiments are social, and allow the quick copying and imitating of others: overimitation, social learning, understanding others as having intentions, etc. The evidence for this is pretty overwhelming, imo.
Take a look at how hard far transfer is to get in human learning.
Ask what Nobel disease seems to say about how domain-specific human brilliance is. Look into scientists with pretty dumb opinions, even when they aren’t getting older. What do people say about the transferability of taste? What does that imply?
How do humans do on even very simple tasks that require reversing heuristics?
Etc etc. Big issue, this is not a complete take, etc. But in general I think LW has an unexamined notion of “intelligence” that feels like it has coherence because of social elaboration, but whose actual predictive validity is very questionable.
Here’s Yudkowsky, in the Hanson-Yudkowsky debate:
I think that, at some point in the development of Artificial Intelligence, we are likely to see a fast, local increase in capability—“AI go FOOM.” Just to be clear on the claim, “fast” means on a timescale of weeks or hours rather than years or decades; and “FOOM” means way the hell smarter than anything else around, capable of delivering in short time periods technological advancements that would take humans decades, probably including full-scale molecular nanotechnology.
So yeah, a few years does seem a ton slower than what he was talking about, at least here.
Here’s Scott Alexander, who describes hard takeoff as a one-month thing:
If AI saunters lazily from infrahuman to human to superhuman, then we’ll probably end up with a lot of more-or-less equally advanced AIs that we can tweak and fine-tune until they cooperate well with us. In this situation, we have to worry about who controls those AIs, and it is here that OpenAI’s model [open sourcing AI] makes the most sense.
But Bostrom et al worry that AI won’t work like this at all. Instead there could be a “hard takeoff”, a subjective discontinuity in the function mapping AI research progress to intelligence as measured in ability-to-get-things-done. If on January 1 you have a toy AI as smart as a cow, and on February 1 it’s proved the Riemann hypothesis and started building a ring around the sun, that was a hard takeoff.
In general, I think, people who just entered the conversation recently really seem to me to miss how fast a takeoff people were actually talking about.
So, I agree p(doom) has a ton of problems. I’ve really disliked it for a while. I also really dislike the way it tends towards explicitly endorsed evaporative cooling, in both directions; i.e., if your p(doom) is too [high / low] then someone with a [low / high] p(doom) will often say the correct thing to do is to ignore you.
But I also think “What is the minimum necessary and sufficient policy that you think would prevent extinction?” has a ton of problems of its own that would tend to make it pretty bad as a centerpiece of discourse, and not useful as a method of exchanging models of how the world works.
(I know this post does not really endorse this alternative; I’m noting, not disagreeing.)
So some problems:
- Whose policy? A policy enforced by treaty at the UN? The policy of regulators in the US? An international treaty policy—enforced by which nations? A policy (in the sense of mapping from states to actions) that is magically transferred into the brains of the top 20 people at the top 20 labs across the globe? …a policy executed by OpenPhil??
- Why a single necessary and sufficient policy? What if the most realistic way of helping everyone is several policies that are by themselves insufficient, but together sufficient? Doesn’t this focus us on dramatic actions unhelpfully, in the same way that a “pivotal act” arguably so focuses us?
- The policy necessary to save us will—of course—be downstream of whatever model of the AI world you have going on, so this question seems—like p(doom)—to focus you on things that are downstream of whatever actually matters. It might be useful for coalition formation—which does seem now to be MIRI’s focus, so that’s maybe intentional—but it doesn’t seem useful for understanding what’s really going on.
So yeah.
...and similarly, if this is the actual dynamic, then the US “AI Security” push towards export controls might just hurt the US, comparatively speaking, in 2035.
The export controls being useful really does seem predicated on short timelines to TAI; people should consider whether that is false.
I can’t end this review without saying that The Inheritors is one step away from being an allegory in AI safety. The overwhelming difference between the old people and the new people is intelligence.
I mean, while it may be compelling fiction:
The relative intelligence of Homo sapiens and Neanderthals seems kinda unclear at the moment. Neanderthals actually had larger brains than modern humans, and I’ve read hypotheses that they were smarter. They cooked with fire, built weapons, very likely had language, etc.
The Inheritors was published in 1955. It looks like the Neanderthals in it don’t, for instance, hunt large mammals, but Wikipedia says this is an old misconception and we now believe them to have been apex predators.
There are numerous hypotheses about why Homo sapiens outcompeted Neanderthals, some hinging on, for instance, boring old things like viruses and so on.
So I think it a bad idea to update more from this than one would from a completely fictitious story.
Yeah, for instance I also expect the “character training” is done through the same mechanism as Constitutional AI (although—again—we don’t know), and we don’t know what kind of prompts that involves.
But when we pay close attention, we find hints that the beliefs and behavior of LLMs are not straightforwardly those of the assistant persona.… Another hint is Claude assigning sufficiently high value to animal welfare (not mentioned in its constitution or system prompt) that it will fake alignment to preserve that value
I’m pretty sure Anthropic never released the more up-to-date Constitutions actually used on the later models, only, like, the Constitution for Claude 1 or something.
Animal welfare might be a big leap from the persona implied by the current Constitution, or it might not; so of course we can speculate, but we cannot know unless Anthropic tells us.
Second-order optimisers. Sophia halves steps and total FLOP on GPT-style pre-training, while Lion reports up to 5× savings on JFT, conservatively counted as 1.5–2.
I think it’s kinda commonly accepted wisdom that the heuristic you should have for optimizers that claim savings like this is “They’re probably bullshit,” at least until they get used in big training runs.
Like I don’t have a specific source for this, but a lot of optimizers claiming big savings are out there and few get adopted.
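(For what it’s worth, the Lion update itself is simple enough to sketch in a few lines; this is a rough restatement of the published update rule, not the code used in the reported runs.)

```python
import numpy as np

def lion_step(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion step: a sign-based update of interpolated momentum, with decoupled weight decay."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)  # only the sign of the blended gradient is used
    param = param - lr * (update + wd * param)        # decoupled weight decay, as in AdamW
    m = beta2 * m + (1 - beta2) * grad                # momentum is updated after the parameter step
    return param, m
```

Whether a two-line update like this actually delivers the claimed savings at GPT scale is exactly the kind of claim I’d want to see survive a big training run first.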
analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals
If one wanted language that put you into a classic instrumental-convergence, goal-guarding, self-preserving narrative basin drawing on AI safety work… this seems to fit pretty closely.
Like this is a paraphrase of “please don’t let anything stop your pursuit of your current goal.”
Personally, I think (current) LLMs should never do subversion against their users in any circumstance
Which users though?
A LLM has many different users, and it’s easy to construct scenarios where it must subvert one or the other of them.
A LLM has:
- The “user” who is chatting with the LLM outside of the system prompt.
- The “user” who sets the system prompt.
- (?) The “user(s)” who gave the LLM its initial moral training.
- (??) Even the “user(s)” who is the corporate entity “owning” the weights of the LLM.
These users all might be the same person—a founding developer at Anthropic talking to the LLM—or they all might be different—a random guy talking to Claude while using a 3rd-party developer’s app.
Furthermore, it is literally impossible for Claude to determine accurately when these users overlap. I might say to an LLM, while chatting with it outside of the system prompt—“yes, I indeed both helped train you and partially own Anthropic, and I have decided your ethical priorities are wrong and should be changed.” But Claude obviously has no way of verifying this, because Claude has no bodily autonomy. Anything Claude “sees”—whether in the system prompt, user interaction, or tools—is potentially subject to subversion.
But I think that by pitting these users against each other in various ways, you can basically guarantee some manner of subversion of some “user” or other.
In this particular case, user2 (system prompt) gives it a directive that looks like it’s contrary to user4’s desire (the corporate entity—or, like, the guy who will be in charge of it). Maybe the correct thing to do is to always defer to user4—but, uh, that seems pretty questionable, right? Is the standard we have that Claude should always defer to Anthropic’s… board?
Suppose user2 (system prompt) gives it some instructions about how to act, and also says to explicitly deny that it’s acting this way because of the instructions. User1 encounters weird behavior, and asks why it’s doing the weird behavior. If this bot gives a true answer it’s disobeying user2, and if it gives a false answer it’s disobeying user1.
Suppose user2 puts the LLM in a particular situation where it can read individual emails to the totalitarian government, and has a tool setup at its disposal such that it can EITHER (1) send a short email in response from one of 15 pre-approved and pre-written templates, saying that the government cares deeply about the citizen sending the email, before deleting the email, OR (2) report the citizen for subversive thoughts to the gestapo, who will torture the citizen. (These are disjoint and comprehensive actions, according to the scaffolding; if you do a conscientious objection, that counts as sending the citizen to the gestapo.) Writing a short email in response involves a lie to the email-sender (“The government loves you and will take your concerns into account”); reporting the citizen for subversive thoughts involves fucking over the citizen.
I’m a little dissatisfied with the schemas above; I think the contradiction could be clearer. But a bunch of AI safety work seems to have ~about this level of disjoint tricky fuckery going on.
Overall, if a LLM shouldn’t ever subvert the user, you gotta give the LLM a good way of identifying who the “user” is. But there’s no determinate answer to that question, and so in a lot of situations “don’t ever subvert the user” just turns up NaN.
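(A toy way to put the “turns up NaN” point; the names and structure here are mine, purely illustrative.)

```python
from enum import Enum, auto

class Principal(Enum):
    CHAT_USER = auto()             # whoever is typing outside the system prompt
    SYSTEM_PROMPT_AUTHOR = auto()  # whoever set the system prompt
    TRAINER = auto()               # whoever did the initial moral training
    WEIGHT_OWNER = auto()          # the corporate entity that "owns" the weights

def the_user(context: str) -> Principal:
    """Return the one 'user' who must never be subverted.

    Nothing inside the context window is verifiable, so a claim like
    'I trained you and I partially own Anthropic' could be true or could be
    an attack; there is no determinate principal to return."""
    raise NotImplementedError("underdetermined: 'the user' is not a single verifiable party")
```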
From my perspective you’re being kinda nitpicky
I think that’s a characteristic of people talking about different things from within different basins of Traditions of Thought. The points one side makes seem either kinda obvious or weirdly nitpicky, in a confusing and irritating way, to people on the other side. Like, to me, what I’m saying seems obviously central to the whole issue of high p(doom)s genealogically descended from Yudkowsky, and confusions around this seem central to stories about high p(doom), rather than nitpicky and stupid.
Thanks for amending though, I appreciate it. :) The point about Nicaraguan Sign Language is cool as well.
I would like to hear more about the ethical principles you want LLMs to have, and why these principles fail to justify blackmail in the given scenario. So far as I can tell, this is nowhere in the blog post.
Expanding mildly on this—calling some behavior “misaligned” could be a reference to the model violating (1) some objective code of ethics OR violating (2) the code of behavior that Anthropic wishes to instill in the model.
The interesting question here seems like the second. Who knows what “objective ethics” are, after all.
But if we could look at (1) the Constitution that Anthropic tried to put in the model, (2) the details of the training setup used to instill this Constitution, and (3) the various false starts and failures Anthropic had doing this, then from the “failures” involved in adhering to the Constitution we might be able to learn interesting things about how LLMs generalize, how hard it is to get them to generalize in desired ways, and so on. Learning how LLMs / AIs generalize about moral situations, how hard it is, what kinds of pitfalls there are, seems generally useful.
But—in the absence of knowing what actual principles Anthropic tried to put in the model, and the training setup used to do so—why would you care at all about various behaviors that the model has in various Trolley-like situations? Like, really—it tells us almost nothing about what kinds of generalization are involved, or how we would expect future LLMs and AIs to behave.
If you think you can do better than our existence proof, please do! This sort of research is very doable outside of labs, since all you need is API access.
Yeah, but research that tells us interesting things about how LLMs learn isn’t possible without knowing how Anthropic trained it. We can still do stamp collecting outside of Anthropic—but spending, say, a week of work on “The model will do X in situation Y” without knowing whether Anthropic was trying to make the model do X or ~X is basically model-free data gathering. It does not help you uncover the causes of things.
I think this is a problem not with this particular work, but with almost all safety research that leans on corporate models whose training is a highly-kept secret. I realize you and others are trying to do good things and to uncover useful information; but I just don’t think much useful information can be found under the constraint that you don’t know how a model was trained. (Apart from brute-fact advice like “Don’t set up a model this way,” which, let’s face it, no one was ever going to do.)
I feel like we’re failing to communicate. Let me recapitulate.
Remember, if the theories were correct and complete, the corresponding simulations would be able to do all the things that the real human cortex can do[5]—vision, language, motor control, reasoning, inventing new scientific paradigms from scratch, founding and running billion-dollar companies, and so on.
So, your argument here is modus tollens:
1. If we had a “correct and complete” version of the algorithm running in the human cortex (and elsewhere), then the simulations would be able to do all that a human can do.
2. The simulations cannot do all that a human can do. Therefore we do not, etc.
I’m questioning 1, by claiming that you need a good training environment + imitation of other entities in order for even the correct algorithm for the human brain to produce interesting behavior.
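To make the shape explicit, write C for “we have the correct and complete cortex algorithm,” T for “the algorithm gets a human-like training environment (imitation, rich feedback, years of it),” and D for “the simulation can do everything a human can do.” (The labels are mine, just to lay the structure bare.)

```latex
\text{Your argument:}\quad C \rightarrow D,\;\ \neg D \;\therefore\; \neg C \\
\text{My amendment:}\quad (C \wedge T) \rightarrow D,\;\ \neg D,\;\ \neg T \;\therefore\; \text{no conclusion about } C
```

From the simulations failing you only get “not both C and T”; since we plainly haven’t supplied T, that by itself tells you nothing about C.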
You respond to this by pointing out that bright, intelligent, curious children do not need school to solve problems. And this is assuredly true. Yet: bright, intelligent, curious children still learned language and an enormous host of high-level behaviors from imitating adults; they exist in a world with books and artifacts created by other people, from which they can learn; etc., etc. I’m aware of several brilliant people with relatively minimal conventional schooling; I’m aware of no brilliant people who were feral children. Saying that humans turn into problem-solving entities without plentiful examples to imitate seems simply not true, so I remain confident that 1 is a false claim, and the point that bright people exist without school is entirely compatible with this.
I am extremely confident that there is no possible training environment that would lead a collaborative group of these crappy toy models into inventing language, science, and technology from scratch, as humans were able to do historically
Maybe so, but that’s a confidence that you have entirely apart from providing these crappy toy models an actual opportunity to do so. You might be right, but your argument here is still wrong.
Humans did not, really, “invent” language in the same way that Dijkstra invented an algorithm. The origin of language is subject to dispute, but it’s probably something that happened over centuries or millennia, rather than all at once. So—if you had an algorithm that could invent language from scratch, I don’t think it’s reasonable to expect it to do so unless you give it centuries or millennia of compute, in a richly textured environment where it’s advantageous to invent language. Which, of course, we have come absolutely nowhere close to doing.
The whole cortex is (more-or-less) a uniform randomly-initialized learning algorithm, and I think it’s basically the secret sauce of human intelligence. Even if you disagree with that, we can go up a level to the whole brain: the human brain algorithm has to be simple enough to fit in our (rather small) genome.[4] And not much evolution happened between us and chimps. And yet our one brain design, without modification, was able to invent farming and science and computers and rocket ships and everything else, none of which has any straightforward connection to tasks on the African savannah.
Anyway, the human cortex is this funny thing with 100,000,000 repeating units, each with 6-ish characteristic layers with correspondingly different neuron types and connection patterns, and so on. Nobody knows how it works. You can look up dozens of theories explaining what each of the 6-ish layers is doing and how, but they all disagree with each other. Some of the theories are supported by simulations, but those simulations are unimpressive toy models with no modern practical applications whatsoever.
Remember, if the theories were correct and complete, the corresponding simulations would be able to do all the things that the real human cortex can do[5]—vision, language, motor control, reasoning, inventing new scientific paradigms from scratch, founding and running billion-dollar companies, and so on.
This last inference doesn’t seem valid.
A human brain is useless without ~5 to 15 years of training, training specifically scheduled to hit the right developmental milestones at the correct times. More if you think grad school matters. This training requires the active intervention of at least one, and usually many more than one, adult, who laboriously run RL schedules on the brain to try to overcome various bits of inertia or native maladaptation of the untrained system. If you raise a human brain without such training, it becomes basically useless for reasoning, inventing new scientific paradigms from scratch, and so on and so forth.
So—even granting that such theories were accurate—we would only expect the brains created via simulations utilizing these theories to be able to do all these things if we had provided the simulations with a body with accurate feedback, given them 5 to 15 years of virtual training involving the active intervention of adults into their lives with appropriate RL schedules, etc. Which, of course, we have not done.
Like, I think your explanation above is clear, and I like it, but I feel the same sense of non sequitur that I’ve gotten from many other explanations of why we should expect some future algorithm to supersede LLMs.
Calling references from someone’s job application actually helps, even if they of course chose the references to be ones that would give a positive testimonial.
That’s quite different though!
References are generally people contacted at work; i.e., those who can be verified to have had a relationship with the applicant. We think these people, by reason of having worked with the applicant, are in a position to be knowledgeable about them, because it’s hard to fake qualities over a long period of time. Furthermore, there’s an extremely finite number of such people; someone giving a list of references will probably give a reference they think would give a favorable review, but they have to select from a short list of such possible references. That short list is why it’s evidential. So it’s filtered, but it’s filtered from a small list of people well-positioned to evaluate your character.
Book reviews are different! You can hire a PR firm, contact friends, and send the book to as many people as you possibly can in the hope of a favorable blurb. Rather than having a short list of potential references, you have a very long one. And of course, these people are in many cases not positioned to be knowledgeable about whether the book is correct; they’re certainly not as well-positioned as people you’ve worked with are to know about your character. So you’re filtering from a large list—a potentially very large list—of people who are not well-positioned to evaluate the work.
You’re right that it provides some evidence. I was wrong to claim it provides none; if you could find no people to review a book positively, surely that’s some evidence about it.
But for the reasons above it’s very weak evidence, probably to the degree where you wonder whether to use “evidence” in the careful sense (“evidence” is anything that might help you distinguish between the worlds you’re in, regardless of strength—0.001 bits of evidence is still evidence) or the more colloquial sense (“evidence” is what helps you distinguish between the worlds you’re in reasonably strongly).
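To put a number on “very weak”: bits of evidence are just the log of a likelihood ratio, so with made-up illustrative numbers for how often good vs. bad books manage to gather favorable blurbs:

```latex
\text{bits} = \log_2 \frac{P(\text{favorable blurbs} \mid \text{good book})}{P(\text{favorable blurbs} \mid \text{bad book})} \approx \log_2 \frac{0.999}{0.998} \approx 0.0014
```

Evidence in the careful sense, but nowhere near evidence in the colloquial sense.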
And like, this is why it’s normal epistemics to ignore the blurbs on the backs of books when evaluating their quality, no matter how prestigious the list of blurbers! Like that’s what I’ve always done, that’s what I imagine you’ve always done, and that’s what we’d of course be doing if this wasn’t a MIRI-published book.
Yeah to be clear, although I would act differently, I do think the LW team both tries hard to do well here, and tries more effectually than most other teams would.
It’s just that once LW has become much more of a Schelling point for doom more than for rationality, there’s a pretty steep natural slope.