It is too decentralized to qualify as the kind of centralized power that WalterL was talking about, and probably too decentralized to fit the concerns that Gabriel expressed.
So a greater power is necessary to prevent bad actors from concentrating it?
No. Amish society is pretty successful at stopping concentrations of power, mostly via peer pressure.
If we’re being honest, the compensation would probably have to be capped at some maximum amount. If the AIs gave up an 80% chance at world takeover for our benefit, it would probably not be within an AI company’s power to give away 80% of all future resources in compensation (or anything close to that).
It seems pretty hard to predict whether an AI company would have such power in conditions which are that unusual. After all, it would have a pretty powerful AI trying to enforce the agreement.
I don’t see the benefit to setting a cap. Let’s just inform the AI as best we can about the uncertainties involved, and promise to do the best we can to uphold agreements.
AI-Oriented Investments
As a donor, I’m nervous about charities that pay fully competitive wages, although that concern gets only about a 2% weighting in my decisions. If someone could clearly make more money somewhere else, that significantly reduces my concern that they’ll mislead me about the value of their charity.
I’ve found more detailed comments from Sumner on this topic, and replied to them here.
Remember, if the theories were correct and complete, then they could be turned into simulations able to do all the things that the real human cortex can do[5]—vision, language, motor control, reasoning, inventing new scientific paradigms from scratch, founding and running billion-dollar companies, and so on.
So here is a very different kind of learning algorithm waiting to be discovered
There may be important differences in the details, but I’ve been surprised by how similar the behavior is between LLMs and humans. That surprise comes despite my having suspected for decades that artificial neural nets would play an important role in AI.
It seems far-fetched that a new paradigm is needed. Saying that current LLMs can’t build billion-dollar companies seems a lot like saying that 5-year-old Elon Musk couldn’t build a billion-dollar company. Musk didn’t seem to need a paradigm shift to get from the abilities of a 5-year-old to those of a CEO. Accumulation of knowledge seems like the key factor.
But thanks for providing an argument for foom that is clear enough that I can be pretty sure why I disagree.
Are Intelligent Agents More Ethical?
They’ve done even better over the past week. I’ve written more on my blog.
I’ve donated $30,000.
The budget is attempting to gut nuclear
Yet the stock prices of nuclear-related companies that I’m following have done quite well this month (e.g. SMR). There doesn’t seem to be a major threat to nuclear power.
I expect deals between AIs to make sense at the stage that AI 2027 describes because the AIs will be uncertain what will happen if they fight.
If AI developers expected winner-take-all results, I’d expect them to be publishing less about their newest techniques, and complaining more about their competitors’ inadequate safety practices.
Beyond that, I get a fairly clear vibe that’s closer to “this is a fascinating engineering challenge” than to “this is a military conflict”.
AI 2027 Thoughts
Should AIs be Encouraged to Cooperate?
This reminds me a lot of what people said about Amazon near the peak of the dot-com bubble (and also of what people said at the time about internet startups that actually failed).
The first year or two of human learning seem optimized enough that they’re mostly in evolutionary equilibrium—see Henrich’s discussion of the similarities to chimpanzees in The Secret of Our Success.
Human learning around age 10 is presumably far from equilibrium.
I’d guess that I see more of the valuable learning as taking place in the first two years or so than other people here do.
I agree with most of this, but the 13 OOMs from the software feedback loop sounds implausible.
From How Far Can AI Progress Before Hitting Effective Physical Limits?:
the brain is severely undertrained, humans spend only a small fraction of their time on focussed academic learning
I expect that humans spend at least 10% of their first decade building a world model, and that evolution has heavily optimized at least the first couple of years of that. A large improvement in school-based learning wouldn’t have much effect on my estimate of the total learning needed.
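To make that estimate more concrete, here is a minimal back-of-the-envelope sketch; the waking-hours and school-hours figures are my own illustrative assumptions, chosen only to show the rough proportions:

```python
# Back-of-the-envelope comparison of world-model building vs. school learning.
# All constants below are illustrative assumptions, not measured values.

waking_hours_per_year = 365 * 14                    # ~14 waking hours per day
decade_waking_hours = 10 * waking_hours_per_year    # ~51,000 hours in the first decade

# "at least 10% of the first decade building a world model"
world_model_hours = 0.10 * decade_waking_hours      # ~5,100 hours

# Focused academic learning, roughly ages 5 to 10:
# ~180 school days/year * ~3 genuinely focused hours/day * 5 years
school_hours = 180 * 3 * 5                          # ~2,700 hours

print(f"world-model building: ~{world_model_hours:,.0f} hours")
print(f"focused academic learning: ~{school_hours:,.0f} hours")
```

Under those assumptions, even a large improvement in the efficiency of the school hours leaves most of the learning budget (the world-model hours, plus the rest of the informal learning) untouched, which is why it barely moves my estimate of the total learning needed.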
This general idea has been discussed under the term myopia.
I’m assuming that the AI can accomplish its goal by honestly informing governments. Possibly that would include some sort of demonstration of the AI’s power that would provide compelling evidence that the AI would be dangerous if it weren’t obedient.
I’m not encouraging you to be comfortable. I’m encouraging you to mix a bit more hope in with your concerns.
[Cross-posted from my blog.]
A group of people from MIRI have published a mostly good introduction to the dangers of AI: The Problem. It is a step forward in improving the discussion of catastrophic risks from AI.
I agree with much of what MIRI writes there. I strongly agree with their near-term policy advice of prioritizing the creation of an off switch.
I somewhat disagree with their advice to halt (for a long time) progress toward ASI. We ought to make preparations in case a halt turns out to be important. But most of my hopes route through strategies that don’t need a halt.
A halt is both expensive and risky.
My biggest difference with MIRI is about how hard it is to adequately align an AI. Some related differences involve the idea of a pivotal act, and the expectation of a slippery slope between human-level AI and ASI.
Important Agreement
This is an important truth, that many people reject because they want it not to be true.
The default outcome if we’re careless about those goals might well be that AIs conquer humans.
This is a good way to frame a key part of MIRI’s concern. We should be worried that current AI company strategies look somewhat like this. But the way that we train dogs seems like a slightly better analogy for how AI training is likely to work in a few years. That’s not at all sufficient by itself for us to be safe, but it has a much better track record for generalized loyalty than training tigers.
Can We Stop Near Human-Level?
This seems true for weak meanings of “likely” or “quickly”. That is enough to scare me. But MIRI hints at a near-inevitability (or slippery slope) that I don’t accept.
I predict that it will become easier to halt AI development as AI reaches human levels, and continue to get easier for a bit after that. (But probably not easy enough that we can afford to become complacent.)
Let’s imagine that someone produces an AI that is roughly as intellectually capable as Elon Musk. Is it going to prioritize building a smarter AI? I expect it will be more capable of evaluating the risks than MIRI is today, due in part to it having better evidence than is available today about the goals of that smarter AI. If it agrees with MIRI’s assessment of the risk, wouldn’t it warn (or sabotage?) developers instead? Note that this doesn’t require the Musk-level AI to be aligned with humans—it could be afraid that the smarter AI would be unaligned with the Musk-level AI’s goals.
There are a number of implicit ifs in that paragraph, such as whether progress produces a Musk-level AI before producing an ASI. But I don’t think my weak optimism here requires anything far-fetched. Even if the last AI before we reach ASI is less capable than Musk, it will have a significant understanding of the risks, and will likely have a good enough track record that developers will listen to its concerns.
[How hard would it be to require AI companies to regularly ask their best AIs how risky it is to build their next AI?]
I suspect that some of the sense of inevitability comes from the expectation that the arguments for a halt are as persuasive now as they will ever be.
On the contrary, I see at least half of the difficulty in slowing progress toward ASI as due to the average voter and average politician believing that AI progress is mostly hype. Even superforecasters have tended to dismiss AI progress as hype.
I’m about 85% confident that before we get an AI capable of world conquest, we’ll have an AI that is capable of convincing most voters that AI is powerful enough to be a bigger concern than nuclear weapons.
MIRI is focused here on dispelling the illusion that it will be technologically hard to speed past human intelligence levels. The main point of my line of argument is that we should expect some changes in willingness to accelerate, hopefully influenced by better analyses of the risks.
I’m unsure whether this makes much difference for our strategy. It’s hard enough to halt AI progress that we’re more likely to achieve it just in the nick of time than too early. The main benefit of thinking about doing a halt when AI is slightly better than human is that it opens up better possibilities for enforcing the halt than we’ll envision if we imagine that the only time for a halt is before AI reaches human levels.
I’m reminded of the saying “You can always count on Americans to do the right thing—after they’ve tried everything else.”
Alignment Difficulty
MIRI’s advice depends rather heavily on the belief that we’re not at all close to solving alignment. I, in contrast, am about 70% confident that we already have the basic ideas needed for alignment, and that a large fraction of the remaining difficulty involves distinguishing the good ideas from the bad ones, and assembling as many of the good ideas as we can afford into an organized strategy. (I don’t think this is out of line with expert opinion on the subject. However, the large range of expert opinions worries me a good deal.)
[The Problem delegates most discussion of alignment difficulty to the AGI Ruin page, which is a slightly improved version of Eliezer’s AGI Ruin: A List of Lethalities. This section of my post is mostly a reply to that.]
No! It only looks that way because you’ve tried to combine corrigibility with a conflicting utility function.
That describes some attempts at corrigibility, in particular those which give the AI additional goals that are not sub-goals of corrigibility. Max Harms’ CAST avoids this mistake.
Corrigibility creates a basin of attraction that increases the likelihood of getting a good enough result on the first try, and mitigates MIRI’s concerns about generalizing out of distribution.
There are still plenty of thorny implementation details, and concerns about who should be allowed to influence a corrigible AGI. But it’s hard to see how a decade of further research would produce new insights that can’t be found sooner.
Another way that we might be close to understanding how to create a safe ASI is Drexler’s CAIS, which roughly means keeping AI goals very short-term and tool-like.
I’m guessing that MIRI’s most plausible objection is that AIs created this way wouldn’t be powerful enough to defend us against the more agentic AIs that are likely to be created. MIRI is probably wrong about that, due to some false assumptions about the relevant coordination problems.
MIRI often talks about pivotal acts such as melting all GPUs. I expect defense against bad AIs to come from pivotal processes that focus on persuasion and negotiation, and to require weaker capabilities than what’s needed for melting GPUs. Such pivotal processes should be feasible earlier than I’d expect an AI to be able to melt GPUs.
How does defending against bad AI with the aid of human-level CAIS compare to MIRI’s plan to defend by halting AI progress earlier? Either way, I expect the solution to involve active enforcement by leading governments.
The closer the world gets to ASI, the better surveillance is needed to detect and respond to dangers. And maybe more regulatory power is needed. But I expect AI to increasingly help with those problems, such that pivotal processes which focus on global agreements to halt certain research become easier. I don’t see a clear dividing line between proposals for a halt now, and the pivotal processes that would defend us at a later stage.
I’ll guess that MIRI disagrees, likely because it assigns a much higher probability than I do to a large leap in AI capabilities producing a world-conquering agent before human-level CAIS has enough time to implement defenses.
The CAIS strategy is still rather tricky to implement. CAIS development won’t automatically outpace the development of agentic AI. So we’ll need either some regulation, or a further fire alarm that causes AI companies to become much more cautious.
It is tricky to enforce a rule that prohibits work on more agentic AIs, but I expect that the CAIS systems of 2027 will be wise enough to do much of the needed evaluation of whether particular work violates such a rule.
Corrigibility and CAIS are the two clearest reasons why I’m cautiously optimistic that non-catastrophic ASI is no harder than the Manhattan and Apollo projects. Those two reasons make up maybe half of my reasoning here. I’ve focused on them because the other reasons involve a much wider range of weaker arguments that are harder to articulate.
Alas, there’s a large gap between someone knowing the correct pieces of a safe approach to AI, and AI companies implementing them. Little in current AI company practices inspires confidence in their ability to make the right choices.
Conclusion
Parts of The Problem are unrealistically pessimistic. Yet the valid parts of their argument are robust enough to justify being half as concerned as they are. My policy advice overlaps a fair amount with MIRI’s advice:
Creating an off switch should be the most urgent policy task.
Secondly, require AI companies to regularly ask their best AIs how risky it is to create their next AI. Even if it only helps a little, the benefits ought to easily outweigh the costs.
Policy experts ought to be preparing for ways to significantly slow or halt a subset of AI development for several years. Ideally this should focus on restricting agentic AI, while exempting CAIS. The political climate has a decent chance of becoming ripe for this before the end of the decade. The timing is likely to depend heavily on what accidents AIs cause.
The details of such a halt should depend somewhat on advice given by AIs near the time of the halt.
None of these options are as safe as I would like.
A halt carries its own serious risks: black market development without safety constraints, and the possibility that when development resumes it will be faster and less careful than continued cautious progress would have been. [These concerns deserve their own post, but briefly: halts are unstable equilibria that may make eventual development more dangerous rather than less.]
When I started to write this post, I planned to conclude that I mostly agreed with MIRI’s policy advice. But now I’ve decided that the structural similarities are masking a dramatic difference in expected cost. I anticipate that the tech industry will fight MIRI’s version much more strongly than they will resist mine. That leaves me with conflicting feelings about whether to treat MIRI’s position as allied versus opposed to mine.
I expect that as we get more experience with advanced AIs, we will get more information that is relevant to deciding whether a halt is desirable. Let’s not commit ourselves so strongly on any particular policy that we can’t change our minds in response to new evidence.
P.S. I asked Gemini 2.5 Pro to guess how Eliezer would react to Max Harms’ CAST. It was sufficiently confused about CAST that I gave up; it imagined that the key advantage was that the AI had a narrow goal. Claude Opus 4.1 did better—I needed to correct one misunderstanding of CAST, then it gave some non-embarrassing guesses.