I haven’t paid much attention to the formalism. It’s unclear why formalism would be important under current approaches to implementing AI.
The basin of attraction metaphor is an imperfect way of communicating an advantage of corrigibility. A better metaphor would convey a somewhat weaker and less reliable advantage, but that advantage is still important.
The feedback loop issue seems like a criticism of current approaches to training and verifying AI, not of CAST. This issue might mean that we need a radical change in architecture. I’m more optimistic than Max about the ability of some current approaches (e.g. Constitutional AI) to generalize well enough that we can delegate the remaining problems to AIs that are more capable than we are.
I agree with all your points except this:
I expect there’s lots of room to disguise distributed AIs so that they’re hard to detect.
Maybe there’s some level of AI capability at which the good AIs can do an adequate job of policing a slowdown. But I don’t expect a slowdown that starts today to remain stable for more than 5 to 10 years.