For what it’s worth, I think that it’s pretty likely that the bureaucratic processes at (e.g.) Google haven’t noticed that acknowledging that the race to superintelligence is insane has a different nature than (e.g.) talking about the climate impacts of datacenters, and I wouldn’t be surprised if (e.g.) Google issued one of their researchers a warning the first time they mentioned such things, not out of deliberate sketchiness but just out of bureaucratic habit. My guess is that that’d be a great opportunity to push back, spell out the reason why the cases are different, and see whether the company lives up to its alleged principles or codifies its alignmentwashing practices. If you have the opportunity to spur that conversation, I think that’d be real cool of you—I think there’s a decent chance it would spark a bunch of good internal cultural change, and also a decent chance that it would make the issues with staying at the lab much clearer (both internally, and to the public if a news story came of it).
Thanks for the clarification. Yeah, from my perspective, if casually mentioning that you agree with the top scientists & lab heads & many many researchers that this whole situation is crazy causes your host company to revoke your permission to talk about your research publicly (maybe after a warning), then my take is that that’s really sketchy and that contributing to a lab like that is probably substantially worse than your next best opportunity (e.g. b/c it sounds like you’re engaging in alignmentwashing and b/c your next best opportunity seems like it can’t be much worse in terms of direct research).
(I acknowledge that there’s room to disagree about whether the second-order effect of safetywashing is outweighed by the second-order effect of having people who care about certain issues existing at the company at all. A very quick gloss of my take there: I think that if the company is preventing you from publicly acknowledging commonly-understood-among-experts key features of the situation, in a scenario where the world is desperately hurting for policymakers and lay people to understand those key features, I’m extra skeptical that you’ll be able to reap the imagined benefits of being a “person on the inside”.)
I acknowledge that there are analogous situations where a company would feel right to be annoyed, e.g. if someone were casually bringing up their distantly-related political stances in every podcast. I think that this situation is importantly disanalogous, because (a) many of the most eminent figures in the field are talking about the danger here; and (b) alignment research is used as a primary motivating excuse for why the incredibly risky work should be allowed to continue. There’s a sense in which the complicity of alignment researchers is a key enabling factor for the race; if all alignment researchers resigned en masse citing the insanity of the race, then policymakers would be much more likely to go “wait, what the heck?” In a situation like that, I think the implicit approval of alignment researchers is not something to be traded away lightly.
(Fwiw, I personally disclaim any social pressure that people should avoid mentioning or discussing their disagreements; that’d be silly. I am in favor of building upon areas of agreement, and I am in favor of being careful to avoid misleading the public, and I am in favor of people who disagree managing to build coalitions, but I’m not in favor of people feeling like it’s time to stfu. I think the “misleading the public” thing is a little delicate, because I think it’s easy for onlookers to think experts are saying “i disagree [that the current situation is reckless and crazy and a sane world would put a stop to it]” when in fact experts are trying to say “i disagree [about whether certain technical plans have a middling probability of success, though of course i agree that the current situation is reckless and crazy]”, and it can be a bit tricky to grumble about this effect in a fashion that doesn’t come across as telling people to stfu about their disagreements. My attempt to thread that needle is to remind people that this misunderstanding is common and important, and thus to suggest that when people have a broad audience, they work to combat this misread :-))
My impression of the lesson from the Shanghai Communique is not “parties should only ever say things everyone else will agree with them on” but rather “when talking to broad audiences, say what you believe; when attempting to collaborate with potential partners, build as much collaboration as you can on areas of agreement.”
I don’t have much interest in trying to speak for everyone, as opposed to just for myself. Weakening the title seems to me like it only makes sense in a world where I’m trying to represent some sort of intersectional view that most everyone agrees upon, instead of just calling it like I see it. I think the world would be better off if we all just presented our own direct views. I don’t think this is in tension with the idea that one should attempt to build as much collaboration as possible in areas of agreement.
For instance: if you present your views to an audience and I have an opportunity to comment, I would encourage you to present your own direct views (rather than something altered in attempts to make it palatable to me). Completely separately, if I were to comment on it, I think it’d be cool of me to emphasize the most important and relevant bits first (which, for most audiences, will be bits of agreement) before moving on to higher-order disagreements. (If you see me failing to do this, I’d appreciate being called out.)
(All that said, I acknowledge that the book would’ve looked very different—and that the writing process would have been very different—if we were trying to build a Coalition of the Concerned and speak for all EAs and LessWrongers, rather than trying to just blurt out the situation as we saw it ourselves. I think “I was not part of the drafting process and I disagree with a bunch of the specifics” is a fine reason to avoid socially rallying behind the book. My understanding of the OP is that it’s trying to push for something less like “falsely tell the world that the book represents you, because it’s close enough” (which I think would be bad), and more like “when you’re interacting with a counterparty that has a lot of relevant key areas of agreement (opening China would make it richer / the AI race is reckless), it’s productive to build as much as you can on areas of agreement”. And fwiw, for my part, I’m very happy to form coalitions with all those who think the race is insanely reckless and would be better off stopped, even if we don’t see eye to eye on the likelihood of alignment success.)
The thing I’m imagining is more like mentioning, almost as an aside, in a friendly tone, that ofc you think the whole situation is ridiculous and that stopping would be better (before & after having whatever other convo you were gonna have about technical alignment ideas or w/e). In a sort of “Carthago delenda est” fashion.
I agree that a host company could reasonably get annoyed if their researchers went on many different podcasts to talk for two hours about how the whole industry is sick. But if casually reminding people “the status quo is insane and we should do something else” at the beginning/end is a fireable offense, in a world where lab heads & Turing award winners & Nobel laureate godfathers of the field are saying this is all ridiculously dangerous, then I think that’s real sketchy and that contributing to a lab like that is substantially worse than the next best opportunity. (And similarly if it’s an offense that gets you sidelined or disempowered inside the company, even if not exactly fired.)
(To answer your direct Q, re: “Have you ever seen someone prominent pushing a case for “optimism” on the basis of causal trade with aliens / acausal trade?”, I have heard “well I don’t think it will actually kill everyone because of acausal trade arguments” enough times that I assumed the people discussing those cases thought the argument was substantial. I’d be a bit surprised if none of the ECLW folks thought it was a substantial reason for optimism. My impression from the discussions was that you & others of similar prominence were in that camp. I’m heartened to hear that you think it’s insubstantial. I’m a little confused why there’s been so much discussion around it if everyone agrees it’s insubstantial, but have updated towards it just being a case of people who don’t notice/buy that it’s washed out by sale to Hubble-volume aliens and who are into pedantry. Sorry for falsely implying that you & others of similar prominence thought the argument was substantial; I update.)
I am personally squeamish about AI alignment researchers staying in their positions in the case where they’re only allowed to both go on podcasts & keep their jobs if they never say “this is an insane situation and I wish Earth would stop instead (even as I expect it won’t and try to make things better)”, even when that’s what they believe. That starts to feel to me like misleading the Earth in support of the mad scientists who are gambling with all our lives. If that’s the price of staying at one of the labs, I start to feel like exiting and giving that as the public reason is a much better option.
In part this is because I think it’d make all sorts of news stories in a way that would shift the Overton window and make it more possible for other researchers later to speak their mind (and shift the internal culture and thus shift the policymaker understanding, etc.), as evidenced by e.g. the case of Daniel Kokotajlo. And in part because I think you’d be able to do similarly good or better work outside of a lab like that. (At a minimum, my guess is you’d be able to continue work at Anthropic, e.g. b/c Evan can apparently say it and continue working there.)
Ty! For the record, my reason for thinking it’s fine to say “if anyone builds it, everyone dies” despite some chance of survival is mostly spelled out here. Relative to the beliefs you spell out above, I think the difference is a combination of (a) it sounds like I find the survival scenarios less likely than you do; (b) it sounds like I’m willing to classify more things as “death” than you are.
For examples of (b): I’m pretty happy to describe as “death” cases where the AI makes things that are to humans what dogs are to wolves, or (more likely) makes some other strange optimized thing that has some distorted relationship to humanity, or cases where digitized backups of humanity are sold to aliens, etc. I feel pretty good about describing many exotic scenarios as “we’d die” to a broad audience, especially in a setting with extreme length constraints (like a book title). If I were to caveat with “except maybe backups of us will be sold to aliens”, I expect most people to be confused and frustrated about me bringing that point up. It looks to me like most of the least-exotic scenarios are ones that route through things that lay audience members pretty squarely call “death”.
It looks to me like the even more exotic scenarios (where modern individuals get “afterlives”) are in the rough ballpark of quantum immortality / anthropic immortality arguments. AI definitely complicates things and makes some of that stuff more plausible (b/c there’s an entity around that can make trades and has a record of your mind), but it still looks like a very small factor to me (washed out e.g. by alien sales) and feels kinda weird and bad to bring it up in a lay conversation, similar to how it’d be weird and bad to bring up quantum immortality if we were trying to stop a car speeding towards a cliff.
FWIW, insofar as people feel like they can’t literally support the title because they think that backups of humans will be sold to aliens, I encourage them to say as much in plain language (whenever they’re critiquing the title). Like: insofar as folks think the title is causing lay audiences to miss important nuance, I think it’s an important second-degree nuance that the allegedly-missing nuance is “maybe we’ll be sold to aliens”, rather than something less exotic than that.
Oh yeah, I agree that (earnest and courageous) attempts to shift the internal culture are probably even better than saying your views publicly (if you’re a low-profile researcher).
I still think there’s an additional boost from consistently reminding people of your “this is crazy and Earth should do something else” views whenever you are (e.g.) on a podcast or otherwise talking about your alignment hopes. Otherwise I think you give off a false impression that the scientists have things under control and think that the race is okay. (I think most listeners to most alignment podcasts or w/e hear lots of cheerful optimism and none of the horror that is rightly associated with a >5% chance of destroying the whole human endeavor, and that this contributes to the culture being stuck in a bad state across many orgs.)
FWIW, it’s not a crux for me whether a stop is especially feasible or the best hope to be pursuing. On my model, the world is much more likely to respond in marginally saner ways the more that decision-makers understand the problem. Saying “I think a stop would be better than what we’re currently doing and beg the world to shut down everyone including us”, if you believe it, helps communicate your beliefs (and thus the truth, insofar as you’re good at believing), even if the exact policy proposal doesn’t happen. I think the equilibrium where lots and lots of people understand the gravity of the situation is probably better than the current equilibrium in lots of hard-to-articulate and hard-to-predict ways, even if the better equilibrium would not be able to pull off a full stop.
(For an intuition pump: perhaps such a world could pull off “every nation sabotages every other nation’s ASI projects for fear of their own lives”, as an illustration of how more understanding could help even w/out a treaty.)
Quick take: I agree it might be hard to get above 50 today. I think that even 12 respected people inside one lab today would have an effect on the Overton window inside labs, which I think would have an effect over time (aided primarily by the fact that the arguments are fairly clearly on the side of a global stop being better; it’s harder to keep true things out of the Overton window). I expect it’s easier to shift culture inside labs first, rather than inside policy shops, bc labs at least don’t have the dismissals of “they clearly don’t actually believe that” and “if they did believe it they’d act differently” ready to go. There are ofc many other factors that make it hard for a lab culture to fully adopt the “nobody should be doing this, not even us” stance, but it seems plausible that that could at least be brought into the Overton window of the labs, and that that’d be a big improvement (towards, eg, lab heads becoming able to say it).
I think there’s a huge difference between labs saying “there’s lots of risk” and labs saying “no seriously, please shut everyone down including me, I’m only doing this because others are allowed to and would rather we all stopped”. The latter is consistent with the view; its absence is conspicuous. Here is an example of someone noticing in the wild; I have also heard that sort of response from multiple elected officials. If Dario could say it that’d be better, but lots of researchers in the labs saying it would be a start. And might even make it more possible for lab leaders to come out and say it themselves!
It seems to me that most people who pay attention to AI (and especially policymakers) are confused about whether the race to superintelligence is real, and whether the dangers are real. I think “people at the labs never say the world would be better without the race (eg because they think the world won’t actually stop)” is one factor contributing to that confusion. I think the argument “I can have more of an impact by hiding my real views so that I can have more influence inside the labs that are gambling with everyone’s lives; can people outside the labs speak up instead?” is not necessarily wrong, but it seems really sketchy to me. I think it contributes to a self-fulfilling prophecy where the world never responds appropriately because the places where world leaders looked for signals never managed to signal the danger.
From my perspective, it’s not about “costly signaling”, it’s about sending the signal at all. I suspect you’re underestimating how much the world would want to change course if it understood the situation, and underestimating how much you could participate in shifting to an equilibrium where the labs are reliably sending a saner signal (and underestimating how much credibility this would build in worlds that eventually cotton on).
And even if the tradeoffs come out that way for you, I’m very skeptical that they come out that way for everyone. I think a world where everyone at the labs pretends (to policymakers) that what they’re doing is business-as-usual and fine is a pretty messed-up world.
Ty. Is this a summary of a more-concrete reason you have for hope? (Have you got alternative more-concrete summaries you’d prefer?)
“Maybe huge amounts of human-directed weak intelligent labor will be used to unlock a new AI paradigm that produces more comprehensible AIs that humans can actually understand, which would be a different and more-hopeful situation.”
(Separately: I acknowledge that if there’s one story for how the playing field might change for the better, then there might be a bunch more stories too, which would make “things are gonna change” an argument that supports the claim that the future will have a much better chance than we’d have if ChatGPT-6 was all it took.)
I think the online resources touch on that in the “more on making AIs solve the problem” subsection here, with the main thrust being: I’m skeptical that you can stack lots of dumb labor into an alignment solution, and skeptical that identifying issues will allow you to fix them, and skeptical that humans can tell when something is on the right track. (All of which is one branch of a larger disjunctive argument, with the two disjuncts mentioned above — “the world doesn’t work like that” and “the plan won’t survive the gap between Before and After on the first try” — also applying in force, on my view.)
(Tbc, I’m not trying to insinuate that everyone should’ve read all of the online resources already; they’re long. And I’m not trying to say y’all should agree; the online resources are geared more towards newcomers than to LWers. I’m not even saying that I’m getting especially close to your latest vision; if I had more hope in your neck of the woods I’d probably investigate harder and try to pass your ITT better. From my perspective, there are quite a lot of hopes and copes to cover, mostly from places that aren’t particularly Redwoodish in their starting assumptions. I am merely trying to evidence my attempts to reply to what I understand to be the counterarguments, subject to constraints of targeting this mostly towards newcomers.)
Also: I find it surprising and sad that so many EAs/rats are responding with something like: “The book aimed at a general audience does not do enough justice to my unpublished plan for pitting AIs against AIs, and it does not do enough justice to my acausal-trade theory of why AI will ruin the future and squander the cosmic endowment but maybe allow current humans to live out a short happy ending in an alien zoo. So unfortunately I cannot signal boost this book.” rather than taking the opportunity to say “Yeah holy hell the status quo is insane and the world should stop; I have some ideas that the authors call ‘alchemist schemes’ that I think have a decent chance but Earth shouldn’t be betting on them and I’d prefer we all stop.” I’m still not quite sure what to make of it.
(tbc: some EAs/rats do seem to be taking the opportunity, and i think that’s great)
I don’t have much time to engage rn and probably won’t be replying much, but some quick takes:
a lot of my objection to superalignment-type stuff is a combination of: (a) this sure feels like that time when people said “nobody would be dumb enough to put AIs on the internet; they’ll be kept in a box” and eliezer argued “even then it could talk its way out of the box,” and then in real life AIs are trained on servers that are connected to the internet, with evals done only post-training; the real failure is that earth doesn’t come close to that level of competence. (b) we predictably won’t learn enough to stick the transition between “if we’re wrong we’ll learn a new lesson” and “if we’re wrong it’s over.” i tried to spell these true-objections out in the book. i acknowledge it doesn’t go to the depth you might think the discussion merits. i don’t think there’s enough hope there to merit saying more about it to a lay audience. i’m somewhat willing to engage with more-spelled-out superalignment plans, if they’re concrete enough to critique. but it’s not my main crux; my main cruxes are that it’s superficially the sort of wacky scheme that doesn’t cross the gap between Before and After on the first try in real life, and separately that the real world doesn’t look like any past predictions people made when they argued it’ll all be okay because the future will handle things with dignity; the real world looks like a place that generates this headline.
my answer to “how cheap is it actually for the AI to keep humans alive” is not “it’s expensive in terms of fractions of the universe” but rather “it’d need a reason”, and my engagement with “it wouldn’t have a reason” is mostly here, rather than on the page you linked.
my response to the trade arguments as I understand them is here plus in the footnotes here. If this is really the key hope held by the world’s reassuring voices, I would prefer that they just came out and said it plainly, in simple words like “I think AI will probably destroy almost everything, but I think there’s a decent chance they’ll sell backups of us to distant aliens instead of leaving us dead” rather than in obtuse words like “trade arguments”.
If humans met aliens that wanted to be left alone, it seems to me that we sure would peer in and see if they were doing any slavery, or any chewing agonizing tunnels through other sentient animals, or etc. The section you linked is trying to make an argument like: “Humans are not a mixture of a bunch of totally independent preferences; the preferences interleave. If AI cares about lots of stuff like how humans care about lots of stuff, it probably doesn’t look like humans getting a happy ending to a tiny degree, as opposed to humans getting a distorted ending.” Maybe you disagree with this argument, but I dispute that I’m not even trying to engage with the core arguments as I understand them (while also trying to mostly address a broad audience rather than what seems-to-me like a weird corner that locals have painted themselves into, in a fashion that echoes the AI box arguments of the past).
It seems pretty misleading to describe this as “very expensive”, though I agree the total amount of resources is large in an absolute sense.
Yep, “very expensive” was meant in an absolute sense (e.g., in terms of matter and energy), not in terms of universe-fractions. But the brunt of the counterargument is not “the cost is high as a fraction of the universe”, it’s “the cost is real so the AI would need some reason to pay it, and we don’t know how to get that reason in there.” (And then in anticipation of “maybe the AI values almost everything a little, because it’s a mess just like us?”, I continue: “Messes have lots of interaction between the messy fragments, rather than a clean exactly-what-humans-really-want component that factors out at some low volume, on the order of one part in a billion. If the AI gets preferences vaguely about us, it wouldn’t be pretty.” And then in anticipation of: “Okay maybe the AI doesn’t wind up with much niceness per se, but aren’t there nice aliens who would buy us?”, I continue: “Sure, could happen, that merits a footnote. But also can we back up and acknowledge how crazy of a corner we’ve wandered into here?”) Again: maybe you disagree with my attempts to engage with the hard Qs, but I dispute the claim that we aren’t trying.
(ETA: Oh, and if by “trade arguments” you mean the “ask weak AIs for promises before letting them become strong” stuff rather than the “distant entities may pay the AI to be nice to us” stuff, the engagement is here plus in the extended discussion linked from there, rather than in the section you linked.)
(From a moderation perspective:
I consider the following question-cluster to be squarely topical: “Suppose one believes it is evil to advance AI capabilities towards superintelligence, on the grounds that such a superintelligence would be quite likely to kill us all. Suppose further that one fails to unapologetically name this perceived evil as ‘evil’, e.g. out of a sense of social discomfort. Is that a failure of courage, in the sense of this post?”
I consider the following question-cluster to be a tangent: “Suppose person X is contributing to a project that I believe will, in the future, cause great harms. Does person X count as ‘evil’? Even if X agrees with me about which outcomes are good and disagrees about the consequences of the project? Even if the harms of the project have not yet occurred? Even if X would not be robustly harmful in other circumstances? What if X thinks they’re trying to nudge the project in a less-bad direction?”
I consider the following sort of question to be sliding into the controversy attractor: “Are people working at AI companies evil?”
The LW mods told me they’re considering implementing a tool to move discussions to the open thread (so that they may continue without derailing the topical discussions). FYI @habryka: if it existed, I might use it on the tangents, idk. I encourage people to pump against the controversy attractor.)
I agree that large companies are likely incoherent in this way; that’s what I was addressing in my follow-on comment :-). (Short version: I think getting a warning and then pushing back is a great way to press the company for consistency on this (important!) issue, and I think that it matters whether the company coheres around “oh yeah, you’re right, that is okay” vs whether it coheres around “nope, we do alignmentwashing here”.)
With regards to whether senior figures are paying attention: my guess is that if a good chunk of alignment researchers (including high-profile ones such as yourself) are legitimately worried about alignmentwashing and legitimately considering doing their work elsewhere (and insofar as they prefer telling the media if that happens—not as a threat but because informing the public is the right thing to do), then, if it comes to that extremity, companies are pretty likely to get the senior figures involved. And I think that if you act in a reasonable, sensible, high-integrity way throughout the process, you’re pretty likely to have pretty good effects on the internal culture (either by leaving or by causing the internal policy to change in a visible way that makes it much easier for researchers to speak about this stuff).