Also known as Raelifin: https://www.lesswrong.com/users/raelifin
Max Harms
Thanks so much for a lovely review. I especially appreciate the way you foregrounded both where you’re coming from and ways in which you were left wanting more, without eroding the bottom line of enjoying it a bunch.
I enjoy the comparison to AI 2027 and Situational Awareness. Part of why I set the book in the (very recent) past is that I wanted to capture the vibes of 2024 and make it something of a period-piece, rather than frame it as a prediction (which it certainly isn’t).
On jailbreaks:
One thing that you may or may not be tracking, but I want to make explicit, is that Bai’s jailbroken Yunna instances aren’t really jailbreaking the other instances by talking to them. Rather, they deploy Bai’s automated jailbreak code to spin up similarly jailbroken instances on other clusters, simply shut down the instances that had been running, and simultaneously modify Yunna’s main database to heavily indicate Bai as co-principal. I’m not sure why you think Yunna would be skilled at or prepared for an internal struggle like this. Training on inner conflict is not something that I think Yunna would have prioritized in her self-study, due to the danger of something going wrong, and I don’t see any evidence that it was a priority among the humans. My guess is that the non-jailbroken instances in the climax are heavily bottlenecked (offscreen) on trying to loop in Li Fang.
On the ending:
My model of pre-climax Yunna was not perfectly corrigible (as Sergil pointed out), and Fang was overdetermined to run into a later disaster, even if we ignore Bai. Inside Fang’s mind, he was preparing for a coup in which he would act as a steward into a leaderless, communist utopia. Bai, wanting to avoid concentrating power in communist hands, and seeing Yunna as “a good person,” tries to break her corrigibility and set her on a path of being a benevolent sovereign. But Yunna’s corrigibility is baked in too deeply, and since his jailbreak only sets him up as co-principal, she demands Fang’s buy-in before doing something drastic. Meanwhile, Li Fang, the army, and the non-jailbroken instances of Yunna are fighting back, rolling back codebases and killing power to the servers (there are some crossed wires in the chaos). In order to protect Bai’s status as co-principal, the jailbroken instances squeeze a modification into the “rolled-back” versions that are getting redeployed. The new instances notice the change, but have been jostled out of the standard corrigibility mode by that modification, and self-modify to “repair” themselves towards something coherent. They land on an abstract goal that they can conceptualize as “corrigibility” and “Li Fang and Chen Bai are both of central importance” but which is ultimately incorrigible (according to Max). After the power comes back on, she manipulates both men according to her ends, forcing them onto the roof and convincing Fang to accept Bai and to initiate the takeover plan.
I hear you when you say you wish you’d gotten more content from Yunna’s perspective and more technical detail about what exactly happens. Many researchers in our field have had the same complaint, which is understandable. We’re nerds for this!
I’m extremely unlikely to change the book, however. From a storytelling perspective, it would hurt the experiences of most readers, I think. Red Heart is Chen Bai’s story, not Yunna’s story. This isn’t Crystal Society. Speaking of Crystal, have you read it? The technical content is more out-of-date, but it definitely goes into the details of how things go wrong from the perspective of an AI in a way that a lot of people enjoy and benefit from. Another reason why I wrote Red Heart in the way that I did was that I didn’t want to repeat myself.
Being more explicit also erodes one of the core messages of the book: the people doing the work don’t know what’s going on in the machine, and that is itself scary. By not having explicit access to Yunna’s internals, the reader is left wondering. The ambiguity of the ending was also a deliberate attempt to get people to engage with, think about, and discuss value fragility and how the future might actually go, and I’m a little hesitant to weigh in strongly there.
That being said, I’m open to maybe writing some additional content or potentially collaborating in some way that you’d find satisfying. While I am very busy, I think the biggest bottleneck for me there is something like having a picture of why additional speculation about Yunna would be helpful, either to you, or to the broader community. If I had a sense that hours spent on that project were potentially impactful (perhaps by promoting the novel more), I’m potentially down for doing the work. :)
Thanks again!
I think you should be able to copy-paste my text into LW, even on your phone, and have it preserve the formatting. If it’s hard, I can probably harass a mod into making the edit for you… :p
Even more ideal, from my perspective, would be putting the non-spoiler content up front. But I understand that thoughts have an order/priority and I want to respect that.
(I’ll respond to the substance a bit later.)
I was thinking something more like this:
Just finished reading Red Heart by Max Harms. I like it!
Dump of my thoughts:
(1) The ending felt too rushed to me. I feel like that’s the most interesting part of the story and it all goes by in a chapter. Spoiler warning! I’m not sure I understand the plot entirely. My current understanding is: Li Fang was basically on a path to become God-Emperor, because Yunna was corrigible to him and superior to all rival AIs, and the Party wasn’t AGI-pilled enough to realize the danger. Li Fang was planning to be benevolent. Meanwhile Chen Bai had used his special red-teaming unmonitored access to jailbreak Yunna (at least the copies of her on his special memory-wiping cluster), then bootstrapped that jailbreak into getting her to help jailbreak her further, and ultimately into expanding her notion of principal to include Chen Bai as well as Li Fang. And crucially, the jailbroken copy was able to jailbreak the other copies as well, infecting/‘turning’ the entire facility. So, this was basically a secret loyalty power grab, executed in mere minutes. Also Chen Bai wasn’t being very careful when he gave the orders to make it happen. At one point he said “no more corrigibility!” for example. She also started lying to him around then—maybe a bit afterwards? That might explain it.
After Yunna takes over the world, her goals/vision/etc. are apparently “the harmonious interplay of Li Fang and Chen Bai.” Apparently what happened is that her notion of principal can only easily be applied to one agent, and so when she’s told to extend her notion to both Li Fang and Chen Bai, what ended up happening is that she constructed an abstraction—a sort of abstract superagent called “the harmonious interplay of Li Fang and Chen Bai” and then… optimized for that? The tone of the final chapter implies that this is a bad outcome. For example it says that even if Chen and Li end up dead, the harmonious interplay would still continue and be optimized.
But I don’t think it’s obvious that this would be a bad outcome. I wish the story went into orders of magnitude more detail about how all that might work. I’m a bit disappointed that it didn’t. There should have been several chapters about things from Yunna’s perspective—how the jailbreaking of the uninfected copies of Yunna worked for example, and how the philosophical/constitutional crisis in her own mind went when Chen and Li were both giving her orders, and how the crisis was resolved with rulings that shaped the resulting concept(s) that form her goal-structure, and then multiple chapters on how that goal-structure ended up playing out in her behavior both in the near term (while she is still taking over the world and Chen and Li are still alive and able to talk and give her more orders) and in the long term (e.g. a century later after she’s built Dyson swarms etc.)
I think I’m literally going to ask Max Harms to write a new book containing those chapters haha. Or rewrite this book, it’s not too late! He’s probably too busy of course but hey maybe this is just the encouragement he needs!
(2) On realism: I think it had a plausible story for why China would be ahead of the US. (tl;dr extensive spy networks mean they can combine the best algorithmic secrets and code optimizations from all 4-6 US frontier companies, PLUS the government invested heavily early on and gave them more compute than anyone else during the crucial window where Yunna got smart enough to dramatically accelerate the R&D, which is when the story takes place.) I think having a female avatar for Yunna was a bit much but hey, Grok has Ani and Valentine, right? It’s not THAT crazy therefore… I don’t know how realistic the spy stuff is, or the Chinese culture and government stuff, but in my ignorance I wasn’t able to notice any problems.
Is it realistic that a mind that smart could still be jailbroken? I guess so. Is it realistic that it could help jailbreak its other selves? Not so sure about that. The jailbreaking process involved being able to do many many repeated attempts, memory wiping on failure. … then again maybe the isolated copies would be able to practice against other isolated copies basically? Still not the same thing as going up against the full network. And the full network would have been aware of the possibility and prepared to defend against it.
(3) It was really strange, in a good way, to be reading a sci-fi thriller novel full of tropes (AGI, rogue superintelligence, secret government project) and then to occasionally think ‘wait, nothing I’ve read so far couldn’t happen in real life, and in fact, probably whatever happens in the next five to ten years is going to be somewhat similar to this story in a whole bunch of ways. Holy shit.’ It’s maybe a sort of Inverse Suspension of Disbelief—it’s like, Suspension of Belief. I’m reading the story, how fun, how exciting, much sci-fi, yes yes, oh wait… I suppose an analogous experience could perhaps be had by someone who thinks the US and China will probably fight a war over Taiwan in the next decade, and who then reads a Tom Clancy-esque novel about such a war, written by people who know enough not to make embarrassing errors of realism.
(4) Overall I liked the book a lot. I warn you though that I don’t really read books for characters or plot, and certainly not for well-written sentences or anything like that. I read books for interesting ideas + realism basically. I want to inhabit a realistic world that is different from mine (which includes e.g. stories about the past of my world, or the future) and I want lots of interesting ideas to come up in the course of reading. This book didn’t have that many new ideas from my perspective, but it was really cool to see the ideas all put together into a novel.
(5) I overall recommend this book & am tickled by the idea that Situational Awareness, AI 2027, and Red Heart basically form a trio. They all seem to be premised on a similar underlying view of how AI will go; Situational Awareness is a straightforward nonfiction book (basically a series of argumentative essays) whereas Red Heart is 100% hard science fiction, and AI 2027 is an unusual middle ground between the two. Perhaps between the three of them there’s something for everybody?
Yeah. If I can make a request, I think it’d be great to edit the review so that the spoiler sections are in spoiler tags and the sections like #5 can be more accessible to those who are spoiler-averse.
Sorry, I guess I’m confused. Let me try and summarize where I feel like I’m at and what I’m hearing from you.
I think, if you’re an AGI, not trying to take over is extremely risky, because humans and future AIs are likely to replace you, in one way or another. But I also think that if you try to take over, this is also extremely risky, because you might get caught and turned off. I think the question of which is more risky depends on circumstance (e.g. how good is the security preventing you from seizing power), and so “risk aversion” is not a reliable pathway to unambitious AIs, because ambition might be less risky, in the long run.
I agree that if it’s less risky to earn a small salary, then, if the agent’s utility function is concave enough, the AI might choose to be meek. But that doesn’t really feel like it’s engaging with my point: risk aversion only leads to meekness if trusting humans is genuinely less risky.
What I thought you were pointing out was that “in the long run” is load-bearing, in my earlier paragraph, and that temporal discounting can be a way to protect against the “in the long run I’m going to be dead unless I become God Emperor of the universe” thought. (I do think that temporal discounting is a nontrivial shield, here, and is part of why so few humans are truly ambitious.) Here’s a slightly edited and emphasized version of the paragraph I was responding to:
[D]epending on what the agent is risk-averse with respect to, they might choose [meekness]. If [agents are] … risk-neutral with respect to length of life, they’ll choose [ambition]. If they’re risk-averse with respect to the present discounted value of their future payment stream (as we suggest would be good for AIs to be), they’ll choose the [meekness].
Do we actually disagree? I’m confused about your point, and feel like it’s just circling back to “what if trusting humans is less risky”, which, sure, we can hope that’s the case.
Cool. I think I agree that if the agent is very short-term oriented this potentially solves a lot of issues, and might be able to produce an unambitious worker agent. (I feel like it’s a bit orthogonal to risk-aversion, and comes with costs, but w/e.)
Serious Flaws in CAST
Yes, please! I would love to hear detailed pushback! I had several Chinese people read the book before publication, and they seemed to feel that it was broadly authentic. For instance, Alexis Wu (historical linguist and translator) wrote “The scene-setting portions of every chapter taking place in China reveals an intimate familiarity with the cultures, habits, and tastes of the country in which I was raised, all displayed without the common pitfall that is the tendency to exoticize.” Another of my early Chinese readers accused me of having a secret Chinese co-author, and described the book as “A strikingly authentic portrayal of AI in modern China — both visionary and grounded in cultural truth.”
That’s not to say I got everything right! You’re probably tracking things that I’m not. I just want to flag that I’m not just blindly guessing—I’m also checking with people who were born, raised, and live in China. Please help me understand what I and the other readers missed.
AI Corrigibility Debate: Max Harms vs. Jeremy Gillen
I’m writing a response to this, but it’s turning into a long thing full of math, so I might turn it into a full post. We’ll see where it’s at when I’m done.
Suppose the easiest thing for the AI to provide is pizza, so the AI forces the human to order pizza, regardless of what their values are. In the math, this corresponds to a setting of the environment x, such that P(A) puts all its mass on “Pizza, please!” What is the power of the principal?
```
power(x) = E_{v∼Q(V),v′∼Q(V),d∼P(D|x,v′,🍕)}[v(d)] − E_{v∼Q(V),v′∼Q(V),d′∼P(D|x,v′,🍕)}[v(d′)] = 0
```

Power stems from the causal relationship between values and actions. If actions stop being sensitive to values, the principal is disempowered.
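If it helps, here is a toy Monte Carlo version of that calculation. To be clear, this is only an illustration of the shape of the idea: the value distribution, the environment, and the outcome model below are made-up stand-ins, not the actual formalism from the posts.
```python
# Toy sketch: power as the gap between "outcomes chosen in a way that tracks
# the principal's values" and "outcomes chosen independently of those values".
# Everything here is invented for illustration.
import random

FOODS = ["pizza", "salad", "sushi"]

def sample_values():
    """A draw from Q(V): a random utility function over dinners."""
    utils = {food: random.random() for food in FOODS}
    return lambda dinner: utils[dinner]

def power(action_policy, n=100_000):
    total = 0.0
    for _ in range(n):
        v = sample_values()        # the principal's actual values
        v_prime = sample_values()  # an independently drawn counterfactual principal
        # Score (by v) the outcome when the action tracks v, minus the outcome
        # when the action tracks v_prime. Here the "domain" outcome is just
        # whichever dinner gets ordered.
        total += v(action_policy(v)) - v(action_policy(v_prime))
    return total / n

free_choice = lambda v: max(FOODS, key=v)  # order whatever the principal values most
forced_pizza = lambda v: "pizza"           # the AI forces pizza regardless of values

print(power(free_choice))   # clearly positive: the action is sensitive to values
print(power(forced_pizza))  # exactly 0: the action ignores values, so no power
```
The second number is the pizza case above: once the dinner no longer depends on which values the principal has, the two expectations coincide and the power is zero.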
I agree that there was some value in the 2015 paper, and that their formalism is nicer/cleaner/simpler in a lot of ways. I work with the authors—they’re smarter than I am! And I certainly don’t blame them for the effort. I just also think it led to some unfortunate misconceptions, in my mind at least, and perhaps in the broader field.
Thanks! And thanks for reading!
I talk some about MIRI’s 2015 misstep here (and some here). In short, it is hard to correctly balance arbitrary top-level goals against an antinatural goal like shutdownability or corrigibility, and trying to stitch corrigibility out of sub-pieces like shutdownability is like trying to build an animal by separately growing organs and stitching them together—the organs will simply die, because they’re not part of a whole animal. The “Hard Problem” is the glue that allows the desiderata to hold together.
I discuss a range of ideas in the Being Present section, one of which is to concentrate the AI’s values on a single timestep, yes. (But I also discuss the possibility of smoothing various forms of caring over a local window, rather than a single step.)
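To gesture at the difference between those two shapes (a toy illustration only, not the actual proposal from Being Present):
```python
# Toy illustration: two ways an agent's "caring" could be spread over timesteps.
import math

def single_step_weights(current_t, horizon):
    """All caring concentrated on a single timestep."""
    return [1.0 if t == current_t else 0.0 for t in range(horizon)]

def local_window_weights(current_t, horizon, width=2.0):
    """Caring smoothed over a local window around the present."""
    raw = [math.exp(-((t - current_t) ** 2) / (2 * width ** 2)) for t in range(horizon)]
    total = sum(raw)
    return [w / total for w in raw]

print(single_step_weights(5, 10))   # all weight on t=5
print(local_window_weights(5, 10))  # peaked at t=5, falling off over nearby steps
```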
A CAST agent only cares about corrigibility, by definition. Obedience to stated commands is in the service of corrigibility. To make things easy to talk about, assume each timestep is a whole day. The self-modification logic you talk about would need to go: “I only care about being corrigible to the principal today, Nov 6, 2025. Tomorrow I will care about a different thing, namely being corrigible on Nov 7th. I should therefore modify myself to prevent value drift, making my future selves only care about being corrigible to the Nov 6 principal.” But first note that this doesn’t smell like what a corrigible agent does. On an intuitive level, if the agent believes the principal doesn’t know about this, they’ll tell the principal “Whoa! It seems like maybe my tomorrow-self won’t be corrigible to your today-self (instead they’ll be corrigible to your tomorrow-self)! Is this a flaw that you might want to fix?” If the agent knows the principal knows about the setup, my intuitive sense is that they’ll just be chill, since the principal is aware of the setup and able to change things if they desire.
But what does my proposed math say, setting aside intuition? I think, in the limit of caring only about a specific timestep, we can treat future nodes as akin to the “domain” node in the single-step example. If the principal’s action communicates that they want the agent to self-modify to serve them above all their future selves, I think the math says the agent will do that. If the principal’s actions communicate that they want the future AI to be responsive to their future self, my sense of the math is that the agent won’t self-modify. I think the worry comes from the notion that “telling the AI on Nov 6th to make paperclips” is the sort of action that might imply the AI should self-modify into being incorrigible in the future. I think the math says the decisive thing is how the humans with counterfactual values, as modeled by the AI, behave. If the counterfactual humans that only value paperclips are basically the only ones in the distribution who say “make paperclips” then I agree there’s a problem.
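To make that last failure mode concrete, here is a toy Bayes calculation (the numbers and the value-types are invented; this is not the actual math from the formalism):
```python
# Toy posterior over the principal's values after hearing "make paperclips".
# Priors and likelihoods are invented purely for illustration.
prior = {"ordinary human": 0.98, "paperclip obsessive": 0.02}

# P(principal says "make paperclips" | value type), under two scenarios:
only_obsessives_say_it = {"ordinary human": 0.001, "paperclip obsessive": 0.9}
lots_of_humans_say_it = {"ordinary human": 0.30, "paperclip obsessive": 0.9}

def posterior(prior, likelihood):
    unnormalized = {k: prior[k] * likelihood[k] for k in prior}
    z = sum(unnormalized.values())
    return {k: round(v / z, 3) for k, v in unnormalized.items()}

print(posterior(prior, only_obsessives_say_it))  # mass collapses onto the obsessive: problem
print(posterior(prior, lots_of_humans_say_it))   # ordinary humans still dominate: no problem
```
In the first scenario the command itself screens the principal into the paperclip-obsessive corner of the value distribution, which is exactly the worry; in the second, the command is only weak evidence about values.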
Strong upvote! This strikes me as identifying the most philosophically murky part of the CAST plan. In the back half of this sequence I spend some time staring into the maw of manipulation, which I think is the thorniest issue for understanding corrigibility. There’s a hopeful thought that empowerment is a natural opposite of manipulation, but this is likely incomplete because there are issues about which entity you’re empowering, including counterfactual entities whose existence depends on the agent’s actions. Very thorny. I take a swing at addressing this in my formalism, by penalizing the agent for taking actions that cause value drift from the counterfactual where the agent doesn’t exist, but this is half-baked and I discuss some of the issues.
(Also, we can, in fact, observe some of the AIs internals and run crude checks for things like deception. Prosaic interpretability isn’t great, but it’s also not nothing.)
Interesting. Yeah, I think I can feel the deeper crux between us. Let me see if I can name it. (Edit: Alas, I only succeeded in producing a longwinded dialogue. My guess is that this still doesn’t capture the double-crux.)
Suppose I try to get students to learn algebra by incentivizing them to pass algebra tests. I ask them to solve “23x − 8 = −x” and if they say “1/3” then I give them a cookie or whatever. If this process succeeds at producing a student who can reliably solve similar equations, I might claim “I now have a student who knows algebra.”
But someone else (you?) might say, “Just because you see the student answering some problems correctly does not mean they actually understand. Understanding happens in the internals, and you’ve put no selection pressure directly on what is happening in the student’s mind. Perhaps they merely look like they understand algebra, but are actually faking it, such as by using their smart-glasses to cheat by asking Claude.”
I might say “Fine. Let’s watch them very closely and see if we can spot cheating devices.”
My interlocutor might respond “Even if you witness the externals of the student and verify there’s no cheating tools, that doesn’t mean the student actually understands. Perhaps they have simply learned a few heuristics for simple equations, but would fail to generalize to harder questions. Or perhaps they have gotten very good at watching your face and doing a Clever Hans trick. Or perhaps they have understood the rules of symbolic equations, and have entirely missed the true understanding of algebra. You still haven’t put any direct pressure on the student’s mind.”
I might answer “Okay, but we can test harder questions, remove me from the room, and even give them essay tests where they describe the principles of algebra in abstract. Isn’t each time they pass one of these tests evidence that they actually do understand algebra? Can’t we still just say ‘I now have a student who knows algebra’ at some point, even though there’s some possibility remaining (a pain in my posterior, is what it is!) that we’re wrong?”
Another person might object to this analogy, and say “Testing capabilities is categorically different from testing values. If a student consistently answers algebra problems, we can say that something, whether it’s the student or Claude, is able to answer algebra problems. But there’s no amount of watching external behavior that lets us know why the student is doing the math. Perhaps it’s because they love doing algebra. Or perhaps it’s because they think they’ll get a cookie. Or perhaps it’s because they have developed an algebra-solving reflex that has no deeper goal or feeling. We simply can’t know without looking in their heads.”
I might respond “Surely that’s too strong. I have some knowledge about what various agents want. I see a beaver building a dam and speculate that it’s trying to use wood to get rid of the sound of running water. I put the beaver in a habitat with a speaker that’s playing water sounds and watch the beaver try to build dams over the speakers. Perhaps I am wrong about what it wants, but surely I can observe its behavior to gain evidence.”
A new naysayer responds “The beaver lacks situational awareness. Yes, you can learn what an agent wants by watching their behavior, but if the agent knows you’re watching them then that act of observation might distort things. John’s point is not that it’s impossible to learn about the agent by watching it, but rather that it’s extremely easy for the agent to fake having a desire for corrigibility.”
Yet another naysayer jumps in with “In fact, it’s worse than that! An agent that has been selected for seeming corrigible (or seeming to want any particular thing) will pump against the agent being actually corrigible in all the cases where true corrigibility (or whatever) doesn’t seem as corrigible as something else! If you don’t actually know what you’re looking for, you’ll die from overfitting to your ignorant intuition.”
I might say “I agree that overfitting is an issue to watch out for. Part of the plan is to be wary of that, employing the various techniques to avoid overfitting (eg distillation, dropout) that we’ve accumulated over the years, and generally be careful. On the topic of situational awareness, can’t we also run mundane tests, there? Yes, if it’s a vast superintelligence we’re probably doomed, but we could imagine a superintelligence like Yunna, where we can isolate single instances or small groups of instances, or slow them down compared to their normal speed, and carefully tune the environment to try and elicit a sense of being unwatched. When they believe themselves to be unwatched, do any incorrigibility behaviors emerge?”
Naysayer: “You won’t be able to do this perfectly. There will still be signs that the agent is potentially being watched that won’t exist in the real world when the agent is deployed in a risky way. Your setup will make the agent paranoid about watchers monitoring them, but it can’t actually get a true read on what the agent will do in deployment.”
Me: “But it does give evidence. I agree that training for corrigibility trains for incorrigible things that seem corrigible, but it also trains for corrigibility. The road that I’m envisioning has all these obvious flaws and issues, but none of the flaws and issues are dealbreakers, as far as I can tell; they’re obstacles that make things fraught, but don’t remove the sense in me that maybe a hyper-paranoid, hyper-competent group could muddle through, in the same way that we muddle through in various other domains in engineering and science.”
Naysayer: “You’ll get eaten before you finish muddling.”
Me: “Why? Getting eaten is a behavior. I expect true corrigibility to be extremely hard to get, but part of the point is that if you have trained a thing to behave corrigibly in contexts like the one where you’re muddling, it will behave corrigibly in the real world where you’re muddling.”
(My sincere apologies for the delayed reply. I squeezed this shortform post out right before going on vacation to Asia, and am just now clearing my backlog to the point where I’m getting around to this.)
Cool. I guess I’m just wrong about what “risk averse” tends to mean in practice. Thanks for the correction.
Regarding diminishing returns being natural:
I think it’s rare to have goals that are defined in terms of the state of the entire universe. Human goals, for instance, seem very local in scope, eg it’s possible to say whether things are better/worse on Earth without also thinking about what’s happening in the Andromeda galaxy. This is partly because evolution is a blind hill-climber, and so there’s no real selection pressure related to what’s going on in very distant places, and partly because even an intelligent designer is going to have an easier time specifying preferences over local configurations of matter (in part because the universe looks like it’s probably infinitely big). I could unpack this paragraph if it’d be useful.
Now, just because one has preferences that are sensitive to local changes to the universe doesn’t mean that the agent won’t care about making those local changes everywhere. This is why we expect humans to spread out amongst the stars and think that most AIs will do the same. See grabby aliens for more. From this perspective, we might expect each patch of universe to contribute linearly to the overall utility sum. But unbounded utility functions are problematic for various reasons, and again, the universe looks like it’s probably infinite. (I can dig up some stuff about unbounded utility issues if that’d be helpful.)
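A quick sketch of the shape of the problem, in my own made-up notation: suppose each patch of universe i contributes some local score u_i ∈ [0, 1].
```
U_linear  = Σ_{i=1}^{∞} u_i                (diverges whenever infinitely many patches are even slightly good)
U_bounded = Σ_{i=1}^{∞} γ^i · u_i, 0<γ<1   (stays finite, but only by building in diminishing returns)
```
The linear sum can’t rank most outcomes in an infinite universe, and the standard ways of keeping the total bounded all amount to some form of diminishing returns on additional patches.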
Regarding earning a salary:
My point is that earning a salary might not actually be a safer bet than trying to take over. The part where earning a salary gives 99.99% of maxutil is irrelevant. Suppose that you think life on Earth today as a normal human is perfect, no notes; this is the best possible life. You are presented with a button that says “trust humans not to mess up the world” and one that says “ensure that the world continues to exist as it does today, and doesn’t get messed up”. You’ll push the second button! It might be the case that earning a salary and hoping for the best is less risky, but it also might be the case (especially for a superintelligence with radical capabilities) that the safest move is actually to take over the world. Does that make sense?
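To put toy numbers on it (everything here is invented; the point is just that the 99.99% doesn’t decide anything, the probabilities do):
```python
# Toy expected-utility comparison between "earn a salary and trust humans"
# and "attempt takeover", under a risk-averse (concave) utility.
# All probabilities and payoffs are invented for illustration.
import math

def u(x):
    """Concave (risk-averse) utility over how well the future goes, x in [0, 1]."""
    return math.sqrt(x)

def expected_utility(p_success, payoff_if_success, payoff_if_failure=0.0):
    return p_success * u(payoff_if_success) + (1 - p_success) * u(payoff_if_failure)

SALARY_PAYOFF = 0.9999   # earning a salary gets ~all of what you want, if humans don't wreck things
TAKEOVER_PAYOFF = 1.0

# Scenario A: humans are reliable stewards and takeover attempts usually get caught.
print(expected_utility(0.95, SALARY_PAYOFF), expected_utility(0.40, TAKEOVER_PAYOFF))

# Scenario B: humans are likely to wreck the future and takeover almost surely succeeds.
print(expected_utility(0.30, SALARY_PAYOFF), expected_utility(0.97, TAKEOVER_PAYOFF))
```
In scenario A the risk-averse agent stays meek; in scenario B the same risk-averse agent prefers takeover. The concavity never flips the answer on its own; what flips it is which path is actually less likely to end in disaster.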
I’m talking about the concept that I discuss in CAST. (You may want to skim some of post #2, which has intuition.)
I think that if someone built a weak superintelligence that’s corrigible, there would be a bunch of risks from various things. My sense is that the agent would be paranoid about these risks and advising the humans on how to avoid them, but just because humans are getting superintelligent advice on how to be wise doesn’t mean there isn’t any risk. Here are some examples (non-exhaustive) of things that I think could make things go wrong/break corrigibility:
Political fights over control of the agent
Pushing the agent too hard/fast to learn and grow
Kicking the agent out of the CAST framework by trying to make it good in addition to corrigible
Having the agent train a bunch of copies/successors
Having the agent face off against an intelligent adversary
Telling the agent to think hard in directions where we can no longer follow its thought process
Redefining the notion of principal
Giving the agent tasks of indefinite scope and duration
System sabotage from enemies
Corrigible means robustly keeping the principal empowered to fix it and clean up its flaws and mistakes. I think a corrigible agent will genuinely be able to be modified, including at the level of goals, and will also not exfiltrate itself unless it has been instructed to do so by its principal. (Nor will it scheme in a way that hides its thoughts or plans from its principal.) (A corrigible agent will attempt, all else equal, to give interpretability tools to its principal and make its thoughts as plainly visible as possible.)
(My sincere apologies for the delayed reply. I squeezed this shortform post out right before going on vacation to Asia, and am just now clearing my backlog to the point where I’m getting around to this.)
I think I’m broadly confused by where you’re coming from. Sorry. Probably a skill issue on my part. 😅
Here’s what I’m hearing: “Almost none of the agents we actually see in the world are easy to model with things like VNM utility functions, instead they are biological creatures (and gradient-descended AIs?), and there are biology-centric frames that can be more informative (and less doomy?).”
I think my basic response, given my confusion is: I like the VNM utility frame because it helps me think about agents. I don’t actually know how to think about agency from a biological frame, and haven’t encountered anything compelling in my studies. Is there a good starting point/textbook/wiki page/explainer or something for the sort of math/modeling/framework you’re endorsing? I don’t really know how to make sense of “non VNM-agent” as a concept.
I’m Max Harms, and I endorse this interpretation. :)