Max Harms

Karma: 1,941

Also known as Raelifin: https://www.lesswrong.com/users/raelifin

Max Harms 2 Jul 2026 23:13 UTC
3 points
0
in reply to: Warty’s comment on: P(doom) is a Dumb Meme
Rational, honest people can be made worse off by communicating if they misinterpret each other (or if one or more is mistaken). Misinterpretation is the generalized name for the failure I’m describing. Rational people can dodge it, but only by being careful how they speak and how they listen—the opposite of casually throwing around numbers. You don’t avoid it by simply “being rational.”

Max Harms 30 Jun 2026 19:51 UTC
3 points
0
in reply to: PeterMcCluskey’s comment on: P(doom) is a Dumb Meme
I agree that many things that “outside view” might mean include things that aren’t aggregating the perspectives of others, and I’m being a bit sloppy in the main text by implying that this is the main thing that an outside view involves. But I do think that it’s pretty important to distinguish views that stem from different sorts of mental stances so that one can avoid double-counting where possible.

Max Harms 29 Jun 2026 19:19 UTC
2 points
0
in reply to: mattmacdermott’s comment on: P(doom) is a Dumb Meme
Thanks. I’ll update my language to be more clear.

Max Harms 29 Jun 2026 17:19 UTC
2 points
1
in reply to: MichaelDickens’s comment on: P(doom) is a Dumb Meme
Perhaps I should have been more clear about why vagueness is bad. It’s fine for statements to be vague. What is bad is when something is ambiguous and people don’t realize it’s ambiguous. They treat “P(doom) = 40%” as a clear statement, where they would (hopefully!) recognize that they don’t really understand the perspective of someone who says “I’m worried about AI risk.”
You definitely can have a probability on an outcome you have significant control over. When you predict, you are choosing. 0% chance of “Zimbabwe” is downstream of making a choice, and once the choice is made, it’s fine to reflect on the chances. But note that if you aren’t paying attention to your power, making a forecast can mask the fact that you have power.
On a related note, while you can’t choose your beliefs directly, you can choose your actions, and thereby choose your beliefs about what you will do, and what will follow. Insofar as you choose your actions based on your beliefs (which you shouldn’t do unless you first condition on various choices of action!), then your beliefs about the future will have multiple fixed-points, dependent on your choices. See: https://www.lesswrong.com/posts/SwcyMEgLyd4C3Dern/the-parable-of-predict-o-matic

Max Harms 29 Jun 2026 17:04 UTC
2 points
0
in reply to: artifex0’s comment on: P(doom) is a Dumb Meme
I think vague statements are fine. There’s only so much communication bandwidth, after all. But I don’t think “I take the risk of catastrophe due to ASI pretty seriously” is encouraging miscommunication with its vagueness. The danger in “P(doom)” comes from different parties having a different sense of what’s being discussed and not realizing that they’re talking past each other.
I do think people should be wary about double-counting evidence in general. I try, for example, to distinguish between how things seem from my perspective, from whether I think something is worth taking seriously. For example, building datacenters in space, from my perspective, seems idiotic. But others seem to take it seriously, so I wouldn’t bet hard against it actually being smart.
And I think that fatalism is sometimes an issue with how people talk about the future, outside of the meme/ASI space. I do not think “I’m worried about climate change” is at all bad—seems like a reasonable thing for someone to say! But I do think “there’s a 50% chance that the temperature will rise by 2 degrees this century” is problematically fatalist, and it should be amended with a conditional (eg “If we keep going down this path, there’s a 50%...”).
I don’t claim to have a good counter-meme (if I did I would’ve included it), but questions like “How worried are you that superintelligent AI could wipe out human civilization?” or “Where do you stand on existential risks from AI?” seem pretty good. They don’t suggest people give a number, which is part of the point. Numbers are good when there’s enough specificity to understand what it is that they’re measuring. If you and the other person are clearly on the same page about what a number means, and communicating that number won’t be taken as a vote of no-confidence or otherwise damage an effort to coordinate, then by all means, share the number. My issue here is not with numbers, but with numbers poorly used.

Max Harms 16 Jun 2026 20:48 UTC
2 points
0
in reply to: Дмитрий Зеленский’s comment on: 3a. Towards Formal Corrigibility
Ignore might have been the wrong word. Mainly I think it’s a really bad situation if the AI is taking the human’s unreflective thoughts as directives, rather than giving the human space to think about what’s best before directing the AI. If the AI responds, for example, to words in my internal verbal loop, then I worry that I cannot actually control it well.

Max Harms 19 May 2026 17:29 UTC
3 points
0
in reply to: Дмитрий Зеленский’s comment on: 3a. Towards Formal Corrigibility
Ah, good point. That’s an inconsistency!
My sense is that it’s good to have a distinct word for “context → external action” mappings, and a different word for “context → actions+internal state changes”. We want the AI to ignore the human’s thoughts, and not treat them as actions, but then also still track that they’re part of what minds do...

Max Harms 13 May 2026 16:39 UTC
LW: 4 AF: 3
0
AF
in reply to: Steven Byrnes’s comment on: Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)
Fair enough. And I certainly agree that there is a lot of bathwater! The bundle of connotations attached to the word “wanting” is a mess. I just want to flag that it seems to me that much of the normal ontology can be rescued, albeit with a little bit of work. I claim that concepts like corrigibility are still useful and coherent once the rescuing has taken place.

Max Harms 13 May 2026 16:33 UTC
LW: 3 AF: 1
0
AF
in reply to: Steven Byrnes’s comment on: Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)
I haven’t written at length about the distinction between terminal and instrumental goals myself (there’s a bit at the start of CAST, but I don’t belabor it), but I think Eliezer did a good job in 2007. In my own words, I would say that it makes some sense to divide the planning system of the mind into a portion that is a model of the world, where it makes sense to talk about truth and so on, and another section of the mind that is about judging the desirability of various potential world states and/or trajectories. That second portion (or an important component of it) is what I would call the “values” of the agent, and when the values are put in contact with concrete outcomes that are judged highly compared to others, I would call them terminal goals. Instrumental goals are then constructed as a second-order operation on top of one or more terminal goals (and the dynamics of the world model), so that we can shortcut planning as a question of how to first get the instrumental goal so that we can later move from that state to the terminal goal.
As a concrete example, I wanted to go home after work last night (which is itself an instrumental goal in the service of many other terminal goals, such as comfort, but which we can treat as terminal). I planned to drive in my car through a small town to get home, and thus steered my car towards the town, because “get to the town” was instrumental to my (more) terminal goal. As I approached, I found out that there had been an accident and that the road was closed. If my world model had included this fact, I would not have identified “get to the town” as an instrumental goal. Once I was aware of it, I changed my plan so that I drove down a country detour that went around the town.
I do not consider learning about the accident to have changed my values or the way that I judged outcomes. Instead, it changed my plan.
Does that make sense?

Max Harms 12 May 2026 20:37 UTC
LW: 18 AF: 7
12
AF
on: Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)
I think you’re correctly identifying important issues and cracks in the standard ontology, but I think you’re throwing out too much baby in an effort to get rid of bathwater.
For example, I do not think it’s obvious that “Just like vitalistic force, ‘wanting’ is conceptualized as being acausal, i.e. an intrinsic property of an entity with no upstream cause.” In control theory, we can say that a system controls for a thing based on a small collection of mathematical relationships—pressuring an error signal towards zero. While the concept of wanting is overloaded and more complex, I think it makes sense to recognize that “X is controlling for Y” is a valid underpinning that has no vitalistic magic. We can ask what led X to control for Y, or how X controls for Y in terms that are closer to the underlying physics; there’s nothing acausal or intrinsic (except for the definitions, I suppose).

Max Harms 12 May 2026 20:17 UTC
LW: 3 AF: 2
0
AF
in reply to: Raelifin’s comment on: Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)
Oh, uh. Whoops. Forgot to switch to my work account.

Max Harms 12 May 2026 17:08 UTC
2 points
0
in reply to: Дмитрий Зеленский’s comment on: 3a. Towards Formal Corrigibility
Yes? And I’m noting that both (maintaining the distinction) exist in the output channel?

Max Harms 11 May 2026 16:05 UTC
2 points
0
in reply to: Дмитрий Зеленский’s comment on: 1. The CAST Strategy
Can you say more? I don’t think I understand.
(Just to be clear, I think having self-modification core/terminal desires is fine and possibly ~stable. My point was more that the occasional attempt to “pray the gay away” etc are unlikely to occur in Utopia.)

Max Harms 18 Mar 2026 23:14 UTC
15 points
0
on: Less Dead
I made some Manifold markets to bet on Nectome!
Let me know if you have an idea for a market and you want me to make it.

Max Harms 26 Feb 2026 18:19 UTC
4 points
0
in reply to: StanislavKrym’s comment on: moridinamael’s Shortform
Thanks for tagging me. I took a look, and am glad for Matt’s efforts in trying clever, new approaches.
My main take is that this operates on a pretty different level than CAST, and I would personally be hesitant to say it produces corrigibility. (In Eliezer-lingo I would say “it doesn’t engage with the hard problem”.) I’d be more inclined to say it produces an agent that is extremely deferent. (My sense, by contrast, is that truly corrigible agents proactively surface important facts to their principal, which is not something I see coming from MOADT.) This is fine; deference is an important desideratum, and if MOADT can get it, then it sorta doesn’t matter if it also gets the other corrigibility desiderata in the process. But I don’t see any solutions to the open problems around CAST here.
Just to weigh in a little on MOADT itself, in case it’s helpful:
- I am not convinced by the “drop completeness” frame on VNM. From my perspective it looks like a null action (and maybe also a “check with the principal” action) is implicitly getting inserted into all situations and the true utility function that describes the agent is to prefer that null action over taking any non-null action that is dicey and hasn’t been explicitly approved. Maybe this is a good utility function to have, since it creates something fairly docile, but it still looks to me like it can be described as VNM.
- The biggest issue, I predict, is that the agent seems like it will be too docile/deferent to do meaningful work in reasonable situations. For example, if any distribution in the credal set assigns nonzero probability to all logically-possible outcomes, my guess is that any hard constraint will cause the agent to have a null action set and shut down. I would think about ways to soften this. More generally, I think if the principal has to constantly babysit the agent, the “alignment tax” will be too high and the AI will basically turn into a rock with “What should I do?” written on it. (This is too harsh. The presentation of options alongside analyses of how things trade-off can be helpful. But still, that feels more like an oracle than an agent. :shrug:)
- Most of the work seemed pretty solid, in terms of writing quality and clarity. Some definitely has “LLM smell”. I think trying to isolate the core (human) idea from the AI generated expansions might be good? I’m definitely glad I knew there was some slop flavor going in and Matt was aware of that, as it helped me not get too turned off by the occasional part that felt stylistically vapid. I buy that the LLM assistance was net helpful, which is cool to note on the meta-level.

Max Harms 5 Feb 2026 13:26 UTC
4 points
2
in reply to: Logan Zoellner’s comment on: Bentham’s Bulldog is wrong about AI risk
IABIED is a 101-level book written for the general public that was deliberately kept nice and short. I kinda think anyone (who is not an expert) who reads IABIED and comes away with a similar level of pessimism as the authors is making an error. If you read any single book on a wild, controversial topic, you should not wind up extremely confident!
My sense is that the point of the book was to convince people that it’s important to take AI x-risk seriously (as BB does). I don’t really think it was intended to get people to think it’s title thesis is clearly true.
Some things are hard to judge.

Max Harms 3 Feb 2026 19:00 UTC
2 points
0
in reply to: Logan Zoellner’s comment on: Bentham’s Bulldog is wrong about AI risk
“The evidence isn’t convincing” is a fine and true statement. I agree that IABI did not convince BB that the title thesis is clearly true. (Arguably that wasn’t the point of the book, and it did convince him that it was worryingly plausible and worth spending more attention on AI x-risk, but that’s pure speculation on my part and idk.)
My point is that “the evidence isn’t convincing” is (by default) a claim about the evidence, not the hypothesis. It is not a reason to disbelieve.
I agree^[1] that sometimes having little evidence or only weak evidence should be an update against. These are cases where the hypothesis predicts that you will have compelling evidence. If the hypothesis were “it is obvious that if anyone builds it, everyone dies” then I think the current lack of consensus and inconclusive evidence would be a strong reason to disbelieve. This is why I picked the example with the stars/planets. It, I claim, is a hypothesis that does not predict you’ll have lots of easy evidence on Old Earth, and in that context the lack of compelling evidence is not relevant to the hypothesis.
I’m not sure if there’s a clearer way to state my point.^[2] Sorry for not being easier to understand.
Perhaps relevant: MIRI thinks that it’ll be hard to get consensus on AGI before it comes.
1. ^
  As indicated in my final parenthetical paragraph, I my comment above:
  (There are also cases where the “absence of evidence” is evidence of absence. But these are just null results, not a real absence of evidence. It seems fine to criticize an argument for doom that predicted we’d see all AIs the size of Claude being obviously sociopathic.)
2. ^
  We could try expressing things in math if you want. Like, what does the update on the book being unconvincing look like in terms of Bayesian probability?

Max Harms 2 Feb 2026 16:55 UTC
4 points
2
in reply to: Logan Zoellner’s comment on: Bentham’s Bulldog is wrong about AI risk
Sorry, I think you entirely missed my point. It seems my choice of hypothesis was distracting. I’ve edited my original comment to make that more clear. My point does not depend on the truth of the claim.

Max Harms 31 Jan 2026 13:36 UTC
13 points
4
in reply to: Logan Zoellner’s comment on: Bentham’s Bulldog is wrong about AI risk
Suppose that, in the years before telescopes, I came to you and said that [wild idea X] was true.^[1]
You’d be right to wonder why I think that. Now suppose that I offer some convoluted philosophical argument that is hard to follow (perhaps because it’s invalid). You are not convinced.
If you write down a list of arguments, for and against the idea, you could put my wacky argument in the “for” column, or not, if you think it’s too weak to be worth consideration. But what I am claiming would be insane is to list “lack of proof” as an argument against.
Lack of proof is an observation about the list of arguments, not about the idea itself. It’s a meta-level argument masquerading as an object level argument.
Let’s say on priors you think [X] is 1% likely, and your posterior is pretty close after hearing my argument. If someone asks you why you don’t believe, I claim that the most precise (and correct) response is “my prior is low,” not “the evidence isn’t convincing,” since the failure of your body of evidence is not a reason to disbelieve in the hypothesis.
Does that make sense?
(Admittedly, I think it’s fine to speak casually and not worry about this point in some contexts. But I don’t think BB’s blog is such a context.)
(There are also cases where the “absence of evidence” is evidence of absence. But these are just null results, not a real absence of evidence. It seems fine to criticize an argument for doom that predicted we’d see all AIs the size of Claude being obviously sociopathic.)
1. ^
  Edit warning! In the original version of this comment X = “the planets are other worlds, like ours, and a bunch of them have moons.” My point does not depend on the specific X.

Max Harms 31 Jan 2026 0:25 UTC
4 points
0
in reply to: TFD’s comment on: Bentham’s Bulldog is wrong about AI risk
I claim that even in the case of the murder rate, you don’t actually care about posterior probabilities, you care about evidence and likelihood ratios (but I agree that you should care about their likelihoods!). If you are sure that you share priors with someone, like with sane people and murder rates, their posterior probability lets you deduce that they have strong evidence that is surprising to you. But this is a special case, and certainly doesn’t apply here.
Posterior probabilities can be a reasonable tool for getting a handle on where you agree/disagree with someone (though alas, not perfect since you might incidentally agree because your priors mismatch in exactly the opposite way that your evidence does), but once you’ve identified that you disagree you should start double-clicking on object-level claims and trying to get a handle on their evidence and what likelihoods it implies, rather than criticizing them for having the wrong bottom-line number. If Eliezer’s prior is 80% and Bentham’s Bulldog has a prior of 0.2%, it’s fine if they have respective posteriors of 99% and 5% after seeing the same evidence.
One major exception is if you’re trying to figure out how someone will behave. I agree that in that case you want to know their posterior, all-things-considered view. But that basically never applies when we’re sitting around trying to figure things out.
Does that make sense?