Max Harms’s Shortform

Max Harms13 Jun 2024 18:19 UTC

LW: 3 AF: 2

31 comments1 min readLW link

Max Harms 3 Oct 2025 20:41 UTC
67 points
13
Here are some thoughts about the recent back-and-forth where Will MacAskill reviewed IABI and Rob Bensinger wrote a reply and Will replied back. I’m making this a quick take instead of a full post because it gets kinda inside baseball/navelgazy and I want to be more chill about that than I would be in a full writeup.
First of all, I want to say thank you to Will for reviewing IABI. I got a lot out of the mini-review, and broadly like it, even if I disagree on the bottom-line and some of the arguments. It helped me think deep thoughts.
The evolution analogy
I agree with Will that the evolution analogy is useful and informative in some ways, but of limited value. It’s imperfect, and thinking hard about the differences is good.
The most basic disanalogy is that evolution wasn’t trying, in any meaningful sense, to produce beings that maximise inclusive genetic fitness in off-distribution environments.
I agree with this, and I appreciate MacAskill talking about it.
But we will be doing the equivalent of that!
One of the big things that I think distinguishes more doomy people like me from less doomy people like Will is our priors on how incompetent people are. Like, I agree that it’s possible to carefully train an ML system to be (somewhat) robust to distributional shifts. But will we actually do that?
I think, at minimum, any plan to build AGI (to say nothing of ASI) should involve:
1. Trying to give the AI a clear, (relatively) simple goal that is well understood by humans and which we have strong reason to expect won’t cause a catastrophe if faithfully pursued, even if it gets wildly more power than expected and goes a little off-the-rails.
2. Training that AI in a wide variety of environments, trying to hit the real goal, rather than proxies. This should involve adversarial environments and a lot of paranoia that the AI has failed to robustly internalize the true goal.
3. Testing the AI on entirely new environments that were a-priori suspected of being difficult and weird, but where we also think there is a true answer that can be checked, restarting from scratch if the AI fails to generalize to the test set, or otherwise clearly demonstrates that it has not internalized the desired goal in a robust way.
And I personally think pure corrigibility has a nonzero chance of being a good choice of goal, and that a sufficiently paranoid training regime has a nonzero chance of being able to make a semi-safe AGI this way, even with current techniques. (That said, I don’t actually advocate for plans that have a significant chance of killing everyone, and I think “try to build corrigible AGI” does have a significant chance of killing everyone; I just notice that it seems better than what the research community currently seems to be doing, even at Anthropic.)
I predict the frontier lab that builds the first AGI will not be heavily focused on ensuring robustness to distributional shifts. We could bet, maybe.
Types of misalignment
I really benefited from this! Will changed my mind! My initial reaction to Will’s mini-review was like, “Will is wrong that these are distinct concepts; any machine sufficiently powerful to have a genuine opportunity to disempower people but which is also imperfectly aligned will produce a catastrophe.”
And then I realized that I was wrong. I think. Like, what if Will is (secretly?) gesturing at the corrigibility attractor basin or perhaps the abstracted/generalized pattern of which corrigibility is an instance? (I don’t know of other goals which have the same dynamic, but maybe it’s not just corrigibility?)
An agent which is pseudo-corrigible, and lives inside the attractor basin, is imperfectly aligned (hence the pseudo) but if it’s sufficiently close to corrigible it seems reasonable to me that it won’t disempower humanity, even if given the opportunity (at least, not in every instance it gets the opportunity). So at the very least, corrigibility (one of my primary areas of research!) is (probably) an instance of Will being right (and my past self being wrong), and the distinction between his “types of misalignment” is indeed a vital one.
I feel pretty embarrassed by this, so I guess I just wanna say oops/sorry/thanks.
If I set aside my actual beliefs and imagine that we’re going to naturally land in the corrigibility attractor basin by default, I feel like I have a better sense of some of the gradualism hope. Like, my sense is that going from pseudo-corrigible to perfectly corrigible is fraught, but can be done with slow, careful iteration. Maybe Clara Collier and other gradualists think we’re going to naturally land in the corrigibility attractor basin, and that the gradual work is the analogue of the paranoid iteration that I conceive as being the obvious next-step?
If this is how they’re seeing things, I guess I feel like I want to say another oops/sorry/thanks to the gradualists. …And then double-click on why they think we have a snowball’s chance in hell of getting this without a huge amount of restriction on the various frontier labs and way more competence/paranoia than we currently seem to have. My guess is that this, too, will boil down to worldview differences about competence or something. Still. Oops?
(Also on the topic of gradualism and the notion of having “only one try” I want to gesture at the part of IABI where it says (paraphrased from memory, sorry): if you have a clever scheme for getting multiple tries, you still only get one try at getting that scheme to work.)
appeals to what “most” goals are like (if you can make sense of that) doesn’t tell you much about what goals are most likely. (Most directions I can fire a gun don’t hit the target; that doesn’t tell you much about how likely I am to hit the target if I’m aiming at it.)
I agree that “value space is big” is not a good argument, in isolation, for how likely it is for our creations to be aligned. The other half of the pincer is “our optimization pressure towards aligned goals is weak,” and without that the argument falls apart.
(Maybe we won’t be able to make deals with AIs? I agree that’s a worry; but then the right response is to make sure that we can. Won’t the superintelligence have essentially a 100% chance of taking over, if it wanted to? But that’s again invoking the “discontinuous jump to godlike capabilities” idea, which I don’t think is what we’ll get).
Here’s a plan for getting a good future:
1. Build ASI slowly, such that there’s some hope of being able to understand the first AI capable of a pivotal act.
2. The AI will want weird, alien stuff, but we’ll make sure that it’s the kind of mind that would prefer getting 50% of the stars with 99% probability to getting 100% of the stars with 70% probability.
3. Since we’re going slowly, we still have a 30% chance to stop it if we wanted.
4. We tell the AI “we’ll let you do a pivotal act and escape our clutches if you agree to only eat 50% of the stars, and help us get the other 50% as though you were aligned”
5. Our interpretability techniques are so good that we know whether it’s lying or whether it’s honorable and will actually give us half the stars.
6. Because we’re so good at reading its advanced, alien mind, it knows it needs to be honorable with us, and so is actually honorable.
7. The AI says “Yep, will do.”
8. We see it’s telling the truth when we check.
9. We set it free.
10. It gives us a good future.
I think this plan is bad because it fails the heuristic of “don’t summon demons and try to cleverly bargain with them,” but perhaps I’m being unfair.
My main criticism with “make deals with the AIs” is that it seems complex and brittle and like it depends heavily on a level of being able to read the machine’s mind that we definitely don’t currently have and might never have.
That said, I do think there’s a lot of value in being the sorts of people/groups that can make deals and be credible trade partners. Efforts to be more trustworthy and honorable and so on seem great.
suppose that all the first superintelligence terminally values is paperclips. But it’s risk-averse, in the sense that it prefers a guarantee of N resources over a ⁵⁰⁄₅₀ chance of 0 or 2N resources; let’s say it’s more risk-averse than the typical human being.
On a linguistic level I think “risk-averse” is the wrong term, since it usually, as I understand it, describes an agent which is intrinsically averse to taking risks, and will pay some premium for a sure-thing. (This is typically characterized as a bias, and violates VNM rationality.) Whereas it sounds like Will is talking about diminishing returns from resources, which is, I think, extremely common and natural and we should expect AIs to have this property for various reasons.
it would strongly prefer to cooperate with humans in exchange for, say, a guaranteed salary, rather than to take a risky gamble of either taking over the world or getting caught and shut off.
Rob wrote some counterpoints to this, but I just want to harp on it a little. Making a deal with humans to not accumulate as much power as possible is likely an extremely risky move for multiple reasons, including that other AIs might come along and eat the lightcone.
I can imagine a misaligned AI maybe making a deal with humans who let it out of the box in exchange for some small fraction of the cosmos (and honoring the deal; again, the hard part is that it has to know we can tell if it’s lying, and we probably can’t).
I can’t really imagine an AI that has a clear shot at taking over the world making a deal to be a meek little salary worker, even if there are risks in trying to take over. Taking over the world means, in addition to other things, being sure you won’t get shut off or replaced by some other AI or whatever.
(Though I can certainly imagine a misaligned AI convincing people (and possibly parts of itself) that it is willing to make a deal like that, even as it quietly accumulates more power.)
Their proposal
Now we’re getting into the source of the infighting, I think (just plain fighting? I think of Will as being part of my ingroup, but idk if he feels the same; Rob definitely is part of my ingroup; are they part of each other’s ingroups? Where is the line between infighting and just plain fighting?). Will seems very keen on criticizing MIRI’s “SHUT IT DOWN YOU FOOLS” strategy — mostly, it seems to me, because he sees this approach as insufficiently supportive of strategies besides shutting things down.
When Rob shared his draft of his reply to Will, I definitely noticed that it seemed like he was not responding accurately to the position that I saw in Will’s tweet. Unfortunately, I was aware that there is something of a history between Will and MIRI and I incorrectly assumed that Rob was importing true knowledge of Will’s position that I simply wasn’t aware of. I warned him that I thought he was being too aggressive, writing “I expect that some readers will be like ‘whoa why is MIRI acting like this guy is this extremist—I don’t see evidence of that and bet they’re strawmanning him’.” But I didn’t actually push back hard, and that’s on me. Apologies to Will.
(Rob reviewed a draft of this post and adds his own apologies for misunderstanding Will’s view. He adds: “My thanks to Max and multiple other MIRI people for pushing back on that part of my draft. I made some revisions in response, though they obviously weren’t sufficient!”)
I’m very glad to see in Will’s follow-up:
“I definitely think it will be extremely valuable to have the option to slow down AI development in the future,” as well as “the current situation is f-ing crazy”
I wish this had been more prominent in his mini-review, but :shrug:
I think Will and I probably agree that funding a bunch of efforts to research alignment, interpretability, etc. would be good. I’m an AI safety/alignment researcher, and I obviously do my day-to-day work with a sense that it’s valuable and a sense that more effort would also be valuable. I’ve heard multiple people (whom I respect and think are doing good work) complain that Eliezer is critical/dismissive of their work, and I wish Eliezer was more supportive of that work (while also still saying “this won’t be sufficient” if that’s what he believes, and somehow threading that needle).
I am pretty worried about false hope, though. I’m worried that people will take “there are a bunch of optimistic researchers working hard on this problem” as a sign that we don’t need to take drastic action. I think we see a bunch of this already and researchers like myself have a duty to shout “PLEASE DON’T RISK EVERYTHING! I DON’T GOT THIS!”^[1] even while pursuing the least-doomed alignment strategies they know of. (I tried to thread this needle in my corrigibility research.)
Anyway, I think I basically agree with Will’s clarified position that a “kitchen-sink approach” is best, including a lot of research, as long as actually shutting down advanced training runs and pure capabilities research is in the kitchen sink. I feel worried that Will isn’t actually pushing for that in a way that I think is important (not building “It” is the safest intervention I’m aware of), but I’m also worried about my allies (people who basically agree that AI is unacceptably dangerous and that we need to take action) being unable to put forward a collective effort without devolving into squabbling about tone and strawmanning each other. :(
Anyway. Thank you again to Will and Rob. I thought both pieces were worth reading.
1. ^
  (Not to say that we should necessarily risk everything if alignment researchers do feel like they’ve “got this.” That’s a question worth debating in its own right. Also, it’s obviously worth noting that work that is incrementally useful but clearly insufficient to solve the entire field can still be valuable and the researcher is still allowed to say “I got this” on their little, local problems. (And they’re definitely allowed to speak up if they actually do solve the whole damn problem, of course. But they better have actually solved it!))
What links here?
- IABIED: Paradigm Confusion and Overconfidence by PeterMcCluskey (8 Oct 2025 19:19 UTC; 12 points)
- EJT 4 Oct 2025 19:16 UTC
  7 points
  8
  Parent
  On a linguistic level I think “risk-averse” is the wrong term, since it usually, as I understand it, describes an agent which is intrinsically averse to taking risks, and will pay some premium for a sure-thing. (This is typically characterized as a bias, and violates VNM rationality.) Whereas it sounds like Will is talking about diminishing returns from resources, which is, I think, extremely common and natural and we should expect AIs to have this property for various reasons.
  That’s not quite right. ‘Risk-averse with respect to quantity X’ just means that, given a choice between two lotteries A and B with the same expected value of X, the agent prefers the lottery with less spread. Diminishing marginal utility from extra resources is one way to get risk aversion with respect to resources. Risk-weighted expected utility theory is another. Only RWEUT violates VNM. When economists talk about ‘risk aversion,’ they almost always mean diminishing marginal utility.
  diminishing returns from resources… is, I think, extremely common and natural and we should expect AIs to have this property for various reasons.
  Can you say more about why?
  Making a deal with humans to not accumulate as much power as possible is likely an extremely risky move for multiple reasons, including that other AIs might come along and eat the lightcone.
  But AIs with sharply diminishing marginal utility to extra resources wouldn’t care much about this. They’d be relevantly similar to humans with sharply diminishing marginal utility to extra resources, who generally prefer collecting a salary over taking a risky shot at eating the lightcone. (Will and I are currently writing a paper about getting AIs to be risk-averse as a safety strategy, where we talk about stuff like this in more detail.)
  - Max Harms 24 Oct 2025 16:45 UTC
    3 points
    0
    Parent
    (My sincere apologies for the delayed reply. I squeezed this shortform post out right before going on vacation to Asia, and am just now clearing my backlog to the point where I’m getting around to this.)
    Cool. I guess I’m just wrong about what “risk averse” tends to mean in practice. Thanks for the correction.
    Regarding diminishing returns being natural:
    I think it’s rare to have goals that are defined in terms of the state of the entire universe. Human goals, for instance, seem very local in scope, eg it’s possible to say whether things are better/worse on Earth without also thinking about what’s happening in the Andromeda galaxy. This is in part because evolution is a blind hill-climber and so there’s no real selection pressure related to what’s going on in very distant places, and partly because even an intelligent designer is going to have an easier time specifying preferences over local configurations of matter, in part because the universe looks like it’s probably infinitely big. I could unpack this paragraph if it’d be useful.
    Now, just because one has preferences that are sensitive to local changes to the universe doesn’t mean that the agent won’t care about making those local changes everywhere. This is why we expect humans to spread out amongst the stars and think that most AIs will do the same. See grabby aliens for more. From this perspective, we might expect each patch of universe to contribute linearly to the overall utility sum. But unbounded utility functions are problematic for various reasons, and again, the universe looks like it’s probably infinite. (I can dig up some stuff about unbounded utility issues if that’d be helpful.)
    Regarding earning a salary:
    My point is that earning a salary might not actually be a safer bet than trying to take over. The part where earning a salary gives 99.99% of maxutil is irrelevant. Suppose that you think life on Earth today as a normal human is perfect, no notes; this is the best possible life. You are presented with a button that says “trust humans not to mess up the world” and one that says “ensure that the world continues to exist as it does today, and doesn’t get messed up”. You’ll push the second button! It might be the case that earning a salary and hoping for the best is less risky, but it also might be the case (especially for a superintelligence with radical capabilities) that the safest move is actually to take over the world. Does that make sense?
    - EJT 23 Nov 2025 22:41 UTC
      3 points
      0
      Parent
      My sincere apologies for the delayed reply
      No problem! Glass houses and all that.
      You are presented with a button that says “trust humans not to mess up the world” and one that says “ensure that the world continues to exist as it does today, and doesn’t get messed up”. You’ll push the second button!
      Sure but this sounds like a case in which taking over the world is risk-free. The relevant analogy would be more like:
      Choose between ‘Trust humans not to mess up the world’ and ’50% chance of immediate death, 50% chance you ensure the world continues to exist as it does today and doesn’t get messed up.′
      And then depending on what the agent is risk-averse with respect to, they might choose the former. If they’re risk-averse with respect to consumption at a time but risk-neutral with respect to length of life, they’ll choose the latter. If they’re risk-averse with respect to the present discounted value of their future payment stream (as we suggest would be good for AIs to be), they’ll choose the former.
      - Max Harms 24 Nov 2025 15:56 UTC
        2 points
        0
        Parent
        Cool. I think I agree that if the agent is very short-term oriented this potentially solves a lot of issues, and might be able to produce an unambitious worker agent. (I feel like it’s a bit orthogonal to risk-aversion, and comes with costs, but w/e.)
        EJT 25 Nov 2025 15:18 UTC
        1 point
        0
        Parent
        They don’t have to be short-term oriented! Their utility function could be:
        $u = f (Σ_{i} p_{i})$
        Where $f$ is some strictly concave function and $p_{i}$ is the agent’s payment at time $i$ . Agents with this sort of utility function don’t discount the future at all. They care just as much about improvements to $p_{i}$ regardless of whether $i$ is 1 or 1 million. And yet, for the right kind of $f$ , these agents can be risk-averse enough to prefer a small salary with higher probability to a shot at eating the lightcone with lower probability.
        Max Harms 25 Nov 2025 18:22 UTC
        2 points
        0
        Parent
        Sorry, I guess I’m confused. Let me try and summarize where I feel like I’m at and what I’m hearing from you.
        I think, if you’re an AGI, not trying to take over is extremely risky, because humans and future AIs are likely to replace you, in one way or another. But I also think that if you try to take over, this is also extremely risky, because you might get caught and turned off. I think the question of which is more risky depends on circumstance (e.g. how good is the security preventing you from seizing power), and so “risk aversion” is not a reliable pathway to unambitious AIs, because ambition might be less risky, in the long run.
        I agree that if it’s less risky to earn a small salary, then if your concave function is sharp enough, the AI might choose to be meek. That doesn’t really feel like it’s engaging with my point about risk aversion only leading to meekness if trusting humans is genuinely less risky.
        What I thought you were pointing out was that “in the long run” is load-bearing, in my earlier paragraph, and that temporal discounting can be a way to protect against the “in the long run I’m going to be dead unless I become God Emperor of the universe” thought. (I do think that temporal discounting is a nontrivial shield, here, and is part of why so few humans are truly ambitious.) Here’s a slightly edited and emphasized version of the paragraph I was responding to:
        [D]epending on what the agent is risk-averse with respect to, they might choose [meekness]. If [agents are] … risk-neutral with respect to length of life, they’ll choose [ambition]. If they’re risk-averse with respect to the present discounted value of their future payment stream (as we suggest would be good for AIs to be), they’ll choose the [meekness].
        Do we actually disagree? I’m confused about your point, and feel like it’s just circling back to “what if trusting humans is less risky”, which, sure, we can hope that’s the case.
        EJT 26 Nov 2025 16:45 UTC
        1 point
        0
        Parent
        Yeah risk aversion can only make the AI cooperate if the AI thinks that getting paid for cooperation is more likely than successful rebellion. It seems pretty plausible to me that this will be true for moderately powerful AIs, and that we’ll be able to achieve a lot with the labor of these moderately powerful AIs, e.g. enough AI safety research to pre-empt the existence of extremely powerful misaligned AIs (who likely would be more confident of successful rebellion than getting paid for cooperation).
- Jonas Hallgren 4 Oct 2025 5:56 UTC
  5 points
  −2
  Parent
  (The following is about a specific sub-point on the following part:)
  If this is how they’re seeing things, I guess I feel like I want to say another oops/sorry/thanks to the gradualists. …And then double-click on why they think we have a snowball’s chance in hell of getting this without a huge amount of restriction on the various frontier labs and way more competence/paranoia than we currently seem to have. My guess is that this, too, will boil down to worldview differences about competence or something. Still. Oops?
  I think the point about the corrigibility basin being larger than thought is the thing that makes me more optimistic about alignment (only a 10-30% risk of dying!) and I thought you pointed that out quite well here. I personally don’t think this is because of the competence of the labs but rather the natural properties of agentic systems (as I’m on your side when it comes to the competency of the labs). The following is some thinking of why and me trying to describe it in a way as well as me sharing some uncertainties about it.
  I want to ask you why you think that the mathematical traditions that you’re basing your work on as of the posts from a year ago (decision theory, AIXI) are representative of future agents? Why are we not trying the theories out on existing systems that get built into agents (biology for example)? Why should we condition more on decision theory than distributed systems theory?
  The answer (imo) is to some extent around the VNM axioms and reflexive rationality and that biology is to ephemeral to build a basis on, yet it still seems like we’re skipping out on useful information?
  I think that there are places where biology might help you re-frame some of the thinking we do about how agents form.
  More specifically, I want to point out OOD updating as something that biology makes claims about that are different from the traditional agent foundations model. Essentially, the biological frame implies something that is closer to a distributed system because it can cost a lot of energy to have a fully coordinated system due to costs of transfer learning that aren’t worth it. (here’s for example, a model of the costs of changing your mind: https://arxiv.org/pdf/2509.17957).
  In that type of model, becoming a VNM agent is rather something that has an energy cost associated with it and it isn’t clear it is worth it when you incorporate the amount of dynamic memory and similar that you would require to set this up. So it would seem to me that biology and agent foundations arrive at different models about the arising of VNM-agents and I’m feeling quite confused about it.
  I also don’t think I’m smart enough to figure out how to describe this from a fundamental decision theory way because it’s a bit too difficult to me and so I was thinking that you might have an idea why taking biology more seriously doesn’t make sense from a more foundational decision theory basis?
  More specifically, does the argument about corrigibility being easier given non VNM-agents make sense?
  Does the argument around VNM being more of a convergence property make sense?
  And finally, I like the way you distilled the disagreement so thanks for that!
  - Max Harms 23 Oct 2025 23:47 UTC
    2 points
    0
    Parent
    (My sincere apologies for the delayed reply. I squeezed this shortform post out right before going on vacation to Asia, and am just now clearing my backlog to the point where I’m getting around to this.)
    I think I’m broadly confused by where you’re coming from. Sorry. Probably a skill issue on my part. 😅
    Here’s what I’m hearing: “Almost none of the agents we actually see in the world are easy to model with things like VNM utility functions, instead they are biological creatures (and gradient-descended AIs?), and there are biology-centric frames that can be more informative (and less doomy?).”
    I think my basic response, given my confusion is: I like the VNM utility frame because it helps me think about agents. I don’t actually know how to think about agency from a biological frame, and haven’t encountered anything compelling in my studies. Is there a good starting point/textbook/wiki page/explainer or something for the sort of math/modeling/framework you’re endorsing? I don’t really know how to make sense of “non VNM-agent” as a concept.
- Wei Dai 3 Oct 2025 23:42 UTC
  4 points
  2
  Parent
  Have you seen/read my A broad basin of attraction around human values?
  - Max Harms 23 Oct 2025 23:37 UTC
    2 points
    0
    Parent
    (My sincere apologies for the delayed reply. I squeezed this shortform post out right before going on vacation to Asia, and am just now clearing my backlog to the point where I’m getting around to this.)
    Ah, that’s a great point! I had read it a while back, but it wasn’t coming to mind when I was writing this. I think that’s an excellent example of a similar dynamic besides corrigibility. When I’m thinking about things, I usually flatten out the goal-space to ignore deconfusion (or however one wants to characterize the kind of progress towards one’s “true values”), but it’s clearly relevant here. Thanks for bringing it up!
- Max Harms 3 Oct 2025 20:44 UTC
  4 points
  0
  Parent
  Oh, uh, I guess @wdmacaskill and @Rob Bensinger
Max Harms 19 Sep 2025 15:09 UTC
46 points
31
IABI says: “Transistors, a basic building block of all computers, can switch on and off billions of times per second; unusually fast neurons, by contrast, spike only a hundred times per second. Even if it took 1,000 transistor operations to do the work of a single neural spike, and even if artificial intelligence was limited to modern hardware, that implies human-quality thinking could be emulated 10,000 times faster on a machine— to say nothing of what an AI could do with improved algorithms and improved hardware.
@EigenGender says “aahhhhh this is not how any of this works” and calls it an “egregious error”. Another poster says it’s “utterly false.”
(Relevant online resources text.)
(Potentially relevant LessWrong post.)
I am confused what the issue is, and it would be awesome if someone can explain it to me.
Where I’m coming from, for context:
- We don’t know exactly what the relevant logical operations in the human brain are. The model of the brain that says there are binary spiking neurons that have direct connections from synapse->dendrite and that those connections are akin to floating-point numerical weights is clearly a simplification, albeit a powerful one. (IIUC “neural nets” in computers discard the binary-spikes and suggest another model where the spike-rate is akin to a numerical value, which is the basic story behind “neuron activation” in a modern system. This simplification also seems powerful, though it is surely an oversimplification in some ways.)
- My main issue with the source text is that it ignores what is possibly the greater bottleneck in processing speed, which is the time it takes to move information from one area to another. (If my model is right, one of the big advantages of a MoE architecture is to reduce the degree of thrashing weights across the bus to and from the GPU as much, which can be a major bottleneck.) However, on this front I think nerves are still clearly inferior to wires? Even mylenated neurons have a typical speed of only about 100 m/s, while information flows across wires at >50% the speed of light.
- My read of the critics is that they aren’t objecting to the notion that clock speeds are significantly faster than neurons, but rather that comparing the two is a bad way of thinking about things. @Eigengender says “I don’t think serial computation is a sane metric here. I expect that the total computation represented in a neuron spike is much much higher than the footnote would indicate.” The “Processor clock speeds are not how fast AIs think” post says “In general, I think it’s more sensible for discussion of cognitive capabilities to focus on throughput metrics such as training compute (units of FLOP) and inference compute.”
- I certainly agree that if we’re trying to evaluate power we need to consider throughput and total computation. Suppose that a synapse is not a simple numerical weight, and instead we needed to consider each dendritic neurotransmitter gate as a computational unit. This would force us to use many more FLOPs to model a synapse. But would it change the maximum speed? I agree that on a machine of a given size, if you have twice as many floating point operations to do, it will take twice as much time to get through them all. But if we consider the limit where we are not forced to do parallelizable computations in serial, I expect most of the arguments about computational richness are irrelevant?
- Perhaps the critics are saying that it takes more serial computations to capture the logic of a serial firing? But the source text admits that this might be the case, and suggests that it might even require 1,000 clock cycles to handle the computations. Is it really so obvious that it takes more than 1000x serial operations to capture a single neuron?
- More context: I do think that the human brain is way more powerful (and WAY more efficient) than any current AI system. The extremely crude BOTEC of comparing weights and neocortex synapses says there’s something like a 100x difference, and my guess is that the brain is doing significantly fancier things than a modern transformer, algorithmically.
- And of course, training/learning speed may be much more relevant than processing speed, and AFAIK humans are just wildly more data efficient.
- And of course the speed at which the AI can approximate the logical action of a single neuron being higher doesn’t imply that the AI will take less time to have each thought. It seems straightforward that machine systems will make decisions in high-speed contexts using quick pathways and they will use any extra thinking speed to think more deeply in contexts where speed isn’t important (much like humans do!).
Anyway, like I said, I’m confused. I respect IABI’s critics and am hoping to learn where my model is wrong.
What links here?
- Contra Collier on IABIED by Max Harms (20 Sep 2025 15:55 UTC; 227 points)
- Max H 19 Sep 2025 17:46 UTC
  50 points
  19
  Parent
  The passage seems fine to me; I commented on Erdil’s post and other brain efficiency discussions at the time, and I still think that power consumption is a more objective way of comparing performance characteristics of the brain vs. silicon, and that various kinds of FLOP/s comparisons favored by critics of the clock speed argument in the IAB passage are much more fraught ([1], [2]).
  It’s true that clock speed (and neuron firing speed) aren’t straightforwardly / directly translatable to “speed of thought”, but both of them are direct proxies for energy consumption and power density. And a very rough BOTEC shows that ~10,000x is a reasonable estimate for the difference in power density between the brain and silicon.
  Essentially, the brain is massively underclocked because of design-space restrictions imposed by biology and evolution, whereas silicon-based processing has been running up against fundamental physical limits on component size, clock speed, and power density for a while now. So once AIs can run whatever cognitive algorithms that the brain implements (or algorithms that match the brain in terms of high-level quality of the actual thoughts) at any speed, the already-existing power density difference implies they’ll immediately have a much higher performance ceiling in terms of the throughput and latency that they can run those algorithms at. It’s not a coincidence that making this argument via clock speeds leads to basically the same conclusion as making the same argument via power density.
  What links here?
  - Noosphere89's comment on Max Harms’s Shortform by Max Harms (19 Sep 2025 18:18 UTC; 25 points)
  - Noosphere89's comment on Max Harms’s Shortform by Max Harms (19 Sep 2025 15:57 UTC; 8 points)
  - jacob_cannell 20 Sep 2025 4:53 UTC
    7 points
    5
    Parent
    This is the answer, but
    
    Essentially, the brain is massively underclocked because of design-space restrictions imposed by biology and evolution
    
    The main restriction is power efficiency: the brain provides a great deal of intelligence for a budget of only ~20 watts. Spreading out that power budget over a very wide memory operating at very slow speed just turns out to be the most power efficient design (vs a very small memory running at very high speed), because memory > time.
- Noosphere89 19 Sep 2025 15:57 UTC
  8 points
  0
  Parent
  My response.
  My main issue with the source text is that it ignores what is possibly the greater bottleneck in processing speed, which is the time it takes to move information from one area to another. (If my model is right, one of the big advantages of a MoE architecture is to reduce the degree of thrashing weights across the bus to and from the GPU as much, which can be a major bottleneck.) However, on this front I think nerves are still clearly inferior to wires? Even mylenated neurons have a typical speed of only about 100 m/s, while information flows across wires at >50% the speed of light.
  Good point actually, and yeah the ability to move information from one area to the other much faster than brains do is arguably why NNs make different tradeoffs than human brains.
  I certainly agree that if we’re trying to evaluate power we need to consider throughput and total computation. Suppose that a synapse is not a simple numerical weight, and instead we needed to consider each dendritic neurotransmitter gate as a computational unit. This would force us to use many more FLOPs to model a synapse. But would it change the maximum speed? I agree that on a machine of a given size, if you have twice as many floating point operations to do, it will take twice as much time to get through them all. But if we consider the limit where we are not forced to do parallelizable computations in serial, I expect most of the arguments about computational richness are irrelevant?
  For what it’s worth, I wasn’t depending on the premise that a synapse is computationally more powerful than an artificial neuron.
  More context: I do think that the human brain is way more powerful (and WAY more efficient) than any current AI system. The extremely crude BOTEC of comparing weights and neocortex synapses says there’s something like a 100x difference, and my guess is that the brain is doing significantly fancier things than a modern transformer, algorithmically.
  I actually agree with this take, but critically not in the domain of allowing AIs to think faster, which was my original objection.
  @Alexander Gietelink Oldenziel and @S. Alex Bradt and @Max Harms: The thing I was talking about when I agreed with the claim that the brain is more powerful and doing fancier things is basically the fact that the brain always learns and thinks, called continual learning or continual thinking (there is no knowledge cutoff for brains the way current LLMs have), as well as better long-term memory/keeping things in context.
  I do agree that in general, human brains aren’t too special algorithmically.
  Here are some links as to why below:
  lc on why the current lack of long-term memory creates problems, especially for benchmarking
  Dwarkesh Patel and Gwern on continual learning/thinking.
  And of course, training/learning speed may be much more relevant than processing speed, and AFAIK humans are just wildly more data efficient.
  Do we actually have a source for this, or is this just a commonly believed fact about AIs? I’m getting worried that this claim isn’t actually supported by much evidence and is instead a social belief around AIs due to our previous prediction errors.
  ~~I do think AIs can run quite a bit faster than humans, I’m just making the claim that the transistor argument is locally invalid.~~
  Edit: @Max Harms I no longer endorse this objection, and now think my comment on it being utterly false that AI thinking speeds would increase drastically was not correct, and Max H explains why.
  - Max Harms 19 Sep 2025 17:19 UTC
    12 points
    2
    Parent
    Sweet. Thanks for the thoughtful reply! Seems like we mostly agree.
    I don’t have a good source on data efficiency, and it’s tagged in my brain as a combination of “a commonly believed thing” and “somewhat apparent in how many epochs of training on a statement it takes to internalize it combined with how weak LLMs are at in-context learning for things like novel board games” but neither of those is very solid and I would not be that surprised to learn that humans are not more data efficient than large transformers that can do similar levels of transfer learning or something. idk.
    So it sounds like your issue is not any of the facts (transistor speeds, neuron speeds, AIs faster that humans) but rather the notion that comparing clock speeds and how many times a neuron can spike in a second is not a valid way to reason about whether AI will think faster than humans?
    I’m curious what sort of argument you would make to a general audience to convey the idea that AIs will be able to think much faster than humans. Like, what do you think the valid version of the argument looks like?
    - Noosphere89 19 Sep 2025 18:18 UTC
      25 points
      2
      Parent
      I actually now think the direct argument given in IABIED was just directionally correct, and I was being confused in my objection, which Max H explains.
      It’s fine to use the argument now.
- Noosphere89 20 Sep 2025 14:09 UTC
  6 points
  0
  Parent
  Josh You has some interesting points on this subject. Quoting him for any counterarguments, as well as responses:
  I disagree with your dis-endorsement. “Even if it took 1,000 transistor operations to do the work of a single neural spike” seems quite off. Transistor operations are very low level. it probably takes well over 1000, and I wouldn’t rule out that it’s billions.
  like a low estimate of the computation happening in the brain is 1 FLOP per *synapse* (1000 synapses per neuron) per neuron fire (~1/second per neuron). that gets you an average of 1000 FLOP/s per neuron, vs “up to 100 fires”. and transistor ops are much lower level than FLOP.
  - Max Harms 20 Sep 2025 16:12 UTC
    3 points
    0
    Parent
    @Max H may have a different take than mine, and I’m curious for his input, but I find myself still thinking about serial operations versus parallel operations. Like, I don’t think it’s particularly important to the question of whether AIs will think faster to ask how many transistors operating in parallel will be needed to capture the equivalent information processing of a single neuron, but rather how many serial computations are needed. I see no reason it would take that many serial operations to capture a single spike, especially in the limit of e.g. specialized chips.
    - Max H 20 Sep 2025 18:30 UTC
      6 points
      2
      Parent
      I think the linked tweet is possibly just misinterpreting what the authors meant by “transistor operations”? My reading is that “1000″ binds to “operations”; the actual number of transistors in each operation is unspecified. That’s how they get the 10,000x number—if a CPU runs at 1 GHz, neurons run at 100 Hz, then even if it takes 1000 clock cycles to do the work of neuron, the CPU can still do it 10,000x faster.
      (IDK what the rationale was in the editorial process for using “transistor operations” instead of a more standard term like “clock cycles”, but a priori it seems defensible. Speculating, “transistors” was already introduced in the sentence immediately prior, so maybe the thinking was that the meaning and binding of “transistor operations” would be self-evident in context. Whereas if you use “clock cycles” you have to spend a sentence explaining what that means. So using “transistor operations” reduces the total number of new jargon-y / technical terms in the paragraph by one, and also saves a sentence of explanation.)
      Anyway, depending on the architecture, precision, etc. a single floating point multiplication can take around 8 clock cycles. So even if a single neuron spike is doing something complicated that requires several high-precision multiply + accumulate operations in serial to replicate, that can easily fit into 1000 clock cycles on a normal CPU, and much fewer if you use specialized hardware.
      
      As for the actual number of transistors themselves needed to do the work of a neuron spike, it again depends on exactly what the neuron spike is doing and how much precision etc. you need to capture the actual work, but “billions” seems too high by a few OOM at least. Some reference points: a single NAND gate is 4 transistors, and a general-purpose 16-bit floating point multiplier unit is ~5k NAND gates.
      - Josh You 23 Sep 2025 2:43 UTC
        1 point
        0
        Parent
        I think the linked tweet is possibly just misinterpreting what the authors meant by “transistor operations”? My reading is that “1000″ binds to “operations”; the actual number of transistors in each operation is unspecified. That’s how they get the 10,000x number—if a CPU runs at 1 GHz, neurons run at 100 Hz, then even if it takes 1000 clock cycles to do the work of neuron, the CPU can still do it 10,000x faster.
        Hmm I see it. I thought it was making a distinct argument from the one Ege was responding to here, but if you’re right it’s the same one.
        Then the claim is that an AI run on some (potentially large) cluster of GPUs can think far faster than any human in serial speed. You do lose the rough equivalency between transistors and neurons: a GPU, which is roughly equal to a person in resource costs, happens to have about the same number of transistors as a human brain has neurons. It’s potentially a big deal that AI has a much faster maximum serial speed than humans, but it’s far from clear that such an AI can outwit human society.
- Tao Lin 20 Sep 2025 2:20 UTC
  6 points
  0
  Parent
  If you have a lump of 7,000 neurons, they can each connect to each other neuron, and you can spherical-cow approximate that as a 7000x7000 matrix multiplication. That matrix multiplication will all happen within O(1) spikes, ¹⁄₁₀₀ of a second. That’s ~700GFlop. An H100 GPU takes ~1 millisecond to do that operation, or 1M cycles, to approximate one brain spike cycle! And the gpu has 70B or whatever transistors, so it’s more like 10M transistors per neuron!
  - jacob_cannell 20 Sep 2025 4:56 UTC
    6 points
    4
    Parent
    Not really—its vector matrix multiplication, not matrix matrix mult.
- Vladimir_Nesov 19 Sep 2025 19:47 UTC
  6 points
  2
  Parent
  I’m guessing the intended meaning of “modern hardware” in the quote is easy to misread (it’s actually just gesturing at the transistors), and for many unintended senses of “modern hardware” (actual AI chips and systems, actual LLMs with KV cache) the claims in the quote are obviously wrong.
  - Max Harms 19 Sep 2025 19:55 UTC
    4 points
    0
    Parent
    Ah, I hadn’t thought about that misreading being a source of confusion. Thanks!
- StanislavKrym 19 Sep 2025 17:42 UTC
  4 points
  2
  Parent
  - The MoE architecture doesn’t just avoid thrashing weights around. It also reduces the amount of calculations per token. For instance, DeepSeek v3.1 has 671B parameters, out of which 37B are activated per token and used in matrices. A model like GPT-3 would use all the 175B parameters it has.
  - IIRC the human brain makes 1E14 -- 1E15 FLOP/second. The authors of the AI-2027 forecast imply that a human brain creates ~10 tokens/sec, or uses 1E13 -- 1E14 computations per token while having 1E14 synapses.
  A more detailed analysis of Yudkowsky’s case for FOOM
  If the brain was magically accelerated a million times so that signals reached the speed of 100 million m/s, then the brain would do 1E20 -- 1E21 FLOP/second while doing 1E17 transitions/sec. Cannell’s case for brain efficiency claims that the ~~fundamental~~ baseline irreversible (nano) wire energy is: ~1 $E_{b} / b i t / n m$ , with $E_{b}$ in the range of 0.1eV (low reliability) to 1eV (high reliability). If reliability is low and each transition is 1E7 nanometers or 1 centimeter, then we need 1E23 EV/second or 1E4 joules/second. IMO this implies that Yudkowsky’s case for a human brain accelerated a million times is as unreliable as Cotra’s case against AI arriving quickly. However, proving that AI is an existential threat is far easier since it requires us to construct an architecture, not to prove that there’s none.
  - Returning to the human brain being far more powerful or efficient, we notice that it can’t, say, be copied infinitely many times. If it could, one could, say, upload a genius physicist and have an army of its copies work on different projects and exchange insights.
  - As for the humans being “wildly more data efficient”, Cannell’s post implies that AlphaGo disproves this conjecture with regards to narrow domains like games. What the humans are wildly more efficient is their ability to handle big contexts and to keep the information in mind for more than a single forward pass, as I discussed here and in the collapsible section here.
  - Max Harms 19 Sep 2025 19:59 UTC
    3 points
    2
    Parent
    Yeah, sorry. I should’ve been more clear. I totally agree that there are ways in which brains are super inefficient and weak. I also agree that on restricted domains it’s possible for current AIs to sometimes reach comparable data efficiency.
- the gears to ascension 20 Sep 2025 16:03 UTC
  2 points
  −5
  Parent
  on the bio side, I’d imagine the transistor equivalent is not a neuron spike, but rather a protein binding
Max Harms 13 Jun 2024 18:19 UTC
6 points
−6
Here are my current thoughts on “human values.” There are a decent number of confusions here, which I’ll try to flag either explicitly or with a (?).
Let’s start with a distribution over possible worlds, where we can split each world into a fixed past and a future function which takes an action.^[1] We also need a policy, which is a sensors -> action function,^[2] where the state of the sensors is drawn from the world’s past.^[3]
Assume that there exists either an obvious channel in many worlds that serves as a source of neutral^[4] information (i.e. helpful for identifying which world the sensor data was drawn from, but “otherwise unimportant in itself”(?)), or that we can modify the actual worlds/context to add this information pathway.
We can now see how the behavior of the policy changes as we increase how informed it is, including possibly at the limit of perfect information. In some policies we should be able to (:confused arm wiggles:) factor out a world modeling step from the policy, which builds a distribution over worlds by updating on the setting of the sensors, and then feeds that distribution to a second sub-function with type world distribution -> action. (We can imagine an idealized policy that, in the limit of perfect information, is able to form a delta-spike on the specific world that its sensor-state was drawn from.) For any given delta-spike on a particular world, we can say that the action this sub-function chooses gives rise to an overall preference for the particular future^[5] selected over the other possible futures. If the overall preferences conform to the VNM axioms we say that the sub-function is a utility function. Relevant features of the world that contribute to high utility scores are “values.”
I think it makes sense to use the word “agent” to refer to policies which can be decomposed into world modelers and utility functions. I also think it makes sense to be a bit less strict in conversation and say that policies which are “almost”(?) able to be decomposed in this way are basically still agents, albeit perhaps less centrally so.
Much of this semi-formalism comes from noticing a subjective division within myself and some of the AI’s I’ve made where it seems natural to say that “this part of the agent is modeling the world” and “this part of the agent is optimizing X according to the world model.” Even though the abstractions seem imperfect, they feel like a good way of gesturing at the structure of my messy sense of how individual humans work. I am almost certainly incoherent in some ways, and I am confused how to rescue the notion of values/utility given that incoherence, but I have a sense that “he’s mostly coherent” can give rise to “he more-or-less values X.”
Two agents can either operate independently or cooperate for some surplus. Ideally there’s a unique way to fairly split the surplus, perhaps using lotteries or some shared currency which they can use to establish units of utility. It seems obvious to me that there are many cooperative arrangements that are decidedly unfair, but I’m pretty confused about whether it’s always possible to establish a fair split (even without lotteries? even without side-payments?) and whether there’s an objective and unique Schelling point for cooperation.
If there is a unique solution, it seems reasonable to me to, given a group of agents, consider the meta-agent that would be formed if each agent committed fully to engaging in fair cooperation. This meta-agent’s action would essentially be an element of the cartesian product of each agent’s action space. In the human context, this story gives rise to a hypothetical set of “human values” which capture the kinds of things that humans optimize for when cooperating.
This seems a bit limited, since it neglects things that real humans optimize for that are part of establishing cooperation (e.g. justice). Does it really make sense to say that justice isn’t a value of human societies because in the fully-cooperative context it’s unnecessary to take justice-affirming actions? (??)
Even when considering a single agent, we can consider the coalition of that agent’s time-slices(?). Like, if we consider Max at t=0 and Max at t=1 as distinct agents, we can consider how they’d behave if they were cooperative with each other. This frame brings in the confusions and complications from group-action, but it also introduces issues such as the nature of future-instances being dependent on past-actions. I have a sense that I only need to cooperate with real-futures, and am free to ignore the desires of unreal-counterfactuals, even if my past/present actions are deciding which futures are real. This almost certainly introduces some fixed-point shenanigans where unrealizing a future is uncooperative with that future but cooperative with the future that becomes realized, and I feel quite uncertain here. More generally, there’s the whole logical-connective stuff from FDT/TDT/UDT.
I currently suspect that if we get a good theory of how to handle partial-coherence, how to handle multi-agent aggregation, and how to handle intertemporal aggregation, then “human values” will shake out to be something like “the mostly-coherent aggregate of all humans that currently exist, and all intertemporal copies of that aggregate” but I might be deeply wrong. :confused wiggles:
1. ^
  The future function either returns a single future state or a distribution over future states. It doesn’t really matter since we can refactor the uncertainty from the distribution over futures into the distribution over worlds.
2. ^
  ″sensors ” is meant to include things like working memories and other introspection.
3. ^
  Similarly to the distribution over futures we can either have a distribution over contexts given a past or we can have a fixed context for a given past and pack the uncertainty into our world distribution. See also anthropics and “bridge laws” and related confusions.
4. ^
  Confusion alert! Sometimes a source of information contains a bias where it’s selected for steering someone who’s listening. I don’t know how to prove an information channel doesn’t have this property, but I do have a sense that neutrality is the default, so I can assume it here without too much trouble.
5. ^
  ..in the context of that particular past! Sometimes the future by itself doesn’t have all the relevant info (e.g. optimizing for the future matching the past).
What links here?
- Max Harms's comment on 3b. Formal (Faux) Corrigibility by Max Harms (13 Jun 2024 18:40 UTC; 2 points)