Cole Wyeth comments on I am worried about near-term non-LLM AI developments

Cole Wyeth 31 Jul 2025 14:14 UTC
23 points
12
I think you haven’t given sufficient reason to expect your predictions to come true.
Most things don’t scale. Why should we expect HRMs to be an exception?
- testingthewaters 31 Jul 2025 14:23 UTC
  11 points
  −12
  Parent
  Mostly because the efficiencies being claimed here are truly staggering. HRM claims to have SOTA performance on ARC-AGI problem sets with 27 million parameters and 1000 training examples (with no general pretraining!). O3-mini-high, the next model on the board, has a parameter count on the order of hundreds of billions and was trained on the entire internet. This is a 3700x efficiency improvement in param count and probably more than 10000x improvement in data efficiency. At this scale even if the scaling peters out at 1000x you are talking about an entirely new paradigm, especially now that there is a partial replication as well.
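
  As a rough check of the parameter-count ratio quoted above, here is a minimal sketch. It assumes 100 billion parameters as a stand-in for "hundreds of billions"; the real parameter count of o3-mini-high is not public, and the constant names are illustrative only.

  ```python
  # Back-of-the-envelope check of the parameter-count ratio.
  # ASSUMPTION: 100e9 stands in for "hundreds of billions"; the actual
  # parameter count of o3-mini-high has not been published.
  HRM_PARAMS = 27e6
  ASSUMED_LLM_PARAMS = 100e9


  def param_ratio(large: float, small: float) -> float:
      """Return how many times more parameters the larger model has."""
      return large / small


  ratio = param_ratio(ASSUMED_LLM_PARAMS, HRM_PARAMS)
  print(f"{ratio:,.0f}x")
  ```

  Under that assumption the ratio comes out at roughly 3700x; a larger assumed parameter count would scale the figure proportionally.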
  - Cole Wyeth 31 Jul 2025 14:29 UTC
    18 points
    7
    Parent
    Well, o3 wasn’t optimized to perform well on ARC-AGI as its primary purpose—is this a fair comparison?
    - testingthewaters 31 Jul 2025 14:31 UTC
      3 points
      2
      Parent
      I’ve also been tracking the development of these kinds of techniques for a year, and they have consistently been showing surprising improvements. The test time training people have been publishing for a while, and the ceiling seems nowhere in sight.
      
      To be honest a complete recounting of why I believe this is somewhat beyond a comment’s length. I also feel very strongly that progress is accelerating, and that if I don’t say something now there will not be much time to react. Hence the post.
      - Cole Wyeth 31 Jul 2025 14:48 UTC
        16 points
        12
        Parent
        Perhaps it’s beyond the length of a comment, but why not recount it in the post?
        testingthewaters 31 Jul 2025 14:52 UTC
        10 points
        3
        Parent
        I think that the most legible arguments are already in the post. The only thing I would make clearer is that the human brain is an existence proof that such highly efficient continuous learning algorithms exist, and therefore I see the development of these models as not particularly surprising. Steven Byrnes has some similar intuitions in his Foom and Doom series.
        Cole Wyeth 31 Jul 2025 14:56 UTC
        9 points
        3
        Parent
        I read (some of) that series and disagreed with his assessment on the same grounds.
        testingthewaters 31 Jul 2025 15:11 UTC
        5 points
        2
        Parent
        In which case, the best thing is probably for us to wait and see if the predictions come to pass, if that’s okay with you. I might also be afk for a bit so might not be able to immediately reply to any further comments.
  - Random Developer 31 Jul 2025 21:32 UTC
    7 points
    0
    Parent
    The other thing that seems strange here is that the parameter counts are far, far below those of the human brain. I mean, yeah, o3 is probably 1 or 2 OOM below the brain, but that could be explained away by the fact that o3 is still missing some very basic human abilities, and perhaps it has found interesting efficiencies.
    
    But 27 million parameters is just so many OOM below the human brain that I’m expecting a gotcha. Even at a 16-bit quantization, that’s only about 54MB. Which means that the model’s world knowledge must be ridiculously limited compared to even simple LLMs. Maybe if it’s a specialist model that’s tuned for just a handful of very specific benchmarks?
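
    The storage figure can be reproduced with a short sketch, assuming 2 bytes (16 bits) per parameter and decimal megabytes. The helper name `model_size_mb` is hypothetical, not taken from any library.

    ```python
    def model_size_mb(n_params: int, bits_per_param: int) -> float:
        """Raw weight storage in decimal megabytes (1 MB = 1e6 bytes)."""
        return n_params * bits_per_param / 8 / 1e6


    # ASSUMPTION: weights stored at 16 bits each, with no file overhead.
    size_16bit = model_size_mb(27_000_000, 16)
    print(f"{size_16bit:.0f} MB")
    ```

    This counts only the raw weights; file-format overhead or a different precision would change the number.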
    - Steven Byrnes 1 Aug 2025 1:15 UTC
      41 points
      12
      Parent
      “An HRM model trained to do ARC-AGI and nothing else” seems vaguely analogous to “An AlphaZero model trained to play Go and nothing else”. Right?
      If you buy that, then I would note that HRM & AlphaZero probably have quite similar numbers of parameters (27M vs maybe 23M, see footnote here). And AlphaZero was just a plain old ResNet, I think.
      So I don’t see “HRM solves ARC-AGI with 27M parameters” as evidence for HRM having unusual parameter efficiency. Right? Sorry if I’m misunderstanding.
      - RussellThor 1 Aug 2025 3:19 UTC
        2 points
        0
        Parent
        It is weak evidence; we simply won’t know until we scale it up. If it is automatically good at 3D spatial understanding with extra scale-up, then that starts to become evidence it has better scaling properties. (To me it is clear that LLMs/Transformers won’t scale to AGI: xAI is already close to maxed out on scaling, and Tesla Autopilot probably does everything mostly right but is far less data-efficient than people.)
    - ACCount 31 Jul 2025 22:30 UTC
      12 points
      5
      Parent
      ARC-AGI, v1 and v2 both, is a very spatial-reasoning-shaped problem. And LLMs are not very spatial-reasoning-shaped.
      It could be that this arch is an unusually good fit for spatial reasoning problems, and a poor fit for others.
      We haven’t seen it used for either text generation or image generation, both of which are hot topics in AI right now. Which is very weak evidence that it’s unsuitable for that kind of task. And much stronger evidence that the authors couldn’t get it to work on this kind of task.
      - Kaj_Sotala 1 Aug 2025 18:21 UTC
        7 points
        1
        Parent
        I wonder how much we need to worry about hybrid architectures. If LLMs do text generation well and continuous learning models do spatial reasoning well, and someone figures out an architecture that lets their strengths synergize with each other...
        testingthewaters 1 Aug 2025 19:03 UTC
        6 points
        0
        Parent
        That is the basic idea behind Energy based transformers and test time training!
      - RussellThor 1 Aug 2025 3:16 UTC
        7 points
        3
        Parent
        OK our intelligence is very spatial-reasoning shaped. Bio architecture can’t do language until it has many params. If it is terrible at text or image gen that isn’t evidence it won’t in fact scale to AGI and best Transformers with more compute. We simply won’t know until it is scaled up.
    - james oofou 1 Aug 2025 8:35 UTC
      9 points
      5
      Parent
      Doesn’t the human brain’s structure provide something closer to an upper bound rather than a lower bound on the number of parameters required for higher reasoning?
      Higher reasoning evolved in humans over a short period of time. And it is speculated that it was mostly arrived at simply by scaling up chimp brains.
      This implies that our brains are very far from optimised for higher reasoning, so we should expect that to whatever extent factors other than scale can contribute to higher-reasoning ability, it is possible for brains smaller than our own to engage in higher reasoning.
      The human brain should be seen as evidence that a certain scale is ~sufficient, but not that it is necessary.
      - Random Developer 2 Aug 2025 13:27 UTC
        11 points
        6
        Parent
        
        “The human brain should be seen as evidence that a certain scale is ~sufficient, but not that it is necessary.”
        
        The human brain is often estimated to have 10^14 synapses, which would be a 100T model, give or take. Except that individual neurons also have a bunch of internal parameters, which might complicate things.
        
        If you told me that the human brain was massively inefficient, and that you had squeezed human level AGI into 1T parameters, I would be only mildly surprised.
        
        For that matter, if you told me you had squeezed a weak AGI into 30B parameters, I’d be interested in the claim. Qwen3 really is surprisingly capable in that size range. If you told me 4B, I’d be very skeptical, but then again, Gemma 3n does implausibly well on my diverse private benchmarks, and it’s technically multi-modal. At the very least, I’d accept it as the premise of a science fiction horror story about tiny, unaligned AIs.
        
        But if we drop all the way to 30 million parameters, I am profoundly suspicious of any kind of general model with language skills and reasonable world knowledge. Even if you store language and world knowledge as compressed text files, you’re going to hit some pretty hard limits at that size. That’s a 60MB ZIP file or less. You’d be talking about needing only 1/3,000,000th of the parameters of the brain. Which is a lot of orders of magnitude.
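
        The size of that gap can be sketched as follows, assuming (as this comment does) that one synapse counts as one parameter and using the 10^14 synapse estimate; the variable names are illustrative only.

        ```python
        import math

        # ASSUMPTION: one synapse is treated as one parameter, which ignores
        # any internal parameters of individual neurons.
        BRAIN_SYNAPSES = 1e14
        SMALL_MODEL_PARAMS = 30e6

        brain_to_model = BRAIN_SYNAPSES / SMALL_MODEL_PARAMS
        orders_of_magnitude = math.log10(brain_to_model)

        print(f"ratio: about {brain_to_model:,.0f} to 1")
        print(f"orders of magnitude: {orders_of_magnitude:.1f}")
        ```

        On those assumptions the gap is a little over six orders of magnitude.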
        
        At that size, I’m assuming that any kind of genuinely interesting model would be something like AlphaGo, that demonstrates impressive knowledge and learning abilities in a very narrow domain. Which is fine! It might even be the final warning that AGI is inevitable. But I would still expect more than 6 months would be required to scale back up from such a tiny model to something with general world knowledge, language and common sense.
      - Vladimir_Nesov 1 Aug 2025 18:58 UTC
        3 points
        1
        Parent
        
        “Higher reasoning evolved in humans over a short period of time. … The human brain should be seen as evidence that a certain scale is ~sufficient, but not that it is necessary.”
        
        We can still see that a chimp scale brain with this architecture isn’t sufficient, and human-built AI architectures were also only developed over a short period of time. Backprop and large scale training in parallel for one individual might give AIs an advantage that chimp/human brains don’t have, but unclear if this overcomes the widely applicable unhobbling from the much longer efforts by evolution to build minds for efficient online learning robots.
    - testingthewaters 31 Jul 2025 22:26 UTC
      3 points
      0
      Parent
      To be clear, I don’t think that HRM with 27 million params will be natural language capable. However, if my assumptions are correct, a scaled up version of HRM should be able to attain performance similar to frontier models while learning online and being relatively much smaller in size (1-2 OOMs smaller, based on a rough hunch).