I think you haven’t given sufficient reason to expect your predictions to come true.
Most things don’t scale. Why should we expect HRMs to be an exception?
Mostly because the efficiencies being claimed here are truly staggering. HRM claims to have SOTA performance on ARC-AGI problem sets with 27 million parameters and 1000 training examples (with no general pretraining!). o3-mini-high, the next model on the board, has a parameter count on the order of hundreds of billions and was trained on the entire internet. Taking roughly 100 billion parameters as the baseline, this is a 3700x efficiency improvement in param count and probably more than a 10000x improvement in data efficiency. At this scale, even if the scaling peters out at 1000x, you are talking about an entirely new paradigm, especially now that there is a partial replication as well.
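As a rough check on the parameter-count ratio above, here is a minimal sketch. The 100-billion figure for o3-mini-high is an assumption chosen to match "on the order of hundreds of billions", not a published number, and the variable names are illustrative only.

```python
# Back-of-envelope check of the parameter-count ratio quoted in the comment.
HRM_PARAMS = 27e6                # 27 million parameters, as stated for HRM
ASSUMED_O3_MINI_PARAMS = 100e9   # ASSUMPTION: not a published figure


def param_ratio(large: float, small: float) -> float:
    """How many times more parameters the larger model has."""
    return large / small


ratio = param_ratio(ASSUMED_O3_MINI_PARAMS, HRM_PARAMS)
print(f"{ratio:,.0f}x")  # roughly 3,700x under the 100B assumption
```

If the true count were several hundred billion, the ratio would scale up proportionally, so 3700x sits at the low end of the stated range.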
Well, o3 wasn’t optimized to perform well on ARC-AGI as its primary purpose—is this a fair comparison?
I’ve also been tracking the development of these kinds of techniques for a year, and they have consistently been showing surprising improvements. The test time training people have been publishing for a while, and the ceiling seems nowhere in sight.
To be honest a complete recounting of why I believe this is somewhat beyond a comment’s length. I also feel very strongly that progress is accelerating, and that if I don’t say something now there will not be much time to react. Hence the post.
Perhaps it’s beyond the length of a comment, but why not recount it in the post?
I think that the most legible arguments are already in the post. The only thing I would make clearer is that the human brain is an existence proof that such highly efficient continuous learning algorithms exist, and therefore I see the development of these models as not particularly surprising. Steven Byrnes has some similar intuitions in his Foom and Doom series.
I read (some of) that series and disagreed with his assessment on the same grounds.
In which case, the best thing is probably for us to wait and see if the predictions come to pass, if that’s okay with you. I might also be afk for a bit so might not be able to immediately reply to any further comments.
The other thing that seems strange here is that the parameter counts are far, far below those of the human brain. I mean, yeah, o3 is probably 1 or 2 OOM below the brain, but that could be explained away by the fact that o3 is still missing some very basic human abilities, and perhaps it has found interesting efficiencies.
But 27 million parameters is just so many OOM below the human brain that I’m expecting a gotcha. Even at a 16-bit quantization, that’s only about 54MB. Which means that the model’s world knowledge must be ridiculously limited compared to even simple LLMs. Maybe if it’s a specialist model that’s tuned for just a handful of very specific benchmarks?
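The 54MB figure corresponds to storing each of the 27 million parameters in 16 bits (2 bytes). A minimal sketch of that arithmetic follows, assuming raw weight storage only and 1 MB = 10^6 bytes. The helper name `model_size_mb` is made up for illustration.

```python
def model_size_mb(num_params: float, bits_per_param: int) -> float:
    """Raw weight storage in megabytes (1 MB = 10**6 bytes), ignoring overhead."""
    return num_params * bits_per_param / 8 / 1e6


HRM_PARAMS = 27e6  # parameter count quoted in the thread

# Compare a few common precisions; 16 bits gives the 54 MB mentioned above.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_size_mb(HRM_PARAMS, bits):.1f} MB")
```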
“An HRM model trained to do ARC-AGI and nothing else” seems vaguely analogous to “An AlphaZero model trained to play Go and nothing else”. Right?
If you buy that, then I would note that HRM & AlphaZero probably have quite similar numbers of parameters (27M vs maybe 23M, see footnote here). And AlphaZero was just a plain old ResNet, I think.
So I don’t see “HRM solves ARC-AGI with 27M parameters” as evidence for HRM having unusual parameter efficiency. Right? Sorry if I’m misunderstanding.
It is weak evidence; we simply won’t know until we scale it up. If it turns out to be automatically good at 3D spatial understanding with extra scale, then that starts to become evidence that it has better scaling properties. (To me it is clear that LLMs/Transformers won’t scale to AGI: xAI is already close to maxed-out scaling, and Tesla Autopilot probably does everything mostly right but is far less data-efficient than people.)
ARC-AGI, v1 and v2 both, is a very spatial-reasoning-shaped problem. And LLMs are not very spatial-reasoning-shaped.
It could be that this arch is an unusually good fit for spatial reasoning problems, and a poor fit for others.
We haven’t seen it used for either text generation or image generation, both of which are hot topics in AI right now. Which is very weak evidence that it’s unsuitable for that kind of task. And much stronger evidence that the authors couldn’t get it to work on this kind of task.
I wonder how much we need to worry about hybrid architectures. If LLMs do text generation well and continuous learning models do spatial reasoning well, and someone figures out an architecture that lets their strengths synergize with each other...
That is the basic idea behind Energy based transformers and test time training!
OK, our intelligence is very spatial-reasoning-shaped. Biological architecture can’t do language until it has many params. If this architecture is terrible at text or image gen, that isn’t evidence that it won’t in fact scale to AGI and best Transformers with more compute. We simply won’t know until it is scaled up.
Doesn’t the human brain’s structure provide something closer to an upper bound rather than a lower bound on the number of parameters required for higher reasoning?
Higher reasoning evolved in humans over a short period of time. And it is speculated that it was mostly arrived at simply by scaling up chimp brains.
This implies that our brains are very far from optimised for higher reasoning, so we should expect that to whatever extent factors other than scale can contribute to higher-reasoning ability, it is possible for brains smaller than our own to engage in higher reasoning.
The human brain should be seen as evidence that a certain scale is ~sufficient, but not that it is necessary.
The human brain is often estimated to have 10^14 synapses, which would be a 100T model, give or take. Except that individual neurons also have a bunch of internal parameters, which might complicate things.
If you told me that the human brain was massively inefficient, and that you had squeezed human level AGI into 1T parameters, I would be only mildly surprised.
For that matter, if you told me you had squeezed a weak AGI into 30B parameters, I’d be interested in the claim. Qwen3 really is surprisingly capable in that size range. If you told me 4B, I’d be very skeptical, but then again, Gemma 3n does implausibly well on my diverse private benchmarks, and it’s technically multi-modal. At the very least, I’d accept it as the premise of a science fiction horror story about tiny, unaligned AIs.
But if we drop all the way to 30 million parameters, I am profoundly suspicious of any kind of general model with language skills and reasonable world knowledge. Even if you store language and world knowledge as compressed text files, you’re going to hit some pretty hard limits at that size. That’s a 60MB ZIP file or less. You’d be talking about needing only 1⁄3,000,000th of the parameters of the brain. Which is a lot of orders of magnitude.
At that size, I’m assuming that any kind of genuinely interesting model would be something like AlphaGo, that demonstrates impressive knowledge and learning abilities in a very narrow domain. Which is fine! It might even be the final warning that AGI is inevitable. But I would still expect more than 6 months would be required to scale back up from such a tiny model to something with general world knowledge, language and common sense.
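The comparison between a 30-million-parameter model and the brain can be sketched as follows. It treats each of the estimated 10^14 synapses as one parameter and assumes 2 bytes per parameter for the 60MB figure. Both are simplifying assumptions, and the constant names are illustrative.

```python
import math

BRAIN_SYNAPSES = 1e14        # common estimate quoted in the thread
SMALL_MODEL_PARAMS = 30e6    # the 30-million-parameter case discussed above
BYTES_PER_PARAM = 2          # ASSUMPTION: 16-bit weights

ratio = BRAIN_SYNAPSES / SMALL_MODEL_PARAMS   # brain "parameters" per model parameter
orders_of_magnitude = math.log10(ratio)       # gap expressed in OOMs
size_mb = SMALL_MODEL_PARAMS * BYTES_PER_PARAM / 1e6

print(f"ratio: about 1/{ratio:,.0f} of the brain")
print(f"gap: {orders_of_magnitude:.1f} orders of magnitude")
print(f"weights: {size_mb:.0f} MB")
```

Under these assumptions the gap is about six and a half orders of magnitude, consistent with the "1⁄3,000,000th" estimate.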
We can still see that a chimp-scale brain with this architecture isn’t sufficient, and human-built AI architectures were also only developed over a short period of time. Backprop and large-scale training in parallel for one individual might give AIs an advantage that chimp/human brains don’t have, but it is unclear whether this overcomes the widely applicable unhobbling from the much longer efforts by evolution to build minds for efficient online learning robots.
To be clear, I don’t think that HRM with 27 million params will be natural language capable. However, if my assumptions are correct, a scaled-up version of HRM should be able to attain performance similar to frontier models while learning online and being much smaller in size (1-2 OOMs smaller, based on a rough hunch).