Will future LLMs be latent reasoners? Discussion of tiny reasoning models
There’s a recent paper making waves on ML twitter: https://arxiv.org/abs/2510.04871v1
tl;dr: they find that a very small (7M params) latent reasoning model can outperform much larger models on simple puzzle benchmarks as well as ARC-AGI
How to update on this re: whether future LLMs will be latent reasoners?
The paper shows that small specialist models can outperform large generalist models on niche tasks. I think I would have weakly predicted this before, but I am still surprised by the effect size (such a small model, so little data, etc.)
Counterpoint: there’s little evidence that the small specialist model can do generic tasks
Counter-counterpoint: then again, this might be solvable via scaffolding, e.g. a small, powerful latent reasoning core equipped with powerful tools and knowledge retrieval might outperform a large monolithic LLM
Also worth noting I was already somewhat convinced that future LLMs would be latent-reasoning-by-default: https://www.lesswrong.com/posts/4mtqQKvmHpQJ4dgj7/daniel-tan-s-shortform?commentId=ERhx3w96ueYZGzsiC
Overall a small update for me towards believing future LLMs will be latent reasoners. This has obvious implications for the monitoring agenda.
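For concreteness, "latent reasoning" here means the model iterates a hidden state internally instead of emitting chain-of-thought tokens. A toy PyTorch sketch of the general idea (a generic recurrent update; this is not the paper's actual TRM/HRM architecture, and all names and dimensions are made up):

```python
import torch
import torch.nn as nn

class ToyLatentReasoner(nn.Module):
    """Illustrative only: refines a latent state for n_steps before answering."""

    def __init__(self, d_model=128, n_steps=16):
        super().__init__()
        self.encode = nn.Linear(d_model, d_model)  # embed the puzzle/input
        self.step = nn.GRUCell(d_model, d_model)   # one small shared update, reused every step
        self.decode = nn.Linear(d_model, d_model)  # map the final latent to an answer
        self.n_steps = n_steps

    def forward(self, x):                          # x: (batch, d_model)
        h = torch.zeros_like(x)                    # initial latent state
        x = self.encode(x)
        for _ in range(self.n_steps):
            h = self.step(x, h)                    # "reasoning" happens here, in activations,
        return self.decode(h)                      # with no visible chain-of-thought tokens
```

The relevance to monitoring is that the intermediate steps live entirely in `h`, so there is nothing token-level to read.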
I think the counterpoint basically makes the paper ~0 evidence for the claim that large latent reasoners will exist by next year. In general, improvements on generic tasks matter more than improvements on specialized tasks, because of the messiness and complexity of reality. And one of my updates over the past 2 years is that RL/inference/pre-training scaling dwarfs scaffolding improvements by such large margins that scaffolding quickly becomes worthless, so I no longer consider scaffolded LLMs a relevant concern/threat.
I’d update back to your prior belief on how likely LLMs are to become latent reasoners/have something like neuralese.
I’d also be substantially worried about data leakage here.
I’m permanently retracting the claim that scaffolding doesn’t matter (though admittedly I was biased by things like AutoGPT no longer being talked about, presumably because newer LLMs have completely obsoleted that kind of scaffolding).
Edit: Apparently current RL is mostly just the good version of the scaffolding people envisioned in 2023, if you believe the paper here.
“I no longer consider scaffolded LLMs as a relevant concern/threat.”
I am extremely surprised to see you say that, to the point that I think I must be misinterpreting you. What tools an LLM has the ability to use seems to have huge effects on its ability to do things.
Concretely, Claude 3.5 Sonnet can do far more useful coding tasks with a single tool to execute bash commands on a VM than Claude 4.5 Sonnet can in the absence of that tool. Or is “while loop plus tools” not the type of scaffolding you’re referring to?
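To be concrete about “while loop plus tools”, the scaffold I have in mind is roughly the following (sketch only; `llm` is a placeholder for any chat-completion call, and the message format is made up rather than any particular provider's API):

```python
import subprocess

def run_bash(command: str) -> str:
    """The single tool: run a shell command on the VM and return its output."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=120)
    return result.stdout + result.stderr

def agent_loop(task: str, llm, max_steps: int = 50) -> str:
    """Minimal while-loop-plus-tools scaffold around a hypothetical `llm` callable."""
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        # `llm` is expected to return {"command": ...} or {"done": True, "answer": ...}
        reply = llm(transcript)
        if reply.get("done"):
            return reply["answer"]
        output = run_bash(reply["command"])
        transcript.append(f"$ {reply['command']}\n{output}")
    return "ran out of steps"
```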
To me, the fact that they didn’t scale beyond 7M params, despite trivial compute costs, is evidence that this particular strategy doesn’t scale.
I’m not sure that reasoning is as applicable to a paper whose central claim is that “less is more” for some tasks? I don’t think the claim is true for training general reasoners, but on the tasks they were looking at, they found that larger models would overfit:
“We attempted to increase capacity by increasing the number of layers in order to scale the model. Surprisingly, we found that adding layers decreased generalization due to overfitting.” (Section 4.4)
They also say that this is probably because of data scarcity. I think there are many reasons to expect this not to scale, but their attempt failing on this task doesn’t seem like strong evidence.
Yes, but I would have expected that, if the method were scalable even on narrow tasks, they would have demonstrated its performance on some task which didn’t saturate at 7M params.
Given that the primary motivation for the author was how well the original HRM paper did on ARC-AGI and how the architecture could be improved, it seems like a reasonable choice to show how to improve the architecture to perform better on the same task.
I agree it’s a small amount of evidence that they didn’t try other tasks, but as is the story seems pretty plausible.
Yes, their goal is to make extremely parameter-efficient tiny models, which is quite different from the goal of making scalable large models. Tiny LMs and LLMs have evolved to have their own sets of techniques. Parameter sharing and recurrence work well for tiny models but increase compute costs a lot for large ones, for example.
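To make that tradeoff concrete, here is a toy PyTorch comparison (arbitrary sizes, nothing to do with the paper's actual code): a weight-tied block applied 16 times has the parameter count of one layer but the compute of sixteen, whereas sixteen untied layers cost ~16x the parameters for the same compute.

```python
import torch.nn as nn

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

# Weight-tied / recurrent: one block, reused for 16 forward passes.
# Parameters of a single layer, compute of sixteen.
tied_block = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)

# Untied: sixteen distinct layers (nn.TransformerEncoder deep-copies the layer),
# so ~16x the parameters for the same amount of compute.
untied_stack = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),
    num_layers=16,
)

print(n_params(tied_block), n_params(untied_stack))  # second number is ~16x the first
```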