I’m not sure that reasoning is as applicable to a paper whose central claim is that “less is more” for some tasks? I don’t think the claim is true for training general reasoners, but on the tasks they were looking at, they found that larger models would overfit:
We attempted to increase capacity by increasing the number of layers in order to scale the model. Surprisingly, we found that adding layers decreased generalization due to overfitting.
They also say that this is probably because of data scarcity. I think there are many reasons to expect this not to scale, but the fact that their scaling attempts failed on this particular task doesn’t seem like strong evidence on its own.
Given that the author’s primary motivation was how well the original HRM paper did on ARC-AGI and whether the architecture could be improved, it seems like a reasonable choice to demonstrate the improved architecture on that same task.
I agree that their not trying other tasks is a small amount of evidence against the claim, but as it stands the story seems pretty plausible.