Yes, but I would have expected that if the method is scalable, even on narrow tasks, they would demonstrate its performance on some task which didn’t saturate at 7M params.
Given that the primary motivation for the author was how well the original HRM paper did on ARC-AGI and how the architecture could be improved, it seems like a reasonable choice to show how to improve the architecture to perform better on the same task.
I agree it’s a small amount of evidence that they didn’t try other tasks, but as is the story seems pretty plausible.
Yes, but I would have expected that if the method is scalable, even on narrow tasks, they would demonstrate its performance on some task which didn’t saturate at 7M params.
Given that the primary motivation for the author was how well the original HRM paper did on ARC-AGI and how the architecture could be improved, it seems like a reasonable choice to show how to improve the architecture to perform better on the same task.
I agree it’s a small amount of evidence that they didn’t try other tasks, but as is the story seems pretty plausible.