I agree that the paper is not terribly novel. There is no new maths here or radical new technique. If you wanted to be reductive, you could just call it two RNN modules stacked on top of each other, run at different clock speeds. But the fact that this simple setup is enough to trigger such outsized improvements (which have been partially replicated externally) is what is alarming to me.
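To make the "two clock speeds" framing concrete, here is a minimal sketch of that kind of setup. This is my own illustration, not the paper's actual HRM implementation: the GRU cells, module names, and update ratio are all assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class TwoSpeedRNN(nn.Module):
    """Illustrative two-module recurrence: a fast ("low-level") cell that
    updates every step, and a slow ("high-level") cell that updates only
    every `slow_period` steps and conditions the fast cell."""

    def __init__(self, input_dim: int, hidden_dim: int, slow_period: int = 4):
        super().__init__()
        self.fast = nn.GRUCell(input_dim + hidden_dim, hidden_dim)
        self.slow = nn.GRUCell(hidden_dim, hidden_dim)
        self.slow_period = slow_period

    def forward(self, x):  # x: (batch, seq_len, input_dim)
        batch, seq_len, _ = x.shape
        h_fast = x.new_zeros(batch, self.fast.hidden_size)
        h_slow = x.new_zeros(batch, self.slow.hidden_size)
        outputs = []
        for t in range(seq_len):
            # Fast module ticks every step, conditioned on the slow state.
            h_fast = self.fast(torch.cat([x[:, t], h_slow], dim=-1), h_fast)
            # Slow module ticks only every `slow_period` steps,
            # summarizing the fast state at its slower clock rate.
            if (t + 1) % self.slow_period == 0:
                h_slow = self.slow(h_fast, h_slow)
            outputs.append(h_fast)
        return torch.stack(outputs, dim=1), (h_fast, h_slow)
```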
I agree that the results are legit; I'm just taking issue with the authors presenting them without prior-work context (e.g. setting the wrong reference class so that the improvement over baselines appears larger). RNNs getting outsized performance on maze/sudoku is to be expected, and the main ARC result seems to be more of a strong data-augmentation + SGD baseline than something unique to the architecture; ARC-1 was pretty susceptible to this (e.g. ARC-AGI Without Pretraining).
This being said, I think it's a big deal that various RNN architectures have such different characteristics on these limit cases for transformers; it points to a pretty large jump in capabilities once scaling/pretraining is cracked. I think it'd be good for more people working on alignment to study what types of behaviors are exhibited in these sorts of models at small scale, with the expectation that the paradigm will eventually shift in this direction.
Update: ARC has published a blog post analyzing this: https://arcprize.org/blog/hrm-analysis. As expected, swapping in a transformer works approximately the same.