I agree that the results are legit; I'm just taking issue with the authors presenting them without prior-work context (e.g. setting the wrong reference class so that the improvement over baselines appears larger than it is). RNNs getting outsized performance on maze/sudoku is to be expected, and the main ARC result seems to be more of a strong data augmentation + SGD baseline than something unique to the architecture. ARC-1 was pretty susceptible to this (e.g. ARC-AGI Without Pretraining).
That being said, I think it's a big deal that various RNN architectures have such different characteristics on these limit cases for transformers; it points to a pretty large jump in capabilities once scaling/pretraining is cracked. I think it'd be good for more people working on alignment to study what types of behaviors are exhibited in these sorts of models at small scale, with the expectation that the paradigm will eventually shift in this direction.
Update: ARC has published a blog post analyzing this, https://arcprize.org/blog/hrm-analysis. As expected, swapping in a transformer works approximately the same.