I was very impressed with the ARC-AGI results, so I read the entire paper and browsed the code a fair amount.
Only after browsing the code did I realize that they likely train on all evaluation tasks in addition to the training tasks (correct me if I'm wrong). During inference they condition only on x* and on a task embedding to predict y*, instead of on the demonstration pairs (x, y)_{1..3}. The only way they could have obtained that task embedding is by training it. And the evaluation tasks are harder than those in the train set.
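For concreteness, here's a minimal sketch of the distinction as I understand it; `TaskConditionedModel`, `backbone`, and the demo-passing interface in the final comment are my own illustrative names, not the paper's actual code:

```python
import torch
import torch.nn as nn

class TaskConditionedModel(nn.Module):
    """One learned embedding per task ID, covering eval tasks too, so the
    eval-task embeddings can only exist if they were fit during training."""

    def __init__(self, num_tasks: int, d_model: int):
        super().__init__()
        self.task_embedding = nn.Embedding(num_tasks, d_model)
        self.backbone = nn.Linear(d_model, d_model)  # stand-in for the real network

    def forward(self, x_star: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        # Conditions only on x* plus the trained task embedding;
        # no demonstration pairs (x, y)_{1..3} are seen at inference time.
        return self.backbone(x_star + self.task_embedding(task_id))

# In-context few-shot learning would instead pass the demos at inference time,
# with no per-task trained parameters, e.g.:
#   y_star = model(demos=[(x1, y1), (x2, y2), (x3, y3)], query=x_star)

model = TaskConditionedModel(num_tasks=1000, d_model=64)
y_star = model(torch.randn(1, 64), torch.tensor([42]))  # task 42's embedding was trained
```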
They score a lot higher than the baseline transformer, so there's clearly merit in what they're doing. But once you allow training on evaluation tasks, you can train on only a single target ARC-AGI-1 task instead of 1000 tasks and still get 20% accuracy: https://iliao2345.github.io/blog_posts/arc_agi_without_pretraining/arc_agi_without_pretraining.html. Given that, the result doesn't look earth-shattering.
Yes, I agree, and I wasn't implying foul play. I just find it less impressive than I initially thought, because:
- It's finetuning on task examples, not in-context few-shot learning.
- They finetune on the harder evaluation set rather than only on the easier train set, so they don't demonstrate generalization across the easy-to-hard distribution shift.
- The result I linked reached 20% on ARC-AGI-1 by fitting examples for a single evaluation task with an MLP-style network, versus the paper's 40% using 1000 tasks. The numbers aren't directly comparable, since that work did a fair bit of custom architectural engineering to reach 20%, but it really puts the 40% in perspective for me.