They trained on only the train examples of the train and validation puzzle sets, which is fair.
Yes, I agree; I wasn't implying that's foul play. I just found it less impressive than I initially thought because:
It's finetuning on task examples rather than in-context few-shot learning (see the sketch after this list)
They finetune on the harder evaluation set, not only on the easier train set, so they don't demonstrate generalization across the easy->hard distribution shift
The result I linked to reached 20% on ARC-AGI-1 by fitting examples from only 1 evaluation task at a time with an MLP-type network, versus the 40% result in the paper, which used 1000 tasks. The numbers aren't directly comparable, since they did a fair bit of custom architectural engineering to reach 20%, but it really put the 40% in perspective for me.
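To make the distinction concrete, here is a minimal sketch of what "fitting examples for one evaluation task" looks like: a small network is trained via gradient descent on a single task's demonstration pairs and then applied to that task's test input, as opposed to in-context few-shot learning, where the demonstrations go into the prompt of a frozen model and no weights are updated. Everything here (the fixed 5x5 grids, the toy recolor rule, the layer sizes) is an illustrative assumption on my part, not the architecture from either result:

```python
# Minimal sketch of per-task fitting: train a small MLP on one ARC task's
# demonstration pairs, then predict that task's test output. Grid size,
# the toy task, and the layer widths are all illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_COLORS = 10   # ARC grids use 10 colors (0-9)
H = W = 5         # assumed fixed grid size, to keep the sketch simple
CELLS = H * W

def encode(grid):
    """Flatten an HxW integer grid into a one-hot float vector."""
    return F.one_hot(grid.flatten(), NUM_COLORS).float().flatten()

# Toy stand-in for one task's demonstration pairs: "recolor 1 -> 2".
inputs = [torch.randint(0, 2, (H, W)) for _ in range(4)]
train_pairs = [(x, torch.where(x == 1, torch.tensor(2), x)) for x in inputs]

model = nn.Sequential(
    nn.Linear(CELLS * NUM_COLORS, 256),
    nn.ReLU(),
    nn.Linear(256, CELLS * NUM_COLORS),  # per-cell color logits
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Overfit the handful of demonstrations for this one task.
for step in range(500):
    for x, y in train_pairs:
        logits = model(encode(x)).view(CELLS, NUM_COLORS)
        loss = loss_fn(logits, y.flatten())
        opt.zero_grad()
        loss.backward()
        opt.step()

# Apply the task-specific model to the task's test input.
test_input = torch.randint(0, 2, (H, W))
pred = model(encode(test_input)).view(CELLS, NUM_COLORS).argmax(-1).view(H, W)
print(pred)
```

The actual 20% result involved a lot more architectural work than this; the sketch is only meant to show the training setup, i.e. that the "learning" happens in the weights, per task, rather than in context.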