I was very impressed with the ARC-AGI results, so I read the entire paper and also browsed the code a fair amount.
Only after browsing the code did I realize that they likely train on all evaluation tasks in addition to the training tasks (correct me if I'm wrong). During inference they condition only on x* and on the task embedding to predict y*, rather than on (x,y)_{1..3}. The only way they could have obtained that task embedding is by training on it. And the evaluation tasks are harder than those in the train set.
They score a lot higher than the baseline transformer, so clearly there's a lot of merit in what they're doing. But in the setting of training on evaluation tasks, you can train on only 1 target ARC-AGI-1 task instead of 1000 tasks and still get 20% accuracy: https://iliao2345.github.io/blog_posts/arc_agi_without_pretraining/arc_agi_without_pretraining.html. Given this, it doesn't look earth-shattering.
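To make the distinction concrete, here's a rough sketch of the two regimes as I understand them (my own toy code, not the authors'; all names are made up):

    # Toy illustration of the two conditioning regimes (my sketch, not the paper's code).

    def few_shot_predict(model, demo_pairs, x_query):
        # In-context: the (x, y) demonstration pairs are part of the model's input,
        # so nothing about the specific task needs to have been learned at training time.
        return model(demo_pairs, x_query)

    def task_embedding_predict(model, puzzle_embeddings, task_id, x_query):
        # What the HRM setup appears to do: a learned per-puzzle embedding stands in
        # for the demonstrations, and it only exists for puzzles seen during training.
        return model(puzzle_embeddings[task_id], x_query)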
It does sound like it. Someone on Twitter pointed out this passage from page 12 of the HRM paper (emphasis added):
For ARC-AGI challenge, we start with all input-output example pairs in the training and the evaluation sets. The dataset is augmented by applying translations, rotations, flips, and color permutations to the puzzles. Each task example is prepended with a learnable special token that represents the puzzle it belongs to. At test time, we proceed as follows for each test input in the evaluation set: (1) Generate and solve 1000 augmented variants and, for each, apply the inverse-augmentation transform to obtain a prediction. (2) Choose the two most popular predictions as the final outputs. All results are reported on the evaluation set.
Which sounds like they used the original training and evaluation sets for training, then generated a new set of "augmented variants" of the original puzzles for the actual evaluation. So when the model is evaluated, it knows that "this is a variant of puzzle X, which I've seen before".
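Concretely, I read the quoted test-time procedure as something like this (my own sketch with made-up helper names; only the 8 dihedral augmentations are shown, not the translations or color permutations):

    from collections import Counter
    import numpy as np

    def dihedral_augmentations():
        # The 8 grid symmetries, each paired with its inverse transform.
        for k in range(4):
            yield ((lambda g, k=k: np.rot90(g, k)),
                   (lambda g, k=k: np.rot90(g, -k)))
            yield ((lambda g, k=k: np.fliplr(np.rot90(g, k))),
                   (lambda g, k=k: np.rot90(np.fliplr(g), -k)))

    def predict_two(solve, puzzle_id, test_input):
        # `solve` is a stand-in for the trained model conditioned on the learned puzzle token.
        votes = Counter()
        for aug, inv in dihedral_augmentations():
            pred = inv(solve(puzzle_id, aug(test_input)))             # undo the augmentation
            votes[tuple(map(tuple, np.asarray(pred).tolist()))] += 1  # vote on the de-augmented grid
        return [np.array(g) for g, _ in votes.most_common(2)]         # two most popular answers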
EDIT: Though, also according to somebody on Twitter, it sounds like this might just be the ARC-AGI materials being confusingly named?
No foul play here. The ARC-AGI dataset is confusing because it has 2 different kinds of “train” sets. There are train set puzzles and train input-output examples within each puzzle. They trained on only the train examples of the train and validation puzzle sets, which is fair. [...]
It’s “all input-output example pairs” not “all input-output pairs”. The examples are the training data in ARC-AGI.
They trained on only the train examples of the train and validation puzzle sets, which is fair.
Yes, I agree; I wasn't implying that it's foul play. It's just less impressive than I initially thought, because:
It’s finetuning on task examples and not in-context few-shot learning
They finetune on the harder evaluation set and not only on the easier train set, so they don't demonstrate generalization across the easy->hard distribution shift
The result I linked to was 20% on ARC-AGI-1, obtained by fitting the examples of only 1 evaluation task at a time with an MLP-type network, vs. the 40% result in the paper, which uses 1000 tasks. These numbers are not directly comparable, because they did a fair bit of custom architectural engineering to reach 20%, but it really put the 40% in perspective for me.
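For anyone else confused by the two kinds of "train" in the quoted tweet: each ARC-AGI puzzle is a single JSON file that carries its own train/test split, separate from the training-vs-evaluation split between puzzle files. A toy sketch of the layout (grids are made up):

    # Layout of one ARC-AGI task file (toy grids; real grids are up to 30x30 with colors 0-9).
    task = {
        "train": [  # per-puzzle demonstration pairs: the "examples" HRM reportedly trains on,
            {"input": [[1, 0], [0, 2]], "output": [[0, 1], [2, 0]]},  # for both the train and
            {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},            # evaluation puzzle sets
        ],
        "test": [   # the held-out query pair(s) within the same puzzle
            {"input": [[5, 0], [0, 5]], "output": [[0, 5], [5, 0]]},
        ],
    }
    # "Train puzzles" vs "evaluation puzzles" is the other split: which file a task lives in,
    # not which pairs sit inside the file.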