I was very impressed with the ARC-AGI results, so I read the entire paper and also browsed the code a fair amount.
Only after browsing the code did I realize that they likely train on all evaluation tasks in addition to the training tasks (correct me if I'm wrong). During inference they condition only on x* and on the task embedding to predict y*, rather than on (x,y)_{1..3}. The only way they could have obtained that task embedding is by training on it. And the evaluation tasks are harder than those in the train set.
They score a lot higher than the baseline transformer, so clearly there's a lot of merit in what they're doing. But in the setting of training on evaluation tasks, you can train on only 1 target ARC-AGI-1 task instead of 1000 tasks and still get 20% accuracy: https://iliao2345.github.io/blog_posts/arc_agi_without_pretraining/arc_agi_without_pretraining.html. Given this, it doesn't look earth-shattering.
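To make the distinction concrete, here's a rough sketch of the two regimes as I understand them (my own toy code, not the authors'; all names are made up):

    # Toy illustration of the two conditioning regimes (my sketch, not the paper's code).

    def few_shot_predict(model, demo_pairs, x_query):
        # In-context: the (x, y) demonstration pairs are part of the model's input,
        # so nothing about the specific task needs to have been learned at training time.
        return model(demo_pairs, x_query)

    def task_embedding_predict(model, puzzle_embeddings, task_id, x_query):
        # What the HRM setup appears to do: a learned per-puzzle embedding stands in
        # for the demonstrations, and it only exists for puzzles seen during training.
        return model(puzzle_embeddings[task_id], x_query)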
It does sound like it. Someone on Twitter pointed out this passage from page 12 of the HRM paper (emphasis added):
For ARC-AGI challenge, we start with all input-output example pairs in the training and the evaluation sets. The dataset is augmented by applying translations, rotations, flips, and color permutations to the puzzles. Each task example is prepended with a learnable special token that represents the puzzle it belongs to. At test time, we proceed as follows for each test input in the evaluation set: (1) Generate and solve 1000 augmented variants and, for each, apply the inverse-augmentation transform to obtain a prediction. (2) Choose the two most popular predictions as the final outputs. All results are reported on the evaluation set.
Which sounds like they used the original training and evaluation sets for training, then generated a new set of "augmented variants" of the original puzzles for the actual evaluation. So when the model is evaluated, it knows that "this is a variant of puzzle X, which I've seen before".
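Concretely, I read the quoted test-time procedure as something like this (my own sketch with made-up helper names; only the 8 dihedral augmentations are shown, not the translations or color permutations):

    from collections import Counter
    import numpy as np

    def dihedral_augmentations():
        # The 8 grid symmetries, each paired with its inverse transform.
        for k in range(4):
            yield ((lambda g, k=k: np.rot90(g, k)),
                   (lambda g, k=k: np.rot90(g, -k)))
            yield ((lambda g, k=k: np.fliplr(np.rot90(g, k))),
                   (lambda g, k=k: np.rot90(np.fliplr(g), -k)))

    def predict_two(solve, puzzle_id, test_input):
        # `solve` is a stand-in for the trained model conditioned on the learned puzzle token.
        votes = Counter()
        for aug, inv in dihedral_augmentations():
            pred = inv(solve(puzzle_id, aug(test_input)))             # undo the augmentation
            votes[tuple(map(tuple, np.asarray(pred).tolist()))] += 1  # vote on the de-augmented grid
        return [np.array(g) for g, _ in votes.most_common(2)]         # two most popular answers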
EDIT: Though, also according to somebody on Twitter, it sounds like this might just be the ARC-AGI materials being confusingly named?
No foul play here. The ARC-AGI dataset is confusing because it has 2 different kinds of “train” sets. There are train set puzzles and train input-output examples within each puzzle. They trained on only the train examples of the train and validation puzzle sets, which is fair. [...]
It’s “all input-output example pairs” not “all input-output pairs”. The examples are the training data in ARC-AGI.
They trained on only the train examples of the train and validation puzzle sets, which is fair.
Yes, I agree; I wasn't implying that it's foul play. It's just less impressive than I initially thought, because:
It’s finetuning on task examples and not in-context few-shot learning
They finetune on the harder evaluation set and not only on the easier train set, so they don't demonstrate generalization across the easy->hard distribution shift
The result I linked to was 20% on ARC-AGI-1, obtained by fitting the examples of only 1 evaluation task at a time with an MLP-type network, vs. the 40% result in the paper, which uses 1000 tasks. These numbers are not directly comparable, because they did a fair bit of custom architectural engineering to reach 20%, but it really put the 40% in perspective for me.
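For anyone else confused by the two kinds of "train" in the quoted tweet: each ARC-AGI puzzle is a single JSON file that carries its own train/test split, separate from the training-vs-evaluation split between puzzle files. A toy sketch of the layout (grids are made up):

    # Layout of one ARC-AGI task file (toy grids; real grids are up to 30x30 with colors 0-9).
    task = {
        "train": [  # per-puzzle demonstration pairs: the "examples" HRM reportedly trains on,
            {"input": [[1, 0], [0, 2]], "output": [[0, 1], [2, 0]]},  # for both the train and
            {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},            # evaluation puzzle sets
        ],
        "test": [   # the held-out query pair(s) within the same puzzle
            {"input": [[5, 0], [0, 5]], "output": [[0, 5], [5, 0]]},
        ],
    }
    # "Train puzzles" vs "evaluation puzzles" is the other split: which file a task lives in,
    # not which pairs sit inside the file.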