They trained on only the train examples of the train and validation puzzle sets, which is fair.
Yes, I agree; I wasn't implying that's foul play. I just found it less impressive than I initially thought because:
It's finetuning on task examples rather than in-context few-shot learning (see the sketch after this list)
They finetune on the harder evaluation set, not only on the easier train set, so they don't demonstrate generalization across the easy->hard distribution shift
The result I linked to reached 20% on ARC-AGI-1 by fitting examples from only 1 evaluation task at a time with an MLP-type network, versus the 40% result in the paper, which used 1000 tasks. The numbers aren't directly comparable, since they did a fair bit of custom architectural engineering to reach 20%, but it really put the 40% in perspective for me.
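To make the distinction concrete, here is a minimal sketch of what "fitting examples for one evaluation task" looks like: a small network is trained via gradient descent on a single task's demonstration pairs and then applied to that task's test input, as opposed to in-context few-shot learning, where the demonstrations go into the prompt of a frozen model and no weights are updated. Everything here (the fixed 5x5 grids, the toy recolor rule, the layer sizes) is an illustrative assumption on my part, not the architecture from either result:

```python
# Minimal sketch of per-task fitting: train a small MLP on one ARC task's
# demonstration pairs, then predict that task's test output. Grid size,
# the toy task, and the layer widths are all illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_COLORS = 10   # ARC grids use 10 colors (0-9)
H = W = 5         # assumed fixed grid size, to keep the sketch simple
CELLS = H * W

def encode(grid):
    """Flatten an HxW integer grid into a one-hot float vector."""
    return F.one_hot(grid.flatten(), NUM_COLORS).float().flatten()

# Toy stand-in for one task's demonstration pairs: "recolor 1 -> 2".
inputs = [torch.randint(0, 2, (H, W)) for _ in range(4)]
train_pairs = [(x, torch.where(x == 1, torch.tensor(2), x)) for x in inputs]

model = nn.Sequential(
    nn.Linear(CELLS * NUM_COLORS, 256),
    nn.ReLU(),
    nn.Linear(256, CELLS * NUM_COLORS),  # per-cell color logits
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Overfit the handful of demonstrations for this one task.
for step in range(500):
    for x, y in train_pairs:
        logits = model(encode(x)).view(CELLS, NUM_COLORS)
        loss = loss_fn(logits, y.flatten())
        opt.zero_grad()
        loss.backward()
        opt.step()

# Apply the task-specific model to the task's test input.
test_input = torch.randint(0, 2, (H, W))
pred = model(encode(test_input)).view(CELLS, NUM_COLORS).argmax(-1).view(H, W)
print(pred)
```

The actual 20% result involved a lot more architectural work than this; the sketch is only meant to show the training setup, i.e. that the "learning" happens in the weights, per task, rather than in context.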