Why in the world would you use ARC-AGI to measure coding performance? It’s really a pattern-matching task, not a coding task. Also more here:
ARC-AGI probably isn’t a good benchmark for evaluating progress towards TAI: substantial “elicitation” effort could massively improve performance on ARC-AGI in a way that might not transfer to more important and realistic tasks. I am more excited about benchmarks that directly test the ability of AIs to take the role of research scientists and engineers, for example those that METR is developing.