Anthropic says they’re both state-of-the-art for coding
Is it a response to Codex by OpenAI? I can’t wait to see the new models evaluated by ARC-AGI. However, the free version lacks the ability to run the code it generates, which might cost the free version some performance.
Why in the world would you use ARC-AGI to measure coding performance? It’s really a pattern-matching task, not a coding task. Also more here:
ARC-AGI probably isn’t a good benchmark for evaluating progress towards TAI: substantial “elicitation” effort could massively improve performance on ARC-AGI in a way that might not transfer to more important and realistic tasks. I am more excited about benchmarks that directly test the ability of AIs to take the role of research scientists and engineers, for example those that METR is developing.