evhub comments on Claude 4

evhub 24 May 2025 2:26 UTC
7 points
Why in the world would you use ARC-AGI to measure coding performance? It’s really a pattern-matching task, not a coding task. There is also more on this here:

ARC-AGI probably isn’t a good benchmark for evaluating progress towards TAI: substantial “elicitation” effort could massively improve performance on ARC-AGI in a way that might not transfer to more important and realistic tasks. I am more excited about benchmarks that directly test the ability of AIs to take the role of research scientists and engineers, for example those that METR is developing.