Anthropic says they’re both state-of-the-art for coding
Is it a response to Codex by OpenAI? I can’t wait to see the new models evaluated by ARC-AGI. However, the free version lacks the ability to run the code it generates, which might cost the free version some performance.
Why in the world would you use ARC-AGI to measure coding performance? It’s really a pattern-matching task, not a coding task. Also more here:
ARC-AGI probably isn’t a good benchmark for evaluating progress towards TAI: substantial “elicitation” effort could massively improve performance on ARC-AGI in a way that might not transfer to more important and realistic tasks. I am more excited about benchmarks that directly test the ability of AIs to take the role of research scientists and engineers, for example those that METR is developing.