Adam Karvonen comments on METR Research Update: Algorithmic vs. Holistic Evaluation

Adam Karvonen 14 Aug 2025 18:49 UTC
9 points
I would be interested to see this experiment replicated with a different model, like Claude Opus 4.1 or GPT-5. Claude 3.7 Sonnet is perhaps the most notorious LLM for ignoring user intent and for its propensity to reward hack when writing code.