Details

A small number of members of technical staff spent over two hours each deliberately evaluating Claude Sonnet 4.5's ability to do their own AI R&D tasks. They took notes and kept transcripts on strengths and weaknesses, then generated productivity-uplift estimates. They were asked directly whether this model could completely automate a junior ML researcher. …
Claude Sonnet 4.5 results

When asked about their experience using early snapshots of Claude Sonnet 4.5 in the weeks leading up to deployment, 0/7 researchers believed that the model could completely automate the work of a junior ML researcher. One participant estimated an overall productivity boost of ~100% and indicated that their workflow was now mainly focused on managing multiple agents. The other researchers' acceleration estimates were 15%, 20%, 20%, 30%, and 40%, with one providing qualitative-only feedback. Four of the 7 participants indicated that most of the productivity boost was attributable to Claude Code, not to the capabilities delta between Claude Opus 4.1 and (early) Claude Sonnet 4.5.
The above is from the Sonnet 4.5 system card; it felt relevant.
(Found via Ryan Greenblatt's recent post.)