I don’t know why it’s reporting the current SWE-Bench SOTA as 60%. The official leaderboard has Claude 4 Sonnet at 65%. (Epoch has Claude 4 Opus at 62%, but the original post linked the official leaderboard.) Of course, either way it’s far below 85%, unless you allow best-of-N sampling via a scoring model, which brings Claude 4 up to 80%.
Updated with a note.