I also did some normal searches to verify, but I think this LLM summary is pretty good. I misread one of the LLM’s tables, so the summary looks unreliable enough that I deleted it, but normal searches still support my first comment, and Erich posts some relevant links.
I don’t know why it’s reporting the current SWE-Bench SOTA as 60%. The official leaderboard has Claude 4 Sonnet at 65%. (Epoch has Claude 4 Opus at 62%, but the original post linked the official leaderboard.) Of course, either way it’s far below 85%, unless you allow for best-of-N sampling via a scoring model, which brings Claude 4 up to 80%.
Updated with a note.