The SWE-bench scores are already well below the trend from AI 2027. The trend called for 85% by the end of the month; we’re at 75%. (And SOTA was ~64% when they released AI 2027.)
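For concreteness, a quick back-of-envelope using only the figures above (the linear framing is my simplification, not AI 2027’s actual forecast curve):

```python
# Illustrative only: the trend implied roughly 64% -> 85% over this
# window, and we actually got roughly 64% -> 75%.
predicted_gain = 85 - 64   # percentage points the trend called for
actual_gain = 75 - 64      # percentage points actually realized
print(f"{actual_gain / predicted_gain:.0%} of the predicted gain")  # ~52%
```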
Gemini 3 should drop by the end of the month; we might hit that.
+25% on SWE-bench relative to Gemini 2.5? Quadrupling Gemini 2.5’s METR task length?
I suppose it’s a possibility, albeit a remote one.
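For scale, a minimal sketch of what quadrupling would take, assuming METR’s reported ~7-month doubling time for task horizons (the doubling time is an assumption here; recent models may be on a faster curve):

```python
import math

# Sanity check on the "quadrupling" claim. Assumption: task horizons
# double roughly every 7 months, per METR's published trend estimate.
DOUBLING_TIME_MONTHS = 7

def months_to_multiply(factor: float, doubling_time: float = DOUBLING_TIME_MONTHS) -> float:
    """Months of trend progress needed to grow the task horizon by `factor`."""
    return doubling_time * math.log2(factor)

print(months_to_multiply(4))  # 4x = 2 doublings -> 14.0 months
```

Two doublings in a single release would mean compressing roughly 14 months of trend into one model generation.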
It seems Gemini was ahead of OpenAI on the IMO gold. Its output was more polished, so presumably they achieved a gold-worthy model earlier. I thus expect Gemini’s SWE-bench score to be at least ahead of OpenAI’s 75%.
I don’t believe there’s a strong correlation between mathematical ability and agentic coding tasks (as opposed to competition coding tasks, where a stronger correlation exists).
Gemini 2.5 Pro was already well ahead of o3 on the IMO, but had worse SWE-bench/METR scores.
Claude is relatively bad at math but has hovered around SOTA on agentic coding.