This 25-minute 80% time horizon number seems like strong evidence against the superexponential model from AI 2027. On this graph the superexponential line shows 4h at the end of 2025. I expect GPT-5 to be the biggest model release of the year, and I don’t see how we would get a model with 8x GPT-5’s time horizon this year.
The SWE-bench scores are already well below the AI 2027 trend, which called for 85% by the end of this month; we’re at 75%. (And SOTA was ~64% when AI 2027 was released.)
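As a rough sanity check on the time-horizon gap (a sketch, not anything from the thread above: the doubling-time figures are assumptions loosely based on METR’s reported ~7-month historical doubling time and the faster recent trend):

```python
import math

# Assumed figures: GPT-5's 80% time horizon is ~25 minutes, and the
# superexponential line calls for ~4 hours at end of 2025 (both taken
# from the discussion above).
current_h = 25 / 60      # hours
target_h = 4.0           # hours

gap = target_h / current_h         # ~9.6x
doublings = math.log2(gap)         # ~3.26 doublings needed

# Doubling times below are assumptions: ~7 months is METR's reported
# long-run trend; ~4 months is the reported faster 2024-2025 trend.
for doubling_months in (7, 4):
    months = doublings * doubling_months
    print(f"{doubling_months}-month doubling: ~{months:.0f} months to reach 4h")
```

Even on the faster assumed trend, closing a ~9.6x gap takes roughly a year of doublings, versus the few months left in 2025.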
Gemini 3 should drop by the end of the month; we might hit that.
+25% on SWE-bench relative to Gemini 2.5? Quadrupling Gemini 2.5’s METR task length?
I suppose it’s a possibility, albeit a remote one.
It seems Gemini was ahead of OpenAI on the IMO gold. Its output was more polished, so presumably they achieved a gold-worthy model earlier. I therefore expect Gemini’s SWE-bench score to be at least ahead of OpenAI’s 75%.
I don’t believe there’s a strong correlation between mathematical ability and agentic coding tasks (as opposed to competition coding, where a stronger correlation exists).
Gemini 2.5 Pro was already well ahead of o3 on the IMO, but had worse SWE-bench/METR scores.
Claude is relatively bad at math but has hovered around SOTA on agentic coding.
I overall agree that things seem to be going slower than AI 2027 (and my median was longer when it came out).
However, as mentioned in the caption, the green curve is a simplified version of our original timelines model. Apologies for that; still, I think it’s reasonable to judge us based on it.
FWIW, though, the central superexponential Mar 2027 trajectory from our original model certainly is not strongly contradicted by GPT-5, whether or not an AI R&D speedup interpolation issue is fixed.
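For intuition, here is a toy sketch (mine, not the AI 2027 model itself) of what “superexponential” typically means in this context: each successive doubling of the time horizon takes a fixed fraction less calendar time than the previous one. All parameter values below are illustrative assumptions.

```python
def horizon_at(months, h0=0.4, first_doubling=6.0, shrink=0.9):
    """Time horizon in hours after `months`, where doubling k takes
    first_doubling * shrink**k months. Only valid for months strictly
    below the blowup point first_doubling / (1 - shrink)."""
    h, t, k = h0, 0.0, 0
    while True:
        step = first_doubling * shrink ** k  # duration of doubling k
        if t + step > months:
            # fractional progress through the current doubling
            return h * 2 ** ((months - t) / step)
        t, h, k = t + step, h * 2, k + 1

# Because the doubling times form a geometric series, the horizon
# diverges in finite time (here at 6 / (1 - 0.9) = 60 months), which is
# what lets such trajectories hit a superhuman-coder threshold on a
# fixed date rather than asymptotically.
print(horizon_at(0), horizon_at(12), horizon_at(24))
```

The fit question in the thread is essentially whether observed points like GPT-5’s sit close to a curve of this shape or to a plain exponential.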
The original model, filtered for superexponential (pre-AI-R&D-automation) trajectories that reach superhuman coder in 2027:
With AI R&D speedup bug fixed, also filtered for superexponential pre-AI-R&D-automation (backcast looks much better, GPT-5 prediction slightly worse):
Either way, we’re now working on a much-improved model, which will likely include an interactive web app. That should be an improvement over this static graph: for example, you’ll be able to try various parameter settings and see what time-horizon trajectories they generate and how consistent those are with future data points.
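A hypothetical sketch of the kind of consistency score such an app might compute: fit error, in log space, between a parameterized trajectory and observed (date, horizon) points. The function names and data points below are made-up illustrations, not real METR observations or anything from the actual model.

```python
import math

def exp_trajectory(months, h0, doubling_months):
    """Plain-exponential time horizon (hours) after `months`."""
    return h0 * 2 ** (months / doubling_months)

# (months since a reference date, observed 80% horizon in hours) --
# placeholder values for illustration only.
observations = [(0, 0.3), (7, 0.6), (14, 1.3)]

def log2_rmse(params, data):
    """RMS error, measured in doublings, between model and observations."""
    errs = [(math.log2(exp_trajectory(t, *params)) - math.log2(h)) ** 2
            for t, h in data]
    return (sum(errs) / len(errs)) ** 0.5

# An interactive app would let you slide (h0, doubling_months) -- or the
# extra parameters of a superexponential curve -- and watch this score
# update as new data points arrive.
print(log2_rmse((0.3, 7.0), observations))
```

Working in log space means the score weights a miss at 25 minutes the same as a proportionally equal miss at 4 hours, which matches how the doubling-time framing treats the data.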
Note also that the above trajectories are from the original model, not the May update model, which we unfortunately aren’t taking the time to recreate for various reasons. We think it would likely look a little worse in terms of the GPT-5 fit, though that might depend on how you filter for which trajectories count as superexponential.
Registering that I don’t expect GPT-5 to be “the biggest model release of the year,” for various reasons. I would guess (based on the cost and speed) that the model is GPT-4.1-sized. Conditional on this, the total training compute is likely to be below the state of the art.
How did you determine the cost and speed of it, given that there is no unified model that we have access to, just some router between models? Unless I’m just misunderstanding something about what GPT-5 even is.
The router is only on ChatGPT, not the API, I believe. And it switches between two models of the same size and cost (GPT-5 with thinking and GPT-5 without thinking).