METR has finally tested Gemini 2.5 Pro (June Preview) and found its 50% success task horizon is only 39 minutes, far worse than o3 or Opus 4 which are at 90 and 80 minutes respectively. Probably shouldn’t be a gigantic update given 2.5 Pro never scored amazingly at SWE-Bench, but still worse than I expected given how good the model is otherwise.
This is interesting. Gemini 2.5 Pro has recently become my favorite model, especially over Opus (this from a long-time Claude user). I would not be surprised if I like it so much because of its lower task horizon, since it's the one model I trust not to be uselessly sycophantic right now.
Agentic coding abilities are different from general chatbot abilities. Gemini is IMO the best chatbot there is (just in terms of understanding context well if you wish to analyze text, learn things, etc.). Claude, on the other hand, is dead last among the big 3 (a steep change from a year ago), and my guess is Anthropic isn't trying much anymore (focusing on agentic coding instead).
Hm, I notably would not trust Claude to agentically code for me either. I went from heavily using Claude Code to occasionally asking Gemini questions, and I think that has been a big improvement.
Given METR's other work, the obvious hypothesis is that Claude Code is mostly just better at manipulating me into thinking it can easily do what I want.
What’s the correlation between task horizon and useless sycophancy?
I don’t know, subjectively it seems large, and it seems plausible they could be related.
I don't see that producing much of an update. Its SWE-bench score, as you note, was only 59.6%, which naively maps to a METR horizon of ~50 minutes.
I still think it’s comforting to observe that the task lengths are not increasing as quickly as feared.
This is as I predicted so far but we’ll see about GPT-5.