james oofou comments on james oofou’s Shortform

james oofou 21 Jul 2025 18:45 UTC
5 points
3
I also forward-predicted it based specifically on METR research.
I think many thresholds in machine learning and mathematics can be analysed this way. The main barriers are (a) modeling hyperexponentiality as you get further out in time and (b) modeling things like RLVR hitting the compute ceiling, the new RL methods being hinted at by OpenAI, etc.
Let’s take replacement of AI researchers as an example. AI research tasks in frontier labs rarely extend beyond 200 working hours. Let’s ask for a 50% success rate. Let’s assume 3-month doubling times to take into account hyperexponential progress. The current time horizon is 180 minutes. We need 12,000 minutes (200 hours × 60). That takes about 6 doublings (180 → 360 → 720 → 1,440 → 2,880 → 5,760 → 11,520, just short of 12,000). 6 × 3 = 18 months, which puts us in early 2027. Therefore, AI-researcher-level AIs by early 2027 seems not unlikely.
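The arithmetic above can be checked with a few lines of Python. This is only a sketch of the naive extrapolation, assuming a constant doubling time; the function names are made up for illustration, and the inputs (180 minutes, 200 hours, 3 months) are the ones stated above.

```python
import math


def doublings_needed(current_minutes: float, target_minutes: float) -> float:
    """Number of doublings to go from the current horizon to the target."""
    return math.log2(target_minutes / current_minutes)


def months_to_target(current_minutes: float, target_minutes: float,
                     doubling_months: float) -> float:
    """Months until the target horizon, assuming a constant doubling time."""
    return doublings_needed(current_minutes, target_minutes) * doubling_months


current = 180          # current 50% time horizon, in minutes
target = 200 * 60      # 200 working hours, in minutes
n = doublings_needed(current, target)
months = months_to_target(current, target, doubling_months=3)

print(f"{n:.2f} doublings, {months:.1f} months")  # 6.06 doublings, 18.2 months
```

The exact figure is slightly more than 6 doublings, so rounding to 6 shifts the estimate by under a month.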
- faul_sname 21 Jul 2025 22:16 UTC
  2 points
  0
  Parent
  What are your predictions for OSWorld on Dec 31 of this year? Current SOTA is 45%. Of the 73 example tasks shown on the OSWorld data explorer, the 45th percentile task takes ~27 actions to complete, and we’ve got about two 3-month periods between now and EOY, so by a naive extrapolation we’d expect tasks up to about 100 steps to be solved by EOY. That’d be about 80%.
  
  That sounds quite high to me—and to Manifold as well, it seems. Do you endorse that prediction, or is there additional nuance to your prediction method that I’m not taking into account?
  - ryan_greenblatt 21 Jul 2025 22:33 UTC
    2 points
    0
    Parent
    The most naive method would be to use the extrapolation based on the trend on OSWorld here: https://www.lesswrong.com/posts/6KcP7tEe5hgvHbrSF/metr-how-does-time-horizon-vary-across-domains. My guess is that this yields sane results.
    
    The main delta is probably a slower doubling time.
  - james oofou 21 Jul 2025 22:58 UTC
    1 point
    0
    Parent
    OSWorld isn’t in machine learning or mathematics, so we don’t have much data to go on.
    But what we do have suggests a ~4 month doubling time, from which we arrive at an ~8 minute 50% time horizon by EOY, given:
    > # Difficulty Split: Easy (<60s): 28.72%, Medium (60-180s): 40.11%, Hard (>180s): 30.17%
    This does suggest greater than 80% by EOY, but this depends on model release cadence etc.
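To make the difference between the doubling times discussed in this thread concrete, here is a rough sketch comparing growth over the rest of the year. It assumes about 6 months to EOY (taken from "about two 3-month periods") and steady exponential growth; `growth_factor` is a hypothetical helper, and applying the multiplier to the 27-action figure mirrors the naive extrapolation above rather than any official OSWorld metric.

```python
def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Multiplier on the horizon after elapsed_months of steady doubling."""
    return 2 ** (elapsed_months / doubling_months)


MONTHS_TO_EOY = 6  # assumption: "about two 3-month periods"

fast = growth_factor(MONTHS_TO_EOY, 3)   # 3-month doubling time
slow = growth_factor(MONTHS_TO_EOY, 4)   # ~4-month doubling time
steps_fast = 27 * fast                   # 27-action task scaled at the faster rate

print(f"3-month doubling: x{fast:.2f}, ~{steps_fast:.0f} actions")
print(f"4-month doubling: x{slow:.2f}")
```

Under these assumptions the 3-month case reproduces the "about 100 steps" figure (27 × 4 = 108), while a 4-month doubling time gives a multiplier of roughly 2.8 instead of 4.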