o3 and o4-mini solve more than zero of the >1hr tasks that claude 3.7 got ~zero on, including some >4hr tasks that no previous models we tested have done well on, so it’s not that models hit a wall at 1-4 hours. My guess is that the tasks they have been trained on are just more similar to HCAST tasks than RE-Bench tasks, though there are other possibilities.
Okay, that’s a lot more convincing. Was it available publicly? I seem to have missed it. Surprisingly high success probability on a 16 hour task—does this just mean it made some progress on the 16 hour task?
o3 and o4-mini solve more than zero of the >1hr tasks that claude 3.7 got ~zero on, including some >4hr tasks that no previous models we tested have done well on, so it’s not that models hit a wall at 1-4 hours. My guess is that the tasks they have been trained on are just more similar to HCAST tasks than RE-Bench tasks, though there are other possibilities.
Okay, that’s a lot more convincing. Was it available publicly? I seem to have missed it. Surprisingly high success probability on a 16 hour task—does this just mean it made some progress on the 16 hour task?
Pokémon fell today too, we might be cooked.