I don’t see that producing much of an update. Its SWE-bench score, as you note, was only 59.6%, which naively maps to ~50 minutes on METR’s task-length metric.
I still think it’s comforting to observe that the task lengths are not increasing as quickly as feared.
This is as I predicted so far, but we’ll see about GPT-5.