METR task lengths are based on how long a human would take to complete the task, not how long the model takes, and particularly not how long the model can spend working productively on it. There exist very large tasks where an LLM could accomplish large parts of the task, parts that take the LLM dozens of hours and would take a human hundreds of hours, yet be unable to complete the task as a whole. For example, consider porting a complex Flask application to Rust: the standard MVC parts would probably go pretty smoothly and could easily take 30 hours of wall-clock time, but certain nontrivial business logic, and especially anything involving the migration of weirdly serialized data, is likely to remain unfinished.