Any more details on Pokémon performance?
I kind of expected an improvement after hearing Anthropic’s (unverified) claims that it could work on SWE tasks for 30 hours, etc. Is zero-shotting games just not as closely connected to performing other long-term tasks as I thought? If it can’t beat Pokémon, it’s hard for me to believe it can have a very long (METR) task length score. It seems that multi-hour projects start to require some serious planning and online learning (and maybe even perception eventually, though maybe perception is the big difference?).
METR task lengths are based on how long it would take a human to complete the task, not how long the model takes to complete it, and particularly not how long the model can spend productively working on it. There exist very large tasks where the LLM could accomplish large parts, parts that take the LLM dozens of hours and would take a human hundreds of hours, yet still be unable to finish the entire task. For example, consider porting a complex Flask application to Rust: the standard MVC parts would probably go pretty smoothly and could easily take 30 hours of wall-clock time, but certain nontrivial business logic, and especially anything involving the migration of weirdly serialized data, is likely to remain unfinished.