I think the benchmark is intended to measure performance on an even narrower proxy than this—roughly, the sort of tasks involved in ordinary, everyday software engineering.
Note that METR has since published a follow-up that attempts to broaden the class of activities measured. Its results are suggestive that the qualitative phenomenon of an exponentially increasing time horizon is somewhat robust, although the growth rate varies between domains.