I think the diplomatic tone here is reasonable, and agree that METR is doing really good work. But this post updated me significantly away from taking the “task time horizon” metric at face value. If I saw this many degrees of freedom in a study from another domain, I’d wait for more studies before I held a real opinion.
That being said, I get a similar sinking feeling whenever I read ~any AI forecasting study in detail, and I don't know how I'd do better myself. There are just so many judgment calls at every level, and no great reason to presume they all cancel out.