Well, the REBench tasks don’t all have the same length, at least in the data METR is using. They’re all tightly clustered around 8 hours, though, so I take your point that the correlation isn’t very meaningful.