My guess is it’s <1 hour per task assuming just Copilot access, and much less if you’re allowed to use e.g. o1 + Cursor in agent mode. That being said, I think you’d want to limit humans to comparable amounts of compute for the numbers to be comparable, which seems a bit trickier to arrange.
I guess I was thinking that the human baseline should be without LLMs, because otherwise I could just forward the prompt to the best LLM, see what it did, and perhaps improve upon it, which would put human level always at or above the best LLM.
Then again, this is not how humans typically work now, so it’s unclear what a «fair» comparison is. I guess it depends on what the human baseline is supposed to represent, and you have probably thought a lot about that question at METR.
Is there a reason you can’t do one of the existing tasks, just to get a sense of the difficulty?
I could, but it would not really be a fair comparison, since I have seen many of the LLMs’ solutions and have seen what works.
Doing a fresh task I made myself would not be totally fair either, since I would know more about the data than they do, but it would definitely be closer to fair.