I think METR is aiming for expert level tasks, but I think their current task set is closer in difficulty to GAIA and VisualWebArena than what I would consider human expert level difficulty. It’s tricky to decide though, since LLMs circa 2024 seem really good at some stuff that is quite hard to humans, and bad at a set of stuff easy to humans. If the stuff they are currently bad at gets brought up to human level, without a decrease in skill at the stuff LLMs are above-human at, the result would be a system well into the superhuman range. So where we draw the line for human level necessarily involves a tricky value-weighting problem of the various skills involved.
I think METR is aiming for expert level tasks, but I think their current task set is closer in difficulty to GAIA and VisualWebArena than what I would consider human expert level difficulty. It’s tricky to decide though, since LLMs circa 2024 seem really good at some stuff that is quite hard to humans, and bad at a set of stuff easy to humans. If the stuff they are currently bad at gets brought up to human level, without a decrease in skill at the stuff LLMs are above-human at, the result would be a system well into the superhuman range. So where we draw the line for human level necessarily involves a tricky value-weighting problem of the various skills involved.