Time horizon is currently defined as the human task length it takes to get a 50% success rate, fit with a logistic curve that goes from 0 to 100%. But we just did that because it was a good fit to the data. If METR finds evidence of label noise or tasks that are inherently hard to complete with more than human reliability, we can just fit a different logistic curve or switch methodologies (likely at the same time we upgrade our benchmark) so that time horizon more accurately reflects how much intellectual labor the AI can replace before needing human intervention.
Time horizon is currently defined as the human task length it takes to get a 50% success rate, fit with a logistic curve that goes from 0 to 100%. But we just did that because it was a good fit to the data. If METR finds evidence of label noise or tasks that are inherently hard to complete with more than human reliability, we can just fit a different logistic curve or switch methodologies (likely at the same time we upgrade our benchmark) so that time horizon more accurately reflects how much intellectual labor the AI can replace before needing human intervention.