What is uniquely interesting/valuable about METR time horizons is that the score is meaningful and interpretable. "Can do software tasks that would take an expert 2h, with 50% success probability" is a very specific claim. "Has score y on benchmark x" is only valuable for comparisons; it does not tell you what is going to happen when models reach score z.