I think there are a variety of explanations consistent with there being 2x uplift already:
- The METR benchmark just isn’t precise enough.
- One-time gains from RLVR that caused a steeper slope in 2024-2025 have petered out, but they’ve been replaced by uplift.
- Models have reached some time horizon threshold where they’re increasingly useful.
- In the past, problems like reward hacking or poor generalization have limited real-world uplift, but these are solved enough to get 2x uplift.
My median guess would be something lower than 2x, but we just don’t have enough data.