Lower R^2 is a natural consequence of the x-axis spanning a narrower range, in an experimental setup with both signal and noise. In the extreme, imagine that we only benchmarked models on 47-minute tasks. AIs will naturally do better at some 47-minute tasks than others. Since R^2 is the share of variance in task success explained by the correlation with x, and x is always 47 minutes, all the variance in success rate will be due to other factors and the R^2 will be zero.
In fact, this is why we constructed SWAA: to increase the R^2 in worlds where the trend is real. In a world where models were not actually better at tasks with shorter human completion times, adding more data would lower the R^2 rather than raise it, so it's a very scientifically normal thing to do.
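The range-restriction effect above is easy to demonstrate numerically. Here is a minimal sketch (not METR's actual analysis; the slope, noise level, and time ranges are made-up illustration values): the same linear signal and the same noise yield a high R^2 when x spans a wide range of task lengths, and an R^2 near zero when every task is roughly 47 minutes.

```python
import numpy as np

rng = np.random.default_rng(0)

def r_squared(x, y):
    # R^2 of a simple linear fit equals the squared Pearson correlation
    r = np.corrcoef(x, y)[0, 1]
    return r ** 2

n = 2000
noise = rng.normal(0, 1, n)

# Wide range of log task times (hypothetical units, ~1 min to hours):
# the x-related signal contributes a lot of the variance in y
x_wide = rng.uniform(0, 8, n)
y_wide = -0.5 * x_wide + noise

# Narrow range: identical slope and noise, but every task is ~47 minutes,
# so x contributes almost none of the variance in y
x_narrow = rng.uniform(3.9, 4.1, n)
y_narrow = -0.5 * x_narrow + noise

print(round(r_squared(x_wide, y_wide), 3))    # substantial
print(round(r_squared(x_narrow, y_narrow), 3))  # near zero
```

The success-rate "signal" is identical in both conditions; only the spread of x changes, which is exactly why benchmarking solely on same-length tasks drives R^2 toward zero.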