I’m confused how “they do directly hill climb on high profile metrics” is compatible with any of this, since that seems to imply that they do in fact train on benchmarks, which is the exact thing you just said was false?
I assume it means they use the benchmark as the test set, not the training set.
I assume it means they use the benchmark as the test set, not the training set.
Yes this.