Also, I suspect there is some astronomically high k at which monkeys at a keyboard (i.e. “output random tokens”) will outperform base models on some tasks by the pass@k metric.
It would be an extreme bias-variance tradeoff, yes.
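
To put rough numbers on it, here is a minimal sketch that assumes samples are independent, so pass@k = 1 - (1 - p)^k for a per-sample success probability p. The vocabulary size, answer length, and the base model's per-sample probability below are made-up illustrative values, not measurements of any real model, and the helper names are hypothetical.

```python
import math


def pass_at_k(p: float, k: float) -> float:
    """Chance that at least one of k independent samples is correct."""
    if p <= 0.0:
        return 0.0  # a sampler that never emits the answer fails at every k
    if p >= 1.0:
        return 1.0
    # 1 - (1 - p)**k, written with log1p/expm1 so tiny p and huge k stay accurate
    return -math.expm1(k * math.log1p(-p))


def uniform_success_prob(vocab_size: int, answer_len: int) -> float:
    """Chance that uniformly random tokens spell out one specific answer."""
    return float(vocab_size) ** -answer_len


# Assumed, illustrative numbers only.
p_monkey = uniform_success_prob(vocab_size=50_000, answer_len=5)
p_model = 1e-30  # hypothetical task where the base model is confidently wrong

for k in (1e3, 1e20, 1e25):
    print(f"k={k:.0e}  monkey={pass_at_k(p_monkey, k):.3g}  model={pass_at_k(p_model, k):.3g}")
```

The monkey only pulls ahead on tasks where the model puts less mass on the right answer than a uniform distribution would, which is the "extreme bias" side of the tradeoff. If the model's sampler truncates the correct answer away entirely, p is 0 and no k rescues it.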